
Ip S, Kitsios GD, Chung M, et al. A Process for Robust and Transparent Rating of Study Quality: Phase 1 [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2011 Nov.


Results

Exercise 1

In this exercise, we found that the overall study quality ratings were discordant among the reviewers 64 percent of the time. Of the 33 assigned quality ratings, 1 was an “A,” 21 were “B,” and 11 were “C.” For each randomized controlled trial (RCT), an item was recorded as in “agreement” when all reviewers had marked it consistently; items on which at least one reviewer disagreed were marked as in “disagreement.” The proportion of “disagreement” was then calculated for each item across the sample of 11 RCTs (that is, the number of studies in which item assessments disagreed divided by the total number of studies). The results of this first exercise are shown in Figure 1.
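
As an illustration of this calculation, the following minimal sketch computes the per-item proportion of disagreement across a set of RCTs. The item names, study identifiers, and responses are hypothetical, not the actual study data.

```python
# Minimal sketch of the per-item disagreement calculation described above.
# Item names, study identifiers, and responses are hypothetical.

# assessments[item][study_id] holds the three reviewers' responses for that item.
assessments = {
    "Definit_outcomes": {"study_1": ["Yes", "Yes", "Yes"], "study_2": ["Yes", "Yes", "Yes"]},
    "Adjust_confounders": {"study_1": ["Yes", "No", "N/A"], "study_2": ["Yes", "N/A", "N/A"]},
}

def disagreement_proportion(per_study_responses):
    """Share of studies in which at least one reviewer's response differs from the others."""
    n_studies = len(per_study_responses)
    n_disagree = sum(1 for responses in per_study_responses.values() if len(set(responses)) > 1)
    return n_disagree / n_studies

for item, per_study in assessments.items():
    print(f"{item}: {disagreement_proportion(per_study):.0%} disagreement")
```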

Figure 1. Proportions of disagreements for each of the quality items included in the quality item checklist and for overall study quality ratings in Exercise 1.

Proportions of disagreements ranged from 0 percent (unanimous agreement among the reviewers in all studies) to 100 percent. As shown in Figure 1, the items “definition of outcomes,” “description of interventions,” and “blinding of patients” had 0 percent disagreement, indicating that the interpretation of these items was straightforward and possibly less prone to subjectivity. In contrast, the item “adjustment for confounders” had 100 percent disagreement, which might result from consistently different interpretations by the reviewers rather than from uncertainty within the underlying studies.

Following this analysis, we compared the number of items with discrepancies in studies where the reviewers agreed on the overall quality rating (e.g., all reviewers assigning a “B”) versus studies in which the overall quality rating was also in disagreement (e.g., two reviewers assigning a “B” and one reviewer a “C”) (Figure 2).

Figure 2. Number of items with discrepancies between studies with agreement versus studies with disagreement in the overall quality rating in Exercise 1.

No obvious difference was observed in the numbers of discrepancies, providing a first indication that the overall assigned rating may not be a direct reflection of the specific quality items. However, we were unable to identify the specific items that were the main determinants of the overall quality rating.

The narrative descriptions of the study quality ratings were also jointly reviewed and evaluated as part of Exercise 1. We summarized the quality issues that were raised by the reviewers but were not captured by the quality item checklist in the following categories:

  1. Study design aspects: Power calculations, multiple testing, selective inclusion criteria, multiple subgroup analyses.
  2. Study execution aspects: Treatment changes during the trial, protocol changes during the trial, lack of uniform recording of outcomes in all patients, early termination, different types of interventions used in the same arm.
  3. Study reporting aspects: Unclear recruitment method (population before applying inclusion/exclusion criteria).
  4. Aspects specific to crossover RCTs: Inadequate washout, no statistical testing for treatment by period interactions.

Exercise 2

In exercise 2, we evaluated a second set of 11 RCTs under the guidance of the quality item definitions checklist (Table 3). The results of this second exercise of quality assessments are juxtaposed with the results of the first exercise in Figure 3. The quality ratings assigned to each study by all three reviewers are shown in Table 4. Overall study quality ratings were discordant in 55 percent of the cases. The proportion of disagreements for the “adjustment for confounders” item fell from 100 percent to 9 percent, confirming our initial hypothesis that the earlier discordance reflected similar interpretation but consistently different coding of the response by the reviewers. A similar reduction in the proportions of disagreements was observed for the items “ITT,” “Appropriate statistics,” and “Accounting for center effects.” Thus, following standardization with the item definitions, the variability of responses was reduced. However, certain items had similar proportions of disagreement in exercises 1 and 2, indicating that standardization with quality item definitions did not affect the variability of responses for these items; they included “selection bias,” “blinded observer,” and “clear reporting.” Additionally, certain items appeared to have higher proportions of disagreement in exercise 2 (“blinded patient,” “definition of outcomes,” and “eligibility criteria”). The increased proportion of disagreement was found to be context-specific in the case of the “blinded patient” item: the RCTs in exercise 2 compared fixed continuous positive airway pressure (CPAP) with auto-titrated CPAP, a comparison in which patients can be blinded to the applied airway pressures, whereas the exercise 1 trials compared CPAP with mandibular advancement devices, a comparison in which blinding was not possible because two entirely different devices were used.

Figure 3. Proportions of disagreements for each of the quality items included in the quality item checklist and for overall study quality ratings. Proportions are compared across exercises 1 and 2.

Table 4. Assigned quality ratings for all studies analyzed in exercises 1 and 2 by each reviewer.

Similar to exercise 1, there was no obvious difference in the numbers of discrepancies between studies with agreement in the overall quality rating versus studies with disagreement in the overall quality rating (Figure 4).

Figure 4. Number of items with discrepancies between studies with agreement versus studies with disagreement in the overall quality rating in exercise 2.

In a further analysis, we examined the proportion of studies satisfying each quality item, stratified by overall quality rating (“B” versus “C”) (Figure 5). We observed that “B” quality studies almost always had a dropout rate below 20 percent, whereas “C” quality studies had dropout rates below 20 percent in only 46 percent of cases. Furthermore, appropriate statistical analyses were never present in “C” quality studies, and “B” quality studies more commonly had blinded observers. Additionally, “C” quality studies were more often identified as having “selection bias” and as having “other issues” described in the narrative summaries.
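
A minimal sketch of this stratified tabulation, using hypothetical studies and only two of the checklist items, is shown below.

```python
# Sketch of the stratified analysis described above: for each quality item,
# the share of "B"-rated and "C"-rated studies that satisfied it.
# The studies, ratings, and item values are hypothetical.
studies = [
    {"rating": "B", "items": {"dropout_lt_20pct": True,  "blinded_observer": True}},
    {"rating": "B", "items": {"dropout_lt_20pct": True,  "blinded_observer": False}},
    {"rating": "C", "items": {"dropout_lt_20pct": False, "blinded_observer": False}},
]

def satisfied_proportion(studies, rating, item):
    """Proportion of studies with the given overall rating that satisfy the item."""
    subset = [s for s in studies if s["rating"] == rating]
    return sum(s["items"][item] for s in subset) / len(subset)

for item in ("dropout_lt_20pct", "blinded_observer"):
    for rating in ("B", "C"):
        print(f"{item} in '{rating}' studies: {satisfied_proportion(studies, rating, item):.0%}")
```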

Figure 5. Proportion of studies satisfying each quality item in studies with a “B” versus studies with a “C” overall quality rating.

Exercise 3

Based on the above analyses, we selected two items for modification from among those whose proportions diverged between “B” and “C” studies: “Dropout rate <20 percent” (95 percent vs. 46 percent) and “Blinded observers” (45 percent vs. 8 percent). To further explore the potential impact of modifications to these two items, in exercise 3 a research assistant created plain-text versions of the published manuscripts with any identifying information relating to the paper’s provenance (authorship, publication date, etc.) removed. The text or numerical data corresponding to either the “Dropout rate <20 percent” or the “Blinded observers” item were then modified with alternative values. Because all manuscripts had been transferred to text editing software and the reviewers were blinded to the changes, the modified versions could not be distinguished from the unmodified ones. The specific types and locations of these changes are detailed in Table 5. Each study was assigned a numerical identifier for tracking purposes. Reviewers then performed de novo, blinded quality assessments of these modified manuscripts.

Table 5. Changes introduced in the deidentified manuscripts.

If the anonymization of the manuscripts had no impact on the assessment of individual items and all other factors affecting a reviewer’s assessment were held stable, perfect intra-rater agreement would have been expected for all items except the two artificially modified ones. However, some degree of residual intra-rater variation is to be expected because the same study was assessed at different time points; we did not estimate this variability in exercise 3. Compared with the results of exercise 2, we would also expect similar proportions of inter-rater disagreement, particularly for the items that showed resistance to standardization with the definitions list.

After the numerical identifier code was broken, each reviewer’s assessments of all items (except the “Dropout rate <20 percent” and “Blinded observers” items) were compared between exercises 2 and 3 for each study. The results of these intra-rater comparisons for all three reviewers are shown in Figure 6.
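
The intra-rater comparison can be pictured with the small sketch below, which, for a single reviewer and hypothetical data, flags the items whose exercise 2 and exercise 3 responses differ for the same study.

```python
# Sketch of the intra-rater comparison: for one reviewer, flag items whose
# exercise 2 and exercise 3 responses differ for the same study.
# Study identifiers, items, and responses are hypothetical.
exercise2 = {"study_1": {"selection_bias": "No",  "ITT": "Yes"}}
exercise3 = {"study_1": {"selection_bias": "Yes", "ITT": "Yes"}}

for study, items_ex2 in exercise2.items():
    for item, response_ex2 in items_ex2.items():
        response_ex3 = exercise3[study][item]
        if response_ex2 != response_ex3:
            print(f"{study}: intra-rater disagreement on '{item}' "
                  f"({response_ex2!r} in exercise 2 vs {response_ex3!r} in exercise 3)")
```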

Figure 6. Comparisons of intra-rater disagreements for reviewers 1, 2, and 3.

We observed that for certain quality items (“description of eligibility criteria,” “description of interventions,” “adjustment for multiple centers,” “intention-to-treat analysis”), the intra-rater reliability (the same quality item assessment in both exercises 2 and 3 for the same study) was satisfactory, as illustrated by the small proportions of disagreements for these items. In contrast, there was considerable within-reviewer variation for multiple quality items relating to the methodology of the included studies, such as “selection bias,” “adjustment for confounders,” “appropriate statistical analysis,” “blinded patients,” and “appropriate randomization.” Disagreement was also observed for items relating to the reporting of the RCTs, such as “definition of outcomes” and “clear reporting.” Such items could be more vulnerable to differential assessments by the same reviewers, as the clarity of reporting may be affected once the format of the paper is modified. For the specific case of “adjustment for confounders,” review of the data entries indicated that two reviewers consistently assigned reverse entries in the quality extraction forms in exercises 2 and 3, although their interpretation of this item was identical in both cases. In exercise 2, the reviewers had assigned a “Yes” to this item, since the studies are RCTs and confounders are considered to be controlled by randomization. In exercise 3, however, the reviewers entered a “Not applicable” response, implying that adjustment for confounders does not apply to RCTs. Thus, despite reaching the same conclusion for this item, the apparent disagreements were not genuine; the discrepancies resulted from consistently discordant entries.

Comparing inter-rater disagreements between exercises 2 and 3 (Figure 7), we observed that reviewers were less often in disagreement regarding the overall study quality rating (54.5 percent in exercise 2 vs. 45.5 percent in exercise 3). Extensive variation was observed in the proportions of disagreements for the individual quality items. Some items showed a diminished extent of disagreement (e.g., “allocation concealment,” or the catch-all item “other issues,” which included quality elements not captured in the checklist). However, anonymization of the papers resulted in an increased proportion of disagreements for several items (e.g., “definition of outcomes,” “appropriate statistics”). We also observed that for certain items with a less subjective interpretation (e.g., blinding of outcome assessors or patients), the extent of disagreement was consistent between exercises 2 and 3.

Figure 7. Inter-rater disagreements in exercises 2 and 3.

Following these analyses, we then assessed the impact of the artificially introduced changes in either of the two selected items (“Dropout rate <20 percent” or “Blinded observers”) on the overall study quality rating.

First, consensus quality ratings were calculated for each of the analyzed RCTs (based on the quality assessments of exercise 2), either by unanimous agreement (e.g., all reviewers assigning “C”) or by a majority vote (e.g., 2 of the 3 reviewers assigning “C”). This yielded three “C” quality studies and eight “B” quality studies in exercise 2. The changes in the two items could be favorable or unfavorable, having the potential to upgrade or downgrade a study’s quality.
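
A minimal sketch of this consensus rule, assuming three reviewers and a simple unanimity-or-majority criterion, is shown below.

```python
# Sketch of the consensus rating rule described above: a rating carried by all
# three reviewers (unanimity) or by two of the three (majority) becomes the consensus.
from collections import Counter

def consensus_rating(reviewer_ratings):
    """Return the unanimous or majority rating among three reviewers."""
    rating, votes = Counter(reviewer_ratings).most_common(1)[0]
    return rating if votes >= 2 else None  # None if all three reviewers disagree

print(consensus_rating(["C", "C", "C"]))  # unanimous -> C
print(consensus_rating(["B", "B", "C"]))  # majority  -> B
```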

Four types of changes were introduced:

  1. An initial dropout rate of <20 percent was changed to >20 percent (Unfavorable)
  2. An initial dropout rate of >20 percent was changed to <20 percent (Favorable)
  3. Initially blinded outcome assessors were changed to unblinded (Unfavorable)
  4. Initially unblinded outcome assessors were changed to blinded (Favorable)

We examined the impact of these changes separately in studies rated as “C” quality (Figure 8) and “B” quality (Figure 9).

Figure 8. Impact of quality item changes in studies with “C” quality. Favorable changes are shown with an upward green arrow, and unfavorable changes are shown with a downward red arrow.

Figure 9. Impact of quality item changes in studies with “B” quality. Favorable changes are shown with an upward green arrow, and unfavorable changes are shown with a downward red arrow.

In the Galetke et al. study, a favorable change in the dropout rate was introduced and was detected by two of the three reviewers, but no change in the overall consensus study rating resulted.

In the Hudgel et al. study, a favorable change in blinding was introduced and was detected by all reviewers, but, again, with no impact on the overall quality rating. The unfavorable change in the dropout rate inserted in the Hussain et al. study had no impact on the study’s quality rating, as expected.

In the case of “B” quality studies, three favorable changes to the item “Blinded observers” were introduced (i.e., initially unblinded outcome assessors were changed to blinded). These changes were detected (i.e., the reviewers recognized the artificial change introduced in exercise 3) in five of the nine instances and did not result in upgrading these “B” quality studies to “A” quality. In two studies, unfavorable changes to blinding were made; these changes were detected in four of the six reviews and, again, had no impact on the overall study quality ratings. In contrast, three unfavorable changes to dropout rates were introduced, and all resulted in downgrading the study quality ratings to “C.” These findings indicate that the proportion of dropouts may play an influential role in the assignment of quality ratings.
