Main Findings

In this pilot project, we performed quality assessments of randomized controlled trials (RCTs) in three consecutive exercises, aiming to gain insight into the variability of quality appraisal and to identify study characteristics that may be influential in determining a study’s overall quality rating.

Concordance or Discordance in Quality Rating by Independent Reviewers

The quality assessments performed in exercise 1 of this project were done without any instructions to the reviewers, thus mirroring the pilot quality assessment process during the initial phases of conducting an Evidence-based Practice Center (EPC) evidence report, before consensus is reached on the content-specific considerations for quality assessments on a given topic. These assessments demonstrated remarkable inter-rater variation in overall study ratings as well as in individual quality items. Disagreements among reviewers ranged from 0 to 100 percent, indicating that the degree of subjectivity varied across items. By reviewing the narrative descriptions the reviewers provided of their thought processes in making these assessments, we identified a series of items that are not normally captured by the standard template quality item checklist but that may still play a role in determining quality ratings. Some of these items (e.g., power calculations, multiplicity of testing, protocol modifications during the trial) are applicable to RCTs across various research topics. Operational definitions could be developed for these items and subsequently considered for routine incorporation in quality item checklists. Other items, however, were specific to the study design or the research topic reviewed, for example, items specific to the crossover RCTs commonly used in the sleep apnea literature, such as the presence of a washout period between two treatment periods. Such content-specific items could be discussed with technical experts during the initial phases of an evidence report and, should they be deemed important, incorporated in a project-specific quality item checklist. Although we did not formally analyze the narrative summaries, their review suggested that study reviewers may often rely on general impressions when assigning quality ratings.
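
To illustrate how such per-item disagreement percentages can be tabulated, the following is a minimal sketch in Python, assuming (purely for illustration) that each reviewer’s item responses are recorded as “yes,” “no,” or “unclear”; the reviewer, study, and item names are hypothetical and not those of the actual checklist.

    from itertools import combinations

    # Hypothetical data: reviewer -> {study -> {item -> response}}; all values are illustrative.
    assessments = {
        "reviewer_1": {"study_01": {"blinded_observers": "yes", "dropout_lt_20": "no"}},
        "reviewer_2": {"study_01": {"blinded_observers": "no",  "dropout_lt_20": "no"}},
        "reviewer_3": {"study_01": {"blinded_observers": "yes", "dropout_lt_20": "unclear"}},
    }

    def item_disagreement(assessments, item):
        """Percent of reviewer pairs, across all studies, that disagree on a given item."""
        studies = next(iter(assessments.values())).keys()
        pairs = disagreements = 0
        for study in studies:
            for r1, r2 in combinations(assessments, 2):
                pairs += 1
                if assessments[r1][study][item] != assessments[r2][study][item]:
                    disagreements += 1
        return 100.0 * disagreements / pairs if pairs else 0.0

    for item in ("blinded_observers", "dropout_lt_20"):
        print(f"{item}: {item_disagreement(assessments, item):.0f}% pairwise disagreement")

A fuller analysis could replace the simple pairwise percentage with a chance-corrected statistic (e.g., a multi-rater kappa), but the tabulation above is enough to make the 0 to 100 percent range concrete.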

Residual Variation in Quality Assessments After Standardizing Some Quality Item Definitions

Unlike in exercise 1, in exercise 2 the reviewers were guided by a list of explicit definitions of quality items. This “calibration” effort reduced inter-rater variability for certain items in the second sample, although disagreements were more common in exercise 2 than in exercise 1 for others, such as “definition of outcomes” and “eligibility criteria”; this finding suggests that assessments of the adequacy of reporting of study characteristics may have greater variability, or that the second literature sample included less clearly reported studies. Additionally, the item for “blinded patient” showed a higher degree of disagreement in exercise 2 than in exercise 1, possibly because blinding of patients was feasible only in the clinical context of the RCTs considered in exercise 2. This observation emphasizes the importance of pilot testing within reviews and of establishing context-specific definitions or decision rules. It was also noteworthy that the variation in responses for certain items (e.g., presence of selection bias, blinding, dropout rates, clear reporting) did not diminish in the second sample after standardization with explicit definitions. Such items were considered to have a more subjective interpretation and could thus account for the differences in the overall quality ratings observed between the reviewers (disagreement in 55 percent of the studies in exercise 2). Nevertheless, the differences between exercises 1 and 2 are not solely attributable to the fact that exercise 2 was performed under the guidance of the item definitions. Between the two exercises, the reviewers also convened, extensively discussed issues, and acquired experience, all of which may have influenced their performance in exercise 2. This “fine-tuning” and interaction between reviewers may be even more important than simply receiving a set of instructions on how to assess a particular item.

Influential Methodological Determinants in the Final Quality Rating

By examining differences between “B” and “C” quality studies in the proportions of studies for which specific quality items were satisfied (received a positive response), we also aimed to gain an appreciation of the relative weight that these items carry in the overall rating. Given the small sample sizes analyzed here (N=11 RCTs in each exercise), no formal statistical comparisons were performed. However, we observed diverging trends in the proportions for specific items, such as “other issues,” “eligibility criteria,” “clear reporting,” “blinded observers,” and “dropout rate <20 percent,” which suggested that these items may be influential. Nevertheless, not all of these items were amenable to sensitivity analyses, i.e., determining their baseline and alternative values and examining the impact of such changes on quality ratings. Thus, we selected two of these items (“blinded observers” and “dropout rate <20 percent”) for sensitivity analyses in the third exercise.
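
As a hedged illustration of this sensitivity analysis scheme, the sketch below (Python; the item names, response codes, and the notion of a favorable/unfavorable setting are assumptions for illustration, not the project’s actual data format) enumerates the modified checklist versions in which the two selected items are set to alternative values for re-rating by the reviewers.

    from itertools import product

    # Baseline checklist for one hypothetical study; item names and responses are illustrative.
    baseline = {"blinded_observers": "no", "dropout_lt_20": "yes", "clear_reporting": "yes"}

    # Items selected for sensitivity analysis, each with a (favorable, unfavorable) pair of settings.
    sensitivity_items = {
        "blinded_observers": ("yes", "no"),
        "dropout_lt_20":     ("yes", "no"),
    }

    def scenarios(baseline, sensitivity_items):
        """Yield (label, modified checklist) for every combination of alternative item values."""
        names = list(sensitivity_items)
        for values in product(*(sensitivity_items[n] for n in names)):
            modified = dict(baseline)
            modified.update(dict(zip(names, values)))
            yield ", ".join(f"{n}={v}" for n, v in zip(names, values)), modified

    # Each scenario would be presented to reviewers for re-rating of the study's overall quality.
    for label, checklist in scenarios(baseline, sensitivity_items):
        print(label, "->", checklist)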

Effects on Quality Ratings After the Provenance of a Paper Has Been Concealed and Influential Factors Such as Study Dropout Rate and Blinding Have Been Modified in a Sensitivity Analysis Scheme

In exercise 3, we examined the concurrent impact of study anonymization and quality item sensitivity analyses. Previous work has shown that “blinding” reviewers to identifying information of papers synthesized in a meta-analysis may not significantly change the summary estimate obtained from the reference “unblinded” meta-analysis.15,16 We observed that replacing the publication-specific format of the articles with anonymized plain text documents resulted in considerable intra-rater variability, i.e., discrepancies in the responses for specific items for the same study by the same reviewer. This may indicate that the format of article presentation (and not the exact amount of information reported) has an impact on the reviewers’ assessment of specific items, particularly those relating to clarity of reporting. It is also possible that anonymization of papers is by itself sufficient to modify study quality ratings, since reviewers are freed from other implicit, subconsciously operating factors, such as the journal of publication or the authors’ names. However, the extent of intra-rater variability observed here may also reflect the fact that the reviewers assessed the same studies again after a 3-week period. Such repeated evaluation may have an inherent, baseline degree of variability, which can be further accentuated when studies with inadequate reporting are evaluated. In such instances, items whose values are uncertain are likely to be interpreted differently at distinct time points.
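
The intra-rater discrepancies described above could be tallied as in the following minimal sketch, assuming hypothetical paired responses (original formatted article versus anonymized plain text) for each reviewer and study; the identifiers and items shown are illustrative only.

    # Hypothetical paired responses for one reviewer and study: item -> (original, anonymized).
    paired_responses = {
        ("reviewer_1", "study_03"): {
            "clear_reporting":   ("yes", "no"),
            "blinded_observers": ("yes", "yes"),
            "dropout_lt_20":     ("no",  "no"),
        },
    }

    def intra_rater_discrepancies(paired):
        """Count, per reviewer/study pair, the items answered differently at the two time points."""
        return {key: sum(first != second for first, second in items.values())
                for key, items in paired.items()}

    print(intra_rater_discrepancies(paired_responses))  # {('reviewer_1', 'study_03'): 1}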

We were not able to assess the isolated impact of anonymization, given that the sensitivity analyses of the two quality items were applied concurrently. Nevertheless, the sensitivity analyses with favorable and unfavorable changes in the two items (“blinded observers” and “dropout rate <20 percent”) provided valuable insight. First, “C” quality studies were not upgraded in any case, indicating that other significant factors may have determined their “C” ratings and that the reviewers felt strongly about the poor quality of these studies. Second, “B” studies were also resistant to upgrading when the outcome assessors in these studies were favorably set to “blinded.” Finally, increasing dropout rates to >20 percent was sufficient to downgrade three “B” quality studies to “C,” an indication that this item may be heavily influential. However, it should be noted that the 20 percent cutoff point as an indication of “large” dropout is arbitrary, and its interpretation may be content-specific; for different types of interventions, a 20 percent dropout rate may not be interpreted as “large.”

Limitations

The results presented here are based on a small sample of RCTs, selected from a single comparative effectiveness review (CER) and assessed by three reviewers from a single EPC. Pilot testing of this quality assessment method was incorporated in the exercises performed in this project; nevertheless, the method represents the default approach to quality assessment at our EPC. The definitions of the items in our checklist were not evaluated for adequacy and clarity beyond their face validity, as judged by the reviewers of this study. We acknowledge that this default checklist may not be in widespread use across evidence synthesis practices and is not directly aligned with the current trend to shift the focus from methodological (and reporting) quality to explicit assessment of the risk of bias of studies. For these reasons, the generalizability and the target audience of this research activity may be limited. The selection of studies may also have been inadequate in terms of the diversity of their perceived quality, since none of the included RCTs was rated as an “A” quality study in the original CER. Some hypotheses (i.e., the effects of anonymization of documents and of the sensitivity analyses) were examined concurrently rather than in isolation. Furthermore, we did not examine how our quality assessment tool compares with other available tools, or how our assessments would differ if applied to a different clinical question. Thus, our findings are preliminary, and no definitive conclusions can or should be drawn from this pilot work.

Implications for Future Studies

Our findings highlight the extensive variability in quality assessments made with a tool that is based on a comprehensive checklist of items but lacks specific decision rules for synthesizing those items. It is unknown how the instrument we used would compare with others, such as the Cochrane risk of bias tool, in terms of inter- and intra-rater variability. More empirical data from larger samples of RCTs (and other study designs) and larger numbers of reviewers could provide critical information on the reproducibility and reliability of quality assessments. Future research could formally compare the reliability of tools that capture a reviewer’s global impression of a study (like ours) with that of tools that have explicit decision rules and a smaller set of items (e.g., the Jadad score or the Cochrane risk of bias tool). The distinction between methodological and reporting quality is also of great importance and should be pursued further in future studies. The effects of some of the parameters examined in our study (e.g., anonymization of papers or provision of instructions) would be more directly estimable through randomized experimental designs, for example, by randomizing reviewers to assessments of published-format versus anonymized papers, as sketched below. Such exercises should be considered in future studies of quality assessments.
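
As one possible, simplified version of such a randomized design (not the protocol of this project; the reviewer and paper identifiers are hypothetical), the sketch below randomly allocates each reviewer-paper pair to assessment of either the published format or the anonymized plain text version.

    import random

    reviewers = ["reviewer_1", "reviewer_2", "reviewer_3"]   # hypothetical identifiers
    papers = ["rct_01", "rct_02", "rct_03", "rct_04"]

    random.seed(42)  # fixed seed so the allocation is reproducible
    allocation = {
        (reviewer, paper): random.choice(["published_format", "anonymized_text"])
        for reviewer in reviewers
        for paper in papers
    }

    for (reviewer, paper), arm in sorted(allocation.items()):
        print(f"{reviewer} assesses {paper} in the {arm} arm")

A real design would likely balance the two formats within each reviewer and stratify by study; simple randomization is shown here only to make the idea concrete.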

It is also plausible that the overall quality rating of a study may be influenced by the quality ratings of other studies that address the same key question. In other words, the quality rating may be more a relative than an absolute measure of risk of bias. The relative thresholds for distinguishing different levels of risk of bias may also vary depending on the clinical topics and questions at hand. Carefully planned exercises and analyses will be needed before these hypotheses can be tested.

Conclusions

In summary, we identified extensive variation in overall study ratings among three experienced reviewers. Our preliminary data indicate that a quality rating assigned by a single reviewer is at high risk of differing from subsequent independent evaluations by additional reviewers, as discrepancies among reviewers in the assignment of quality ratings (and individual items) are relatively common. While it may be desirable to have a single rating assessed by more than one reviewer through a process of reconciliation, in the absence of a gold standard method it may be even more important to report the variation in assessments among different reviewers. A study that engenders large variability in quality assessment may be fundamentally different from one that shows little variation, even if both are assigned the same consensus quality rating. Further assessments are needed to investigate these hypotheses.

Key messages from Phase I of this project include:

  • Quality ratings assigned by three independent reviewers display remarkable variations.
  • Adjudications of individual quality items are commonly discordant. Certain quality items show more extensive variation in their responses than others.
  • Disagreements on overall quality ratings may not be directly reflective of different interpretations of individual items.
  • Items and issues beyond those captured in commonly used checklists may contribute significantly to a reviewer’s assessment of a study.
  • Explicit guidelines on the quality assessment process can attenuate variations in responses.
  • The relative contributions of various quality items are unknown and difficult to quantify. Specific items (e.g., blinding of outcome assessors or dropout rates) appear to be more influential than others.
  • Anonymization of published papers may impact reviewers’ assessments of quality items.

Our findings highlight the need for further empirical research on the inherent variability of study quality assessments. The common disagreements in individual quality components, as well as in overall quality ratings, emphasize the problems with arriving at summary scores or judgments of study quality. The observed intra-rater discordances further highlight these problems and limit the reproducibility of quality assessments, at least for the types of studies and the settings examined here. Given all these considerations, the utility of overall quality ratings is debatable and may have to be revisited. Alternative approaches that simply report individual study limitations, or that point out the limitations felt to be most critical within a given topic, may bypass the problems imposed by the lack of robustness in quality assessments.

Given that a quality rating is the end product of an implicit thought process rather than of a formulaic combination of individual item assessments, it is not surprising that quantifying the contributions of specific quality items to a reviewer’s assessment is difficult. Our methodological approach provides an operational framework with which the inherent subjectivity of quality assessments can be analyzed and the relative contributions of items can be measured, provided that adequate data are gathered. Supplemented by a software tool for depositing and analyzing the data in a standardized format, a larger-scale methodological project could provide quantitative insights into the quality rating process. Such an investigation could potentially be implemented as a cross-EPC collaborative project, resulting in a large repository of quality assessment data through which EPC-related factors could also be investigated. The development of a transparent and robust process for quality assessment of the evidence synthesized in EPCs’ comparative effectiveness reviews will help decisionmakers appreciate the strengths and limitations of the available evidence and thus reach better informed decisions.
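
As a purely illustrative assumption of what a standardized record for such a repository might look like (the field names and codes are hypothetical and do not correspond to any existing EPC system), one possible shape is sketched below in Python.

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class QualityAssessmentRecord:
        """One reviewer's assessment of one study, in a standardized, depositable form."""
        epc: str                 # contributing EPC (hypothetical identifier)
        reviewer_id: str
        study_id: str
        exercise: int            # assessment round
        item_responses: dict     # item name -> "yes" / "no" / "unclear"
        overall_rating: str      # "A", "B", or "C"
        narrative: str           # free-text rationale for the overall rating

    record = QualityAssessmentRecord(
        epc="epc_x",
        reviewer_id="reviewer_1",
        study_id="rct_01",
        exercise=3,
        item_responses={"blinded_observers": "yes", "dropout_lt_20": "no"},
        overall_rating="B",
        narrative="Adequate randomization; dropout above 20 percent limited the rating.",
    )

    print(json.dumps(asdict(record), indent=2))  # serialized for deposit in a shared repository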