Hartling L, Hamm M, Milne A, et al. Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Mar.

Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments [Internet].

Executive Summary

Introduction

The assessment of methodological quality, or risk of bias, of studies included in a systematic review is a key step and serves to: (1) identify the strengths and limitations of the included studies; (2) investigate, and potentially explain, heterogeneity in findings across different studies included in a systematic review; and, (3) grade the strength of evidence for a given question. There are numerous tools to assess methodological quality, or risk of bias, of primary studies; however, few have undergone extensive inter-rater reliability or validity testing. Therefore it is unknown whether, or to what extent, the summary assessments based on these tools differentiate between studies with biased and unbiased results.

There is a need for inter-rater reliability testing of different tools in order to assess and enhance consistency in their application and interpretation across different systematic reviews. Further, validity testing is essential to ensure that the tools being used can identify studies with biased results. Finally, there is a need to determine inter-rater reliability and validity in order to support the uptake and use of individual tools that are being recommended for use by the systematic review community, and specifically the Cochrane Risk of Bias (ROB) tool within the Evidence-based Practice Center (EPC) Program.

Key Questions

The objective of this project was to assess the reliability and validity of quality assessment tools across individual raters and pairs of raters in evaluating study quality in comparative effectiveness reviews and other evidence reports produced through the AHRQ Effective Health Care (EHC) Program. In this work we focused on the Cochrane ROB tool and the Newcastle-Ottawa Scale (NOS). Both tools are recommended and frequently used: the ROB tool in systematic reviews of randomized controlled trials (RCTs) and the NOS in systematic reviews of cohort studies.

The specific objectives were:

  1. To assess the reliability of the Cochrane ROB tool for RCTs and the NOS for cohort studies between individual raters and, for the ROB tool, between the consensus agreements of individual raters (i.e., comparing consensus agreements across four EPCs).
  2. To assess the validity of the Cochrane ROB tool and NOS by examining whether treatment effect estimates vary according to risk of bias or study quality; that is, whether treatment effect estimates differed for studies at high, unclear, or low risk of bias based on the domains of the ROB tool, or for studies with different design characteristics based on the items of the NOS.
  3. To examine the impact of study-level factors (e.g., outcomes, interventions and conditions) on scale reliability and validity.

Methods

Cochrane Risk of Bias Tool and Randomized Controlled Trials

Study Selection: A sample of 154 RCTs was randomly selected from among 616 trials published in December 2006 that were previously examined for quality of reporting.1

Risk of Bias Assessments: We pilot tested the ROB tool and developed decision rules to accompany the guidance for applying the tool that is publicly available in the Cochrane Handbook.2 The tool was applied to each study independently by two reviewers. To assess reliability between consensus agreements, we used a subset of 30 trials. Two reviewers at each of the four collaborating EPCs independently assessed risk of bias and reached consensus. Table A provides an overview of the number of reviewers and number of studies for each component of this study.

Table A. Overview of study components.

Data Extraction: We extracted data on the primary outcome for each trial. Several characteristics of the trial that may be related to risk of bias were extracted, including study type (efficacy, equivalence), study design (parallel, crossover), the condition being treated, type of outcome (subjective, objective), nature of the intervention (pharmacological, nonpharmacological), treatment mode (flexible dose vs. fixed dose), treatment duration, baseline mean difference between study groups for continuous outcomes, the impact of the intervention (treatment effect sizes [ES]), variance in ES, sample size, and funding source. Data extraction for each study was completed by a single reviewer. A 10 percent random sample of trials was checked by a second reviewer.

Data Analysis: For the entire sample of trials, inter-rater agreement between two reviewers was calculated for each domain using the weighted kappa statistic. Agreement was categorized as poor, slight, fair, moderate, substantial, or almost perfect using accepted approaches (Table B).3 Using subgroup analyses, we explored whether inter-rater agreement was influenced by study-level factors, including study design, study hypothesis, nature of the intervention, nature of the outcome, and source of funding. For this purpose, kappas were compared using p-values computed from standard errors and the central limit theorem. For the subset of 30 studies, agreement for consensus assessments across pairs of reviewers was measured using Fleiss' kappa statistic (i.e., the consensus assessments were compared across the pairs of reviewers from four EPCs).

Table B. Interpretation of Fleiss' kappa (κ) (from Landis and Koch 1977).

  κ value      Interpretation
  < 0.00       Poor
  0.00–0.20    Slight
  0.21–0.40    Fair
  0.41–0.60    Moderate
  0.61–0.80    Substantial
  0.81–1.00    Almost perfect
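
To make these agreement statistics concrete, the following is a minimal sketch, written for illustration rather than taken from the project's analysis code, of how a weighted kappa between two reviewers and a Fleiss' kappa across the four EPCs' consensus ratings could be computed. The ratings, the linear weighting scheme, and the helper function landis_koch are assumptions introduced for the example.

```python
# Illustrative sketch only: hypothetical ratings, not data from the report.
# Codes: 0 = low, 1 = unclear, 2 = high risk of bias for one ROB domain.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Domain-level judgments by two individual reviewers for ten hypothetical trials
reviewer_1 = [2, 1, 1, 0, 2, 2, 1, 1, 0, 2]
reviewer_2 = [2, 1, 2, 0, 2, 1, 1, 1, 1, 2]

# Weighted kappa between the two reviewers (linear weights assumed here)
kappa_w = cohen_kappa_score(reviewer_1, reviewer_2, weights="linear")

def landis_koch(kappa):
    """Map a kappa value to the Landis and Koch (1977) category (Table B)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"

print(f"weighted kappa = {kappa_w:.2f} ({landis_koch(kappa_w)})")

# Fleiss' kappa across consensus ratings from the four EPC reviewer pairs
# (rows = trials, columns = each center's consensus judgment)
consensus = np.array([[2, 2, 1, 2],
                      [1, 1, 1, 0],
                      [0, 1, 0, 0],
                      [2, 1, 2, 2],
                      [1, 1, 2, 1]])
counts, _ = aggregate_raters(consensus)  # trials x categories count table
print(f"Fleiss' kappa = {fleiss_kappa(counts):.2f}")
```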

Since there is no gold standard against which the validity of the ROB assessments can be judged, we operationalized construct validity as differences in treatment ES across risk of bias categories (high, unclear, low). For each RCT we calculated an ES for the primary outcome. ES were calculated using Hedges' g for continuous outcomes. Odds ratios were calculated for dichotomous outcomes and converted into ES using a formula described by Chinn.4 The ES for all RCTs were combined using a random effects model.5 We compared the pooled ES for the high, unclear, and low risk of bias categories for each of the six domains and overall risk of bias. The differences were compared using a random effects meta-regression model.
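
The effect-size calculations described above can be sketched as follows. This is an illustrative implementation under stated assumptions (hypothetical trial data, the usual small-sample correction for Hedges' g, and the DerSimonian-Laird estimator as the random effects model), not the code used to produce the report's results.

```python
# Minimal sketch with hypothetical numbers; not the report's analysis code.
import math
import numpy as np

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Bias-corrected standardized mean difference (Hedges' g) and its variance."""
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled
    j = 1 - 3 / (4 * (n1 + n2) - 9)  # small-sample correction factor
    g = j * d
    var = (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))
    return g, var

def chinn_or_to_es(odds_ratio, var_log_or):
    """Convert an odds ratio to a standardized mean difference (Chinn, 2000)."""
    es = math.log(odds_ratio) * math.sqrt(3) / math.pi
    var = var_log_or * 3 / math.pi**2
    return es, var

def dersimonian_laird(es, var):
    """Random effects pooled effect size and its standard error."""
    es, var = np.asarray(es, float), np.asarray(var, float)
    w = 1 / var
    fixed = np.sum(w * es) / np.sum(w)
    q = np.sum(w * (es - fixed) ** 2)
    tau2 = max(0.0, (q - (len(es) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_star = 1 / (var + tau2)
    return np.sum(w_star * es) / np.sum(w_star), math.sqrt(1 / np.sum(w_star))

# Hypothetical primary outcomes from three trials
es1, v1 = hedges_g(10.2, 4.1, 60, 8.9, 4.3, 58)  # continuous outcome
es2, v2 = chinn_or_to_es(0.72, 0.05)             # dichotomous outcome
es3, v3 = hedges_g(5.4, 2.2, 40, 5.0, 2.5, 42)
pooled, se = dersimonian_laird([es1, es2, es3], [v1, v2, v3])
print(f"pooled ES = {pooled:.2f} (SE {se:.2f})")
```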

The effect of specific covariates on risk of bias was analyzed using logistic regression. We also tested these covariates for their effect on the association between risk of bias and ES in a subgroup analysis. The covariates examined were intervention type (pharmacological or nonpharmacological), study design (parallel vs. other), funding source (industry vs. other), type of trial (efficacy/superiority vs. other), and type of outcome (subjective or objective).
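
A hedged sketch of this covariate analysis follows. The covariate names, the simulated data, and the dichotomization of overall risk of bias (high vs. not high) are assumptions made for the illustration; they are not the report's variables or results.

```python
# Illustration with simulated data; not the covariates or estimates from the report.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 154  # same order of magnitude as the RCT sample
covariates = pd.DataFrame({
    "pharmacological":    rng.integers(0, 2, n),
    "parallel_design":    rng.integers(0, 2, n),
    "industry_funded":    rng.integers(0, 2, n),
    "subjective_outcome": rng.integers(0, 2, n),
})
# Simulated judgments in which subjective outcomes make a "high" rating more likely
logit_p = -0.3 + 0.8 * covariates["subjective_outcome"] - 0.4 * covariates["parallel_design"]
high_rob = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# Logistic regression of the binary risk-of-bias judgment on the covariates
model = sm.Logit(high_rob, sm.add_constant(covariates)).fit(disp=False)
print(np.exp(model.params))  # odds ratios for each covariate
```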

Newcastle-Ottawa Scale and Cohort Studies

Study Selection: We identified completed meta-analyses of cohort studies through the EPC Program and Medline. A meta-analysis was considered appropriate to include if it incorporated at least 10 studies, assessed a dichotomous outcome, and had substantial statistical heterogeneity (i.e., I2 > 50 percent). Our final sample included 131 cohort studies from 8 meta-analyses.
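
For reference, the heterogeneity criterion above is the I² statistic, which expresses the percentage of variability in effect estimates that is due to heterogeneity rather than chance. The following is a restatement of the standard definition (not a calculation from the report); it is derived from Cochran's Q:

```latex
I^{2} = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%,
\qquad
Q = \sum_{i=1}^{k} w_i \left(\hat{\theta}_i - \hat{\theta}_{\mathrm{pooled}}\right)^{2},
\quad w_i = \frac{1}{\operatorname{var}(\hat{\theta}_i)}
```

where k is the number of studies in the meta-analysis and the θ̂ᵢ are the study-level effect estimates.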

Quality Assessments: We pilot tested the NOS and developed decision rules to accompany existing guidance for the NOS. All studies were assessed using the NOS independently by two reviewers. Discrepancies were resolved through discussion to produce consensus assessments for each study.

Data Extraction: The outcomes and data for effect estimates were based on the meta-analyses and checked against the primary studies by a single reviewer. The statistician double-checked data that were unclear.

Data Analysis: Inter-rater agreement was calculated for each domain and for overall quality assessment using weighted or unweighted kappa statistics, as appropriate. Agreement was categorized as above (Table B). For the results of the individual meta-analyses, we coded endpoints consistently so that the occurrence of the outcome was the undesired event. Within each meta-analysis, we generated a ratio of odds ratios (i.e., the ratio of the odds ratios for studies with and without the domain of interest, or of high versus low quality as assessed by the NOS). To maintain consistency, we used odds ratios to summarize all meta-analyses, even if this was not the statistic used in the original meta-analysis. The ratios of odds ratios from each meta-analysis were combined to give an overall estimate of differences in effect estimates using meta-analytic techniques with inverse-variance weighting and a random effects model.6
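
The ratio-of-odds-ratios calculation can be illustrated with the short sketch below. The data are hypothetical, and the simple inverse-variance (fixed-effect) pooling within each subgroup is a simplifying assumption for the example; the report combined the meta-analysis-level ratios using inverse-variance weighting with a random effects model, as in the DerSimonian-Laird sketch shown earlier.

```python
# Illustrative sketch with hypothetical numbers; not the report's analysis code.
import math
import numpy as np

def pool_log_or(log_or, var):
    """Inverse-variance pooled log odds ratio and its variance (fixed effect, for simplicity)."""
    w = 1 / np.asarray(var, float)
    return np.sum(w * log_or) / np.sum(w), 1 / np.sum(w)

def log_ratio_of_odds_ratios(log_or_with, var_with, log_or_without, var_without):
    """Log ratio of odds ratios: studies with vs. without the NOS item of interest."""
    p1, v1 = pool_log_or(log_or_with, var_with)
    p0, v0 = pool_log_or(log_or_without, var_without)
    return p1 - p0, v1 + v0

# One hypothetical meta-analysis: log odds ratios and variances by NOS item status
lror, var_lror = log_ratio_of_odds_ratios(
    log_or_with=[-0.35, -0.10, -0.42], var_with=[0.04, 0.06, 0.05],
    log_or_without=[-0.55, -0.70],     var_without=[0.09, 0.07])
print(f"ratio of odds ratios = {math.exp(lror):.2f}")
# The log ratios from the eight meta-analyses would then be pooled with a
# random effects model (e.g., the DerSimonian-Laird estimator sketched earlier).
```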

Results

Results are presented according to the tools we examined: ROB tool for RCTs and NOS for cohort studies.

Risk of Bias and Randomized Controlled Trials

Description of reviewers: Twelve reviewers from two EPCs assessed the RCTs using the ROB tool. Individuals had varying levels of relevant training (10 of 12 had formal training in systematic reviews) and experience with EPC work specifically (9 months to 10 years). For the subset of 30 RCTs, two reviewers from each of the four EPCs applied the ROB tool and reached consensus for each study. The length of time they had worked with an EPC ranged from 2 to 10 years. Six reviewers had formal training in systematic reviews.

Description of sample: We included 154 RCTs: 124 were used to assess inter-rater reliability between two reviewers, and a random sample of 30 was used to assess reliability across pairs of reviewers. The vast majority of trials had overall risk of bias assessments of high (46.8 percent) or unclear (52.6 percent) with only one trial assessed as low risk of bias overall (0.7 percent).

Inter-rater reliability: Inter-rater reliability for the RCTs is presented by domain in Table C. Sequence generation had the highest level of agreement, which was considered substantial. Reliability for the remaining domains was fair.

Table C. Inter-rater reliability on risk of bias assessments, by domain.

A random sample of 30 studies was selected to compare consensus assessments across pairs of reviewers from the four participating EPCs. The results are detailed in Table C. There was moderate agreement for sequence generation, fair agreement for allocation concealment and “other sources of bias,” and slight agreement for the remaining domains and overall risk of bias.

We assessed whether important study-level variables influenced inter-rater reliability. Table D provides a summary of significant findings. These results should be considered exploratory, but provide some direction for developing further guidance to improve inter-rater reliability of ROB assessments.

Table D. Summary of study-level variables and influence on inter-rater reliability.

As a post hoc exercise, we reviewed the disagreements to identify whether they stemmed primarily from reviewers identifying different information in the study reports, or from different interpretation of the criteria. In general, reviewers identified similar information for the domains of sequence generation, allocation concealment, blinding, selective outcome reporting, and “other sources of bias;” however, their judgments for risk of bias based on the same information varied. For incomplete outcome data and for assessing baseline imbalances within “other sources of bias,” reviewers often based their assessments on different information that they extracted from the study.

Validity: No statistically significant differences were found in ES across the six domain-specific and overall risk of bias categories.

Newcastle-Ottawa Scale and Cohort Studies

Description of Reviewers: Sixteen reviewers assessed studies using the NOS. Individuals had varying levels of relevant training (13 had formal training in systematic reviews), and experience with EPC work (4 months to 10 years).

Inter-rater reliability: Inter-rater reliability for the 131 cohort studies is presented by domain in Table E, and ranged from poor to substantial.

Table E. Inter-rater reliability on NOS assessments, by domain.

In general, the reviewers found the tool difficult to use and the decision rules vague, even with the additional guidance we provided as part of this study. A general point that arose was whether to assess each study on the basis of the individual report alone, or in relation to the systematic review question. For example, a cohort study might have reported and assessed comparability between the exposed and nonexposed groups it was designed to investigate, yet reported only subgroup outcome data, without the corresponding baseline comparability data, for the covariate that defined exposure and nonexposure in the meta-analysis of interest. Similarly, there was uncertainty about whether to base assessments on the information contained in the specific study report, or whether to incorporate information from other reports of the same study.

Response options on the NOS caused discordance among reviewers. They found it difficult to determine the difference between some response options (e.g., “truly” vs. “somewhat” representative study population). Furthermore, the importance of the distinction between certain categories was unclear. In some domains multiple responses garnered a star (i.e., a point in the overall score), hence there was no difference in the final score. Reviewers experienced difficulty in interpreting the terminology (e.g., “selected” population) and in some cases the differences between categories were difficult to distinguish (e.g., “structured interview” vs. “written self-report”).

Reviewers also expressed uncertainty regarding the item assessing comparability, unsure whether to indicate that the study controlled for a given confounder if it was not included in the final model due to lack of significance in preliminary analyses. Reviewers expressed uncertainty regarding what some of the domains actually measured (e.g., selection bias vs. applicability). Further, some concerns were raised that the response categories within a domain measured different constructs.

Reviewers commented that they would have liked “unclear” or “no description” options for some of the items.

Validity: We found no association between individual NOS items or overall NOS score and effect estimates.

Summary and Discussion

Key Points

Risk of Bias Tool and Randomized Controlled Trials

  • Inter-rater reliability between reviewers was fair for all domains except sequence generation, which was substantial.
  • Inter-rater reliability between pairs of reviewers was moderate for sequence generation, fair for allocation concealment and “other sources of bias,” and slight for the remaining domains.
  • Low agreement between reviewers suggests the need for more specific guidance regarding interpretation and application of the ROB tool or possibly re-phrasing of items for clarity.
  • Examination of study-level variables and their association with inter-rater agreement identified areas that require specific guidance in applying the ROB tool, for example, the nature of the outcome (objective vs. subjective), study design (parallel vs. other), and trial hypothesis (efficacy/superiority vs. other).
  • Low agreement between pairs of reviewers indicates the potential for inconsistent application and interpretation of the ROB tool across different groups and systematic reviews.
  • Post hoc analyses showed that discrepancies between reviewers most often arose from differences in interpretation of the tool rather than from differences in the information extracted from the studies.
  • No statistically significant differences were found in ES across high, unclear, and low risk of bias categories. Moreover, most RCTs in the sample were assessed as high or unclear risk of bias for many domains.

Newcastle-Ottawa Scale and Cohort Studies

  • Inter-rater reliability between reviewers ranged from poor to substantial, but was poor or fair for the majority of domains.
  • No association was found between individual quality domains and measures of association.

Discussion

Risk of Bias Tool and Randomized Controlled Trials: We found that inter-rater reliability between reviewers was low for all but one domain in the ROB tool. These findings are similar to the results of previous research.7,8 The sample of trials used in this study was not drawn from a single systematic review; rather, the trials were randomly selected from a larger pool and therefore covered a wide range of topics. This may have contributed to some of the low agreement, as reviewers had to consider different nuances for each trial. Previous research has demonstrated greater agreement within the context of a systematic review, where all trials examined the same interventions in similar populations.7,8 Nevertheless, the low agreement points to the need for clear and detailed guidance on applying the ROB tool. One of the unique contributions of the present study was the analysis of inter-rater reliability stratified by study-level variables. This provides some direction for where more specific guidance may be beneficial. For example, agreement was considerably lower for allocation concealment when trials did not have a parallel design, and for blinding when the nature of the outcome was subjective. Agreement may be better in classic parallel trials of pharmacological interventions, whereas trials with different design features (e.g., crossover) or hypotheses (e.g., equivalence, noninferiority), and those examining nonpharmacological interventions, appear to introduce more ambiguity into risk of bias assessments.

Another unique contribution of the present study was the examination of the consensus ratings across pairs of reviewers. These ratings should be free of individual rater errors and bias given that these are combined ratings with disagreements resolved (assuming consensus was based on joint decisionmaking and not deference to the more senior reviewer). Further, this is a more meaningful measure of agreement (as opposed to reliability between two reviewers), as these ratings are the ones reported in systematic reviews. In this study, the pairs of reviewers were from four different centers, each with a long history of producing systematic reviews. The agreement across the pairs of reviewers was generally lower than the agreement between reviewers. This raises concerns about the variability in interpreting and applying the ROB tool that can occur across different groups and across systematic reviews. Further, we found that discrepancies more often resulted from interpretation of the tool rather than different information being identified and recorded for the same study.

Overall risk of bias was high or unclear in 99 percent of the studies used for this research. This is consistent with other studies where the vast majority of trials have been assessed as high or unclear risk of bias overall. If the majority of trials are assessed as high or unclear risk of bias, the ROB tool may not be sensitive to differences in methodology that might explain variation in treatment effect estimates across studies (e.g., study methodology as a potential explanation for heterogeneity in meta-analyses). Questions also arise regarding whether assessments of poor quality are a result of inadequate or unclear reporting at the trial level. While the focus of the ROB tool is intended to be on methods rather than reporting, reviewers regularly indicate that they rely on the trial reporting to make their assessments. Even within recent samples of trials that were published after the emergence and widespread dissemination of reporting guidelines, we see high proportions assessed as high or unclear risk of bias.

We found no statistically significant association between effect estimates and risk of bias assessments. There are three main explanations for this finding. The first is that there was in fact no association between effect estimates and risk of bias. The second is that bias can either underestimate or overestimate treatment effects; hence, when studies were combined, the associations may have cancelled out. Both of these explanations may have resulted in part from the sample of studies selected for this research. The third, and possibly most likely, is that there was insufficient power to detect differences. One of the factors contributing to low power was the small number of studies in the low risk of bias category for certain domains.

Newcastle-Ottawa Scale and Cohort Studies: This is the first study to our knowledge that has examined the inter-rater reliability and construct validity of the NOS. We found a wide range in the degree of agreement across the domains of the NOS, ranging from poor to substantial. The substantial agreement observed for one domain was not surprising: this domain asked, “was the followup long enough for the outcome to occur?” and, a priori, we had asked clinical experts to provide the minimum length of followup for each review question, so the assessors had very specific guidance for this item. The agreement for ascertainment of exposure and assessment of outcome was moderate, suggesting that the wording and response options are reasonable. The remaining items had poor, slight, or fair agreement.

We found no association between NOS items and the measures of association using meta-epidemiological methods that control for heterogeneity due to condition and intervention.

Implications for Practice

The findings of this research have important implications for practice and the interpretation of evidence. The low level of agreement between reviewers and pairs of reviewers puts into question the validity of risk of bias/quality assessments made with the ROB tool and the NOS within any given systematic review. Moreover, in measurement theory, reliability is a necessary condition for validity (i.e., without being reliable a test cannot be valid). Systematic reviewers are urged to incorporate considerations of risk of bias/quality into their results. Furthermore, integration of the GRADE tool into systematic reviews necessitates the consideration of risk of bias/quality assessments in rating the strength of evidence and ultimately recommendations for practice. The results and their interpretation in a systematic review will be misleading if they are based on flawed assessments of risk of bias/quality. Moreover, variability across reviewers and review groups may produce arbitrary results.

There is an urgent need for more detailed guidance to apply these tools. In the meantime, reviewers and review teams need to be aware of the limitations of existing tools. Detailed guidelines, decision rules, and transparency are needed so that readers and end-users of systematic reviews can see how the tools were applied. Further, pilot testing and development of review-specific guidelines and decision rules should be mandatory and reported in detail.

The NOS in its current form does not appear to provide reliable quality assessments and requires further development and more detailed guidance. The NOS was previously endorsed by The Cochrane Collaboration; however, more recently the Collaboration has proposed a modified ROB tool to be used for nonrandomized studies. A new tool developed through the EPC Program for quality assessment of nonrandomized studies offers another alternative. These tools warrant further evaluation.

Future Research

There is a need for more detailed guidelines to apply the ROB tool and NOS, as well as revisions to both tools to enhance clarity. Additional testing should occur after revisions to the tools and when expanded guidelines are available. We have identified specific RCT features for which clearer guidance is needed. A living database that collects examples of risk of bias/quality assessments and consensus judgments from a group of experts would be a valuable contribution to this field. We have identified specific problems with the application and interpretation of the NOS. Further revisions and guidance are needed to support the continued use of the NOS in systematic reviews. Investment in further reliability and validity testing of other tools is warranted (e.g., the Cochrane ROB tool for nonrandomized studies, the EPC quality assessment tool). Finally, consensus in this field is needed regarding the threshold for inter-rater reliability of a measurement before it can be used for any purpose, even descriptive purposes (i.e., describing the risk of bias or quality of a set of studies).

Strengths and Limitations

This is one of the few studies examining the reliability and validity of the ROB tool. It is the first to our knowledge that examines reliability between the consensus assessments of pairs of reviewers for a systematic review quality/risk of bias assessment tool. Further, it is the first study to provide empirical evidence on study-level variables that may affect the reliability of ROB assessments. It is also the first study to our knowledge to examine the reliability and validity of the NOS.

The main limitation of the research is that the sample sizes (154 RCTs, 131 cohort studies) may not have provided sufficient power to detect statistically significant differences in effect estimates according to risk of bias/quality. Another potential limitation is that we did not use a ‘meta-epidemiological approach’ (i.e., reanalysis of data from existing meta-analyses)6 to examine the association between effect estimates and risk of bias; therefore, the heterogeneity across trials may have limited our ability to detect differences. We involved a number of reviewers with different levels of training, types of training, and extent of experience in quality assessment and systematic reviews. Some of the variability or low agreement may be attributable to characteristics of the reviewers. Nevertheless, all reviewers had previous experience in systematic reviews and quality assessments, and they likely represent the range of individuals typically involved in these activities within a systematic review.

A final caveat to note is that the ROB tool has undergone some revisions since we initiated the study. These are detailed in the most recent version of the Cochrane Handbook but were not incorporated into our research. This does not impact the general findings from our research; however, further testing with the modified tool is warranted.

Conclusions

More specific guidance is needed to apply and interpret risk of bias/quality tools. We identified a number of study-level factors that influence agreement. This information provides direction for more detailed guidance. Low agreement between reviewers has implications for incorporation of risk of bias into results and grading the strength of evidence. Low agreement across pairs of reviewers has implications for interpretation of evidence reported by different groups. There was variable agreement across items in the NOS. This finding, combined with a lack of evidence that it discriminates studies that may provide biased results, underscores the need for more detailed guidance to apply the tool in systematic reviews.

References

1.
Hopewell S, Dutton S, Yu LM, et al. The quality of reports of randomised trials in 2000 and 2006: comparative study of articles indexed in PubMed. BMJ. 2010;340:c723. [PMC free article: PMC2844941] [PubMed: 20332510]
2.
Higgins JP, Thompson SG, Deeks JJ, et al. Measuring inconsistency in meta-analyses. BMJ. 2003;327(7414):557–60. [PMC free article: PMC192859] [PubMed: 12958120]
3.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159–74. [PubMed: 843571]
4.
Chinn S. A simple method for converting an odds ratio to effect size for use in meta-analysis. Stat Med. 2000;19(22):3127–31. [PubMed: 11113947]
5.
DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7(3):177–88. [PubMed: 3802833]
6.
Sterne JA, Juni P, Schulz KF, et al. Statistical methods for assessing the influence of study characteristics on treatment effects in ‘meta-epidemiological’ research. Stat Med. 2002;21(11):1513–24. [PubMed: 12111917]
7.
Hartling L, Ospina M, Liang Y, et al. Risk of bias versus quality assessment of randomised controlled trials: cross sectional study. BMJ. 2009;339:b4012. [PMC free article: PMC2764034] [PubMed: 19841007]
8.
Hartling L, Bond K, Vandermeer B, et al. Applying the risk of bias tool in a systematic review of combination long-acting beta-agonists and inhaled corticosteroids for persistent asthma. PLoS One. 2011;6(2):e17242. [PMC free article: PMC3044729] [PubMed: 21390219]
