
Hartling L, Hamm M, Milne A, et al. Validity and Inter-Rater Reliability Testing of Quality Assessment Instruments [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2012 Mar.


Methods

Steering Committee

A steering committee provided direction to the individual components of the project. The committee served a function similar to that of the technical expert panel used in evidence reports.

General Approach

We developed a protocol that detailed our methods prior to the start of the study. The protocol was reviewed by the Steering Committee and approved by AHRQ.

We proposed two different statistical approaches to assess validity against the treatment effect size (ES), which we consider a form of construct validity. The first approach was based on effect estimates from primary studies, while the second was a meta-epidemiologic approach, which controls for confounding and heterogeneity due to study-level factors (e.g., methodology, outcomes, interventions/exposures, and conditions). We chose the first approach for our analysis of RCTs so that our results could be compared directly with other related work, in particular a similar analysis that was restricted to pediatric RCTs.38 We chose the second approach for the cohort studies because it is considered by some to be more methodologically robust.

Risk of Bias and Randomized Controlled Trials

Study Selection

A sample of 154 recently conducted RCTs involving adults was randomly selected from a set of 616 trials that were previously examined for quality of reporting by Hopewell and colleagues (Appendix A).44 We chose this sample because it offered several advantages, including efficiencies in sample identification and the potential to validate assessments of key variables by comparing them with those of another independent study team. The original sample included all primary reports of RCTs indexed in PubMed in December 2006;44 we therefore consider our sample likely to be representative of RCTs published in the medical literature.

Sample size calculations for this type of research are challenging and cannot be done using the standard approaches applied to other research designs, such as RCTs, because a number of the required parameters are presently unknown for research of this nature. We therefore used a pragmatic approach to determine sample size, based on previous studies in this area, input from the Steering Committee, and the availability of resources and timelines. We chose to select a 25 percent random sample of the 616 trials described above.

Risk of Bias Assessments

The ROB tool was applied to each study independently by two reviewers with training and experience in using the tool. The pool of reviewers was drawn from staff at the University of Alberta EPC and the University of Ottawa EPC. To assess the reliability of consensus assessments across reviewer pairs, we used a subset of 30 trials randomly selected from the sample of 154 trials described above. As above, the sample size for this subset was based on practical considerations, in particular the time available for two reviewers at each of the four collaborating EPCs (University of Alberta EPC, McMaster EPC, University of Ottawa EPC, Southern California/RAND EPC). Two reviewers at each of these EPCs independently assessed risk of bias and then reached consensus. Table 1 provides an overview of the number of reviewers and studies for each component of this study.

Table 1. Overview of study components.

All reviewers involved in the project pilot tested the ROB tool. We applied the tool to five trials and met by teleconference to discuss any disagreements in general interpretation of the tool. Decision rules were developed to accompany the guidance for applying the tool that is publicly available in the Cochrane Handbook (Appendix B).1 It should be noted that the ROB tool has been slightly modified since we started this project in 2010 and new guidelines are available.11 In this project we used the original ROB tool. We planned for pilot testing of an additional sample of five trials if there was substantial disagreement. This was not deemed necessary after the initial pilot testing phase.

Data Extraction

For each trial, the primary outcome was identified and the data necessary to calculate ES were extracted. Several characteristics of the trial that may also be related to risk of bias/quality were extracted, including study type (efficacy, equivalence), study design (parallel, crossover), the condition being treated, nature of the intervention (pharmacological, nonpharmacological), treatment mode (flexible dose vs. fixed dose), treatment duration, type of outcome (subjective, objective), baseline mean difference between study groups for continuous outcomes, the impact of the intervention (treatment ES), variance in ES, sample size, and funding source (Appendix C). This list of variables was compiled prior to commencing data extraction with input from the Steering Committee.

Data extraction for each study was completed at the University of Alberta EPC by a single reviewer. A 10 percent random sample of trials with extracted data, including 10 percent of the trials assessed by each reviewer, was checked by a second reviewer. We planned to check an additional 10 percent if there were important or consistent errors, inaccuracies, or omissions. This was not deemed necessary, as there were few errors found.

Data Analysis

Reliability of the ROB tool. For the entire sample of trials, inter-rater agreement between two reviewers was calculated for each domain using the weighted kappa statistic as described by Liebetrau.45 Agreement was categorized as poor, slight, fair, moderate, substantial, or almost perfect using accepted approaches (Table 2).46 The individual kappa statistics for each ROB item are presented and summarized. For the subset of 30 studies, agreement for consensus assessments across pairs of reviewers was assessed using Fleiss' kappa statistics (i.e., the consensus assessments were compared across the pairs of reviewers from four EPCs).47

Table 2. Interpretation of Fleiss' kappa (κ) (from Landis and Koch 1977).
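To make the agreement calculations concrete, the sketch below computes a weighted Cohen's kappa for one pair of reviewers and a Fleiss' kappa across the consensus ratings of the four EPC pairs. The ratings are hypothetical, linear weights are assumed for illustration, and the scikit-learn and statsmodels functions stand in for the software actually used (StatXact and Excel; see Software below).

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical risk of bias judgments coded 0 = low, 1 = unclear, 2 = high.
reviewer_a = np.array([0, 0, 1, 2, 1, 0, 2, 1, 0, 2])
reviewer_b = np.array([0, 1, 1, 2, 0, 0, 2, 2, 0, 1])

# Weighted kappa for one domain; linear weights treat the three ordinal
# categories (low/unclear/high) as equally spaced.
kappa_w = cohen_kappa_score(reviewer_a, reviewer_b, weights="linear")
print(f"Weighted Cohen's kappa: {kappa_w:.2f}")

# Fleiss' kappa for consensus assessments: rows are trials, columns are the
# consensus judgments from the four EPC reviewer pairs (hypothetical values).
consensus = np.array([
    [0, 0, 1, 0],
    [2, 2, 2, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [2, 2, 1, 2],
])
counts, _ = aggregate_raters(consensus)  # trials x categories count table
print(f"Fleiss' kappa across EPC pairs: {fleiss_kappa(counts):.2f}")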

Validity of the ROB tool. Since there is no gold standard against which the validity of ROB assessments can be judged, we operationalized construct validity as differences in treatment ES across risk of bias categories (high, unclear, low).

For each RCT we calculated an ES for the primary outcome. If the primary outcome was not stated by the authors, a series of decision rules was applied. First, an objective outcome was selected over a subjective outcome; second, an outcome used as the basis for a sample size calculation was considered the primary outcome; and third, if neither criterion was met, the first outcome listed in the Results section was selected. ESs were calculated using Hedges' g for continuous outcomes.48 Odds ratios were calculated for dichotomous outcomes and converted to ESs using the following formula:49

ES = (√3 / π) × ln(OR)
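As an illustration of these calculations, the sketch below computes Hedges' g from summary statistics and applies the conversion above to an odds ratio; all input values are hypothetical.

import math

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Hedges' g: standardized mean difference with a small-sample correction."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2))
    d = (mean_t - mean_c) / sd_pooled
    return d * (1 - 3 / (4 * (n_t + n_c) - 9))  # Hedges' correction factor

def es_from_odds_ratio(odds_ratio):
    """Convert an odds ratio to a standardized ES: (sqrt(3)/pi) * ln(OR)."""
    return math.sqrt(3) / math.pi * math.log(odds_ratio)

# Hypothetical continuous and dichotomous primary outcomes.
print(f"Hedges' g: {hedges_g(12.0, 10.5, 4.0, 4.2, 60, 58):.2f}")
print(f"ES from OR: {es_from_odds_ratio(0.65):.2f}")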

The ESs from all RCTs were then combined using a random effects model.50 We compared the pooled ES for the high, unclear, and low risk of bias categories for each of the six domains and for overall risk of bias. The differences were compared statistically using a random effects meta-regression model.
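A minimal sketch of one common random effects implementation, DerSimonian-Laird pooling, is shown below with hypothetical ESs and variances; the actual analyses were run in Stata and Review Manager (see Software).

import numpy as np

def dersimonian_laird(es, var):
    """Pool ESs with a DerSimonian-Laird random effects model."""
    es, var = np.asarray(es, float), np.asarray(var, float)
    w = 1.0 / var                              # fixed-effect (inverse-variance) weights
    fixed_mean = np.sum(w * es) / np.sum(w)
    q = np.sum(w * (es - fixed_mean) ** 2)     # Cochran's Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(es) - 1)) / c)   # between-study variance
    w_re = 1.0 / (var + tau2)                  # random-effects weights
    pooled = np.sum(w_re * es) / np.sum(w_re)
    return pooled, np.sqrt(1.0 / np.sum(w_re)), tau2

# Hypothetical ESs and variances, e.g., trials rated low risk of bias on one domain.
pooled, se, tau2 = dersimonian_laird([0.20, 0.50, 0.10, 0.35], [0.04, 0.09, 0.02, 0.06])
print(f"Pooled ES = {pooled:.2f} (SE {se:.2f}), tau^2 = {tau2:.3f}")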

The effect of specific covariates on risk of bias was analyzed using a logistic regression. We also tested these covariates for their effect on the association between risk of bias and ES in a subgroup analysis. For this purpose, kappas were compared using p-values computed from standard errors and the central limit theorem. The covariates examined were intervention type (pharmacological or nonpharmacological), nature of the intervention (behavioral/psychological, device, drug, natural health product, surgical, vaccine, other), study design (parallel vs. other), funding source (industry vs. other), type of trial (efficacy/superiority vs. other), and nature of outcome (subjective or objective).
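The comparison of kappa statistics described above can be sketched as a normal-approximation test of the difference between two independent kappas; the values and standard errors below are hypothetical, and the exact standard error formula depends on the kappa statistic used.

from math import sqrt
from scipy.stats import norm

def compare_kappas(kappa1, se1, kappa2, se2):
    """Two-sided p-value for the difference between two independent kappas,
    using the normal approximation justified by the central limit theorem."""
    z = (kappa1 - kappa2) / sqrt(se1 ** 2 + se2 ** 2)
    return 2 * norm.sf(abs(z))

# Hypothetical: agreement for pharmacological vs. nonpharmacological trials.
print(f"p = {compare_kappas(0.62, 0.08, 0.45, 0.10):.3f}")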

Software: Cohen's and weighted kappa statistics were obtained using StatXact 7.0, while Fleiss' kappa was computed manually in Microsoft Excel 2007. Meta-regression analysis was performed using Stata/IC version 11.2, and meta-analysis was done both in Stata and Review Manager version 5.1.5.

Newcastle-Ottawa Scale and Cohort Studies

Study Selection

We used an iterative approach to identify a sample of cohort studies based on meta-analyses of cohort studies. Initially, we searched completed EPC reports to identify meta-analyses of cohort studies. We found 3 EPC reports51-53 that included 36 cohort studies meeting the inclusion criteria (see below). We subsequently conducted searches in Medline using search terms to capture systematic reviews (meta-analys?s.mp, review.pt and search.tw), cohort studies (exp Cohort Studies/, cohort$.tw, (observation$ adj stud$).tw), and meta-analyses (exp meta-analysis/, (analysis adj3 (group$ or pool$)).tw, (forest adj plot$).mp). Results were limited to English-language studies in humans published in 2000 or later. We searched by year, starting with the most recent, and continued until we identified a sufficient number of studies.

A meta-analysis was considered appropriate to include if it had at least 10 cohort studies, assessed a dichotomous outcome, and had substantial statistical heterogeneity (i.e., I2 > 50 percent). Previous meta-epidemiological research has used a minimum sample size per meta-analysis of 5 to 10 studies.16,54 This ensures that there is a sufficient pool of studies with some degree of variability in each meta-analysis in order to test the hypotheses. Some degree of heterogeneity is required in order to test whether quality, as assessed by the NOS, can differentiate studies with different effect estimates.
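For reference, the I² criterion can be computed from Cochran's Q; the sketch below uses hypothetical log odds ratios and variances for one candidate meta-analysis.

import numpy as np

def i_squared(es, var):
    """I^2: percentage of total variation across studies due to heterogeneity."""
    es, var = np.asarray(es, float), np.asarray(var, float)
    w = 1.0 / var
    q = np.sum(w * (es - np.sum(w * es) / np.sum(w)) ** 2)  # Cochran's Q
    return 0.0 if q == 0 else max(0.0, (q - (len(es) - 1)) / q) * 100.0

# Hypothetical log odds ratios and variances from 10 cohort studies.
log_or = np.log([1.8, 0.9, 2.4, 1.2, 3.1, 1.0, 1.6, 2.0, 0.8, 2.7])
variances = [0.05, 0.10, 0.08, 0.04, 0.12, 0.06, 0.09, 0.07, 0.11, 0.05]
print(f"I^2 = {i_squared(log_or, variances):.0f}%")  # eligible if > 50 percent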

Our target sample size was 125 cohort studies. Initially, 144 cohort studies from 8 meta-analyses were identified; however, 13 studies were not assessed because they were later determined to have an ineligible study design (4 RCTs,55-58 6 case series/case-control studies59-64) or could not be retrieved.65-67 Our final sample included 131 cohort studies (Appendix D).

Quality Assessments

All studies were independently assessed by two reviewers using the NOS. One reviewer was from the University of Alberta EPC and one reviewer was from the University of Ottawa EPC. Discrepancies were resolved through discussion to produce consensus assessments for each study.

Reviewers pilot tested the NOS on three studies68-70 and met by teleconference to discuss any disagreements in general interpretation of the tool. Decision rules were developed to accompany existing guidance for the NOS (Appendix E and F). A priori we asked clinical experts to provide the minimum length of followup for each review question (Appendix F). We planned for pilot testing of an additional sample of studies if there was substantial disagreement. This was not deemed necessary after the initial pilot testing phase.

Data Extraction

The outcomes and data for effect estimates were based on the meta-analysis and checked against the primary studies by a single reviewer. The statistician double-checked data that were unclear.

Data Analysis

Reliability of the NOS. Inter-rater agreement was calculated for each domain and for overall quality assessment using weighted45 or unweighted Cohen's kappa statistics,71 as appropriate. The former was used when the studies could be classified into one of three or more ordinal categories, while the latter was used when only two categories were possible. Agreement was categorized as above.46

Validity of the NOS. For the results of the individual meta-analyses, we coded endpoints consistently so that the outcome occurrence was undesired (e.g., death vs. survival). Within each meta-analysis, we generated a ratio of odds ratios (i.e., the ratio of pooled odds ratios for studies with and without the domain of interest, or of high versus low quality, as assessed by the NOS). To maintain consistency, we used odds ratios to summarize all meta-analyses, even if this was not the statistic used in the original meta-analysis. The ratios of odds ratios for each meta-analysis were combined to give an overall estimate of differences in effect estimates using meta-analytic techniques with inverse-variance weighting and a random effects model.72
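The ratio-of-odds-ratios step can be sketched as follows. The data are hypothetical, and a simple inverse-variance (fixed-effect) pooling is used within each quality stratum purely for illustration; the per-meta-analysis log RORs would then be combined across meta-analyses with an inverse-variance random effects model as described above.

import numpy as np

def pooled_log_or(log_or, var):
    """Inverse-variance pooled log odds ratio and its variance."""
    w = 1.0 / np.asarray(var, float)
    return np.sum(w * np.asarray(log_or, float)) / np.sum(w), 1.0 / np.sum(w)

def log_ratio_of_odds_ratios(log_or_a, var_a, log_or_b, var_b):
    """Log ROR comparing studies with (a) and without (b) the NOS domain of
    interest in one meta-analysis, with the variance of the difference."""
    lo_a, v_a = pooled_log_or(log_or_a, var_a)
    lo_b, v_b = pooled_log_or(log_or_b, var_b)
    return lo_a - lo_b, v_a + v_b

# Hypothetical meta-analysis split on one NOS domain (outcome coded as undesired).
lror, vror = log_ratio_of_odds_ratios(
    log_or_a=np.log([1.3, 1.1, 1.5]), var_a=[0.04, 0.06, 0.05],
    log_or_b=np.log([1.9, 2.2, 1.7]), var_b=[0.07, 0.05, 0.08],
)
print(f"ROR = {np.exp(lror):.2f} (variance of log ROR = {vror:.3f})")
# The log RORs and their variances from each meta-analysis would then be
# pooled across meta-analyses with an inverse-variance random effects model.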
