NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Gartlehner G, Dobrescu A, Evans TS, et al. Assessing the Predictive Validity of Strength of Evidence Grades: A Meta-Epidemiological Study [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2015 Sep.
We used a meta-epidemiological approach based on large, systematically appraised bodies of evidence that authors of Cochrane reports had graded as high SOE. We used effect estimates of such bodies of evidence as reference points because a grade of high SOE implies that investigators were very confident that the estimate of effect is close to the truth and that new studies are unlikely to change conclusions. Thus, we used these estimates as “gold standards” to determine the predictive validity. We did not assess the correctness of SOE grades in the Cochrane reports because we wanted to take a pragmatic perspective using real-world examples rather than an explanatory perspective using an ideal dataset. We assumed that users of systematic reviews would also take grades of SOE at face value.
Assembling Empirical Data
We searched the Cochrane Library from 2010 onward to find Cochrane reports that: (1) include an outcome with more than eight randomized controlled trials (RCTs) on therapeutic interventions that had been graded as high SOE; (2) present meta-analytic outcomes that were reported as relative risks or odds ratios for binary outcomes or as weighted mean differences or standardized mean differences (SMDs) for continuous outcomes; and (3) provide data to reproduce the meta-analyses. We chose a threshold of eight RCTs so that we had enough studies to meta-analyze portions of these bodies of evidence in a chronological order of publication.
Overall, we drew information from 37 Cochrane reports on 50 bodies of evidence that had been graded as high SOE. Table 2 presents characteristics of these bodies of evidence.
Preparing “Gradeable” Documents
From each of the 50 included bodies of evidence, we used portions in a chronological order of publication to prepare a total of 160 documents (which we called “gradeable” documents) reflecting different SOE categories. Sample size calculations indicated that 130 documents would provide 80 percent power for a 4 × 2 chi-square test of SOE (high, medium, low, or insufficient) by stability of results (stable vs. not stable) for a medium-sized effect (Cohen's d of 0.3) as a threshold for stability.
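A power calculation of this kind can be sketched with the noncentral chi-square distribution, interpreting the medium effect as Cohen's w = 0.3 and using df = (4 − 1)(2 − 1) = 3 for a 4 × 2 table at alpha = 0.05. This is an illustrative reconstruction, not the study's own computation; under these assumptions it yields a required sample in the low 120s, the same order of magnitude as the 130 documents reported.

```python
from scipy.stats import chi2, ncx2

def chi_square_power(n, w=0.3, df=3, alpha=0.05):
    """Power of a chi-square test of association for sample size n
    and effect size w (Cohen's w)."""
    crit = chi2.ppf(1 - alpha, df)      # critical value under H0
    ncp = n * w ** 2                    # noncentrality parameter under H1
    return 1 - ncx2.cdf(crit, df, ncp)  # power = P(reject H0 | H1)

# Smallest n reaching 80 percent power
n = 1
while chi_square_power(n) < 0.80:
    n += 1
print(n)
```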
In a first step, we reanalyzed each body of evidence using cumulative meta-analyses. In general, a cumulative meta-analysis shows how the body of evidence evolves over time as new studies accrue. Likewise, the SOE changes (or can be expected to change) over time as new studies contribute to the body of evidence. Using information from the cumulative meta-analyses and information about individual studies from the Cochrane reports (e.g., risk of bias ratings), an independent investigator (who was not involved in the subsequent grading of the SOE) meta-analyzed the portions of the high-strength bodies of evidence in a chronological order (e.g., the first four studies, the first six studies, etc.) to prepare the gradeable documents.
Figure 1 illustrates this concept. The investigator took risk of bias of individual studies, precision of estimates, consistency of studies, indirectness, and the other domains of the grading scheme into consideration to decide what portions of studies were used for the gradeable documents.
The aim was to create approximately 40 documents for each category of SOE with sufficient information for the project's investigators to grade the SOE. These documents included: information on the objective of the Cochrane review; the PICO (population-intervention-control-outcome); study characteristics and risk of bias ratings of included trials as presented in the Cochrane report; a forest plot of a random effects meta-analysis; information about minimal important differences for continuous outcomes; and information about reporting bias (funnel plot, Kendall's tau, Egger's regression intercept, and Fail-Safe N). We relied on judgments of the Cochrane authors regarding risk of bias of individual trials. We pilot-tested the format and content of the gradeable documents and revised them based on feedback from investigators. Appendix A provides an example of a gradeable document.
Grading Strength of Evidence
To grade the SOE, investigators used EPC guidance for GRADE. Investigators took part in a calibration exercise and had access to a published guidance document.4
We randomly allocated 160 gradeable documents to 13 investigators from six U.S. and Canadian EPCs and Cochrane Austria. All are professional systematic reviewers; however, their experience with GRADE varied. Three investigators (23 percent) stated that they had used the GRADE approach for more than 20 systematic reviews; three (23 percent) used the approach for 10 to 15 systematic reviews; one (8 percent) used the approach for 6 to 10 reviews; and six investigators (46 percent) declared that they had used GRADE for up to 5 systematic reviews.
A research associate at RTI International connected each participant with a unique identification number and emailed the gradeable documents. This research associate was not involved in either the grading exercise or analysis of results. Two investigators, blinded to the results of the underlying Cochrane report (i.e., the reference standard), graded each body of evidence independently. Investigators were blinded to the second person grading the same body of evidence. When grades differed, the research associate put investigators in contact with each other; investigators resolved conflicts by consensus or by involving a third, senior researcher.
Assessing the Stability of Effect Estimates
To determine the stability of effects, we compared effect estimates of the gradeable documents with the high SOE estimates from the Cochrane reports (the gold standard). To do so, we modified an approach developed to detect signals for updating systematic reviews.45 We used three definitions of stability (Table 3), which differed in the thresholds that determined whether the magnitude of treatment effects was similar. We deemed an estimate of effect as stable when (1) statistical significance did not change and (2) the magnitude of treatment effects remained similar to the high SOE estimate of the Cochrane report.
To avoid counting trivial or ‘borderline’ changes in statistical significance, we required that at least one of the two results have a p-value outside the range of 0.04 to 0.06. In other words, we did not count a change in statistical significance when both p-values fell within this range. For example, neither a change from p=0.041 to p=0.059 nor a change from p=0.059 to p=0.041 counted as a change in statistical significance.
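These two stability criteria can be expressed as a small decision rule. The significance check below follows the 0.04–0.06 buffer described above; the 20 percent relative-change threshold for "similar magnitude" is purely illustrative, since the study's actual thresholds are defined in Table 3.

```python
def significance_changed(p_old, p_new, alpha=0.05, buffer=(0.04, 0.06)):
    """True if statistical significance changed, ignoring 'borderline'
    flips where both p-values fall inside the 0.04-0.06 buffer zone."""
    lo, hi = buffer
    if lo <= p_old <= hi and lo <= p_new <= hi:
        return False                              # borderline flip: ignored
    return (p_old < alpha) != (p_new < alpha)     # did significance flip?

def is_stable(est_partial, p_partial, est_ref, p_ref, rel_threshold=0.20):
    """Stable iff (1) significance is unchanged and (2) the magnitude
    stays within rel_threshold of the reference (gold-standard) estimate.
    The 20 percent threshold is a hypothetical stand-in for Table 3."""
    if significance_changed(p_partial, p_ref):
        return False
    return abs(est_partial - est_ref) <= rel_threshold * abs(est_ref)
```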
Conducting Statistical Analysis
To assess the inter-rater reliability of reviewers grading the SOE, we calculated intra-class correlations using a one-way random effects model. Intra-class correlations measure the consistency of agreement of reviewers when dually grading bodies of evidence.
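The one-way random-effects intraclass correlation, ICC(1), can be computed from the between-subject and within-subject mean squares. The sketch below assumes SOE grades have been mapped to numbers (e.g., insufficient = 1 through high = 4, a coding the report does not specify), with one row per dually graded body of evidence and one column per rater.

```python
import numpy as np

def icc_oneway(ratings):
    """ICC(1): one-way random-effects intraclass correlation.

    ratings : 2-D array-like, one row per body of evidence and one
    column per rater (k = 2 for dual grading)."""
    r = np.asarray(ratings, float)
    n, k = r.shape
    grand = r.mean()
    row_means = r.mean(axis=1)
    # Between-subject and within-subject mean squares from one-way ANOVA
    msb = k * np.sum((row_means - grand) ** 2) / (n - 1)
    msw = np.sum((r - row_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfect agreement between the two raters gives an ICC of 1; disagreement pulls the value down toward (and potentially below) zero.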
To determine the predictive validity, we compared the expected proportion of stable effect estimates (presented in Table 1) with the observed proportion of stable effect estimates for different thresholds from our sample. Statistically, predictive validity can be determined by calculating two characteristics: (1) calibration and (2) discrimination. Calibration refers to the ability to estimate correctly the likelihood of a future event. In our study, calibration is the ability to determine the likelihood that estimates remain stable. Discrimination refers to the ability to differentiate between those that will experience a future event and those that will not. In our study, discrimination is the ability to differentiate between effect estimates that will remain stable and those that will substantially change.46
We determined the calibration of the EPC approach to GRADE with the Hosmer-Lemeshow test47 and its discrimination with the concordance (C) index. Bodies of evidence that remain stable should have higher expected likelihoods than those that do not. The C index compares the expected likelihoods from pairs of observations; in this case, a “pair” consists of one stable and one nonstable effect estimate, as shown below:48

C index = (number of concordant pairs + 0.5 × number of tied pairs) / total number of pairs
Concordant pairs are pairs for which the expected likelihood for the stable body of evidence is higher than the expected likelihood for the nonstable body of evidence. Tied pairs are pairs for which the stable and nonstable bodies of evidence have the same expected likelihood. Higher values for the C index indicate better discrimination. A C index of 0.50 would indicate no discrimination between stable and nonstable bodies of evidence. We conducted all statistical analyses with the rcorr.cens procedure in the Hmisc package in R49 or Microsoft Excel.
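Both measures can be sketched in a few lines. The study itself used the rcorr.cens procedure from R's Hmisc package; the Python functions below are a simplified analogue, with group sizes and expected probabilities standing in for the values that would come from Table 1 and the grading exercise.

```python
from itertools import product
from scipy.stats import chi2

def hosmer_lemeshow(groups):
    """Hosmer-Lemeshow goodness-of-fit statistic (calibration).

    groups : list of (n, observed_stable, expected_probability) tuples,
    one per group (here, one per SOE grade).
    Returns (statistic, p_value) with the conventional df = groups - 2."""
    stat = 0.0
    for n, obs, p in groups:
        exp = n * p
        stat += (obs - exp) ** 2 / (exp * (1 - p))  # per-group contribution
    df = len(groups) - 2
    return stat, chi2.sf(stat, df)

def c_index(stable_likelihoods, nonstable_likelihoods):
    """Concordance index (discrimination) over all stable/nonstable pairs:
    C = (concordant pairs + 0.5 * tied pairs) / total pairs."""
    concordant = tied = pairs = 0
    for s, ns in product(stable_likelihoods, nonstable_likelihoods):
        pairs += 1
        if s > ns:
            concordant += 1       # stable estimate ranked higher: concordant
        elif s == ns:
            tied += 1             # equal expected likelihoods: tied
    return (concordant + 0.5 * tied) / pairs
```

When observed counts exactly match the expected probabilities, the Hosmer-Lemeshow statistic is zero (perfect calibration); a C index of 1.0 means every stable estimate outranks every nonstable one, while 0.50 means no discrimination at all.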