
Gartlehner G, Dobrescu A, Evans TS, et al. Assessing the Predictive Validity of Strength of Evidence Grades: A Meta-Epidemiological Study [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2015 Sep.


Results

Of 160 bodies of evidence, researchers dually graded 11 percent (n=17) as high, 42 percent (n=68) as moderate, 32 percent (n=51) as low, and 15 percent (n=24) as insufficient (very low) SOE. The inter-rater reliability was 0.56 (95% CI, 0.40 to 0.68), indicating moderate agreement among researchers assigning SOE grades.
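The inter-rater reliability above is a chance-corrected agreement statistic; the report does not reproduce the formula here, but a measure of this kind (Cohen's kappa for two raters) can be sketched in a few lines of pure Python. The ratings below are hypothetical, chosen only to illustrate the calculation; they are not the study's data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Raw proportion of items on which the raters agreed
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal grade frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical SOE grades assigned by two reviewers to ten bodies of evidence
a = ["high", "moderate", "moderate", "low", "insufficient",
     "moderate", "low", "low", "high", "moderate"]
b = ["high", "moderate", "low", "low", "insufficient",
     "moderate", "moderate", "low", "moderate", "moderate"]
print(round(cohens_kappa(a, b), 2))  # prints 0.56
```

A kappa near 0.56, as here, falls in the range conventionally read as moderate agreement.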

Concordance Between Expected and Observed Proportions of Stable Effect Estimates

For each grade, we compared the expected proportions of stable effect estimates with the observed proportions from our sample, using three different definitions of stability (see Methods and Table 2). Table 1 presents the proportions of estimates that producers and users of systematic reviews expected to remain stable for each SOE grade.

Overall, except for moderate SOE, the expected and observed proportions of stable estimates differed considerably regardless of the definition used. Fewer estimates graded as high SOE in our sample remained stable than producers and users of systematic reviews expected: in our survey, 208 experts expected high-SOE outcomes to remain stable in at least 86 percent of cases.6 In our sample, the observed proportions of stable estimates for definitions 1, 2, and 3 were 71 percent, 76 percent, and 76 percent, respectively. Conversely, substantially more estimates graded as low or insufficient SOE remained stable than expected. Table 4 presents expected and observed proportions of stable effect estimates by grade of SOE for each of the three definitions of stability.

Table 4. Comparison of expected with observed proportions of stable effect estimates for different definitions of stability.

Figures 2, 3, and 4 illustrate the overlap of the expected proportions of stable effects (large black boxes) with the confidence intervals (CIs) of the observed proportions (grey columns) for each grade of SOE and each definition of stability. The circles within the columns mark the point estimates. The y-axis delineates the proportion of estimates that remained stable; the x-axis presents the four grades of SOE. For insufficient SOE, for example, producers and users of systematic reviews expected 0 percent to 33 percent of estimates to remain stable as new studies were added to the evidence base. Yet for definition 1, the most rigorous of the three definitions of stability, more than half (54 percent) of effect estimates graded as insufficient remained stable; the CI ranged from 33 percent to 74 percent, barely overlapping the expected range for insufficient SOE. For the less rigorous definitions 2 and 3, the CIs did not overlap at all with the range that producers and users of systematic reviews expected for insufficient SOE grades. By contrast, the observed proportions of stable results for moderate SOE grades were concordant with expectations for all three definitions; the CIs overlapped widely with the range of expected proportions. Estimates graded as low SOE showed some concordance for definitions 1 and 3 but little for definition 2.
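The overlap judgments above rest on a confidence interval for each observed binomial proportion. One common choice is the Wilson score interval; a minimal sketch follows, using illustrative counts (12 of 17 high-SOE estimates stable, roughly the 71 percent reported for definition 1; the exact count is an assumption for illustration).

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Illustrative: 12 of 17 high-SOE estimates remained stable (~71 percent)
lo, hi = wilson_ci(12, 17)
print(f"{lo:.2f}-{hi:.2f}")  # prints 0.47-0.87
```

With only 17 bodies of evidence graded as high SOE, the interval is necessarily wide, which is why the figures' grey columns span large portions of the y-axis for that grade.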

Figure 2. Comparison of expected proportions of stable effect estimates with confidence intervals of observed proportions for different definitions of stability—Definition 1.

Figure 3. Comparison of expected proportions of stable effect estimates with confidence intervals of observed proportions for different definitions of stability—Definition 2.

Figure 4. Comparison of expected proportions of stable effect estimates with confidence intervals of observed proportions for different definitions of stability—Definition 3.

Predictive Validity of the EPC Approach to GRADE

To determine the predictive validity of the EPC approach to GRADE, we assessed its calibration (how accurately it predicts the likelihood that effect estimates will remain stable as new evidence accrues) and its discrimination (how accurately it differentiates between effect estimates that will remain stable and those that will change substantially). In theory, an ideal predictive tool would reliably identify estimates with a high likelihood of remaining stable and always grade them as high SOE; conversely, it would always grade effect estimates with a very low likelihood of remaining stable as insufficient. Such an ideal tool would have perfect calibration and a C index of 1.

Overall, regardless of the definition used, the calibration of the EPC approach to GRADE was suboptimal. When we compared observed proportions of stable effect estimates with lower, middle, and upper values of the ranges of expected proportions, eight of nine comparisons were statistically significantly different based on the Hosmer-Lemeshow test (Table 5), indicating a lack of calibration.
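The grouped chi-square comparison underlying a Hosmer-Lemeshow-style test can be sketched as follows. The per-grade counts and expected proportions below are hypothetical, chosen only to show the mechanics of the calculation; they are not the study's data.

```python
# Hypothetical per-grade data: (n, number observed stable, expected proportion)
groups = {
    "high":         (17, 12, 0.86),
    "moderate":     (68, 48, 0.71),
    "low":          (51, 33, 0.50),
    "insufficient": (24, 13, 0.20),
}

def hosmer_lemeshow_stat(groups):
    """Hosmer-Lemeshow-style chi-square over pre-defined groups (the SOE grades)."""
    stat = 0.0
    for n, observed, p in groups.values():
        # Sum (O - E)^2 / E over both outcomes: stable and changed
        stat += (observed - n * p) ** 2 / (n * p)
        stat += ((n - observed) - n * (1 - p)) ** 2 / (n * (1 - p))
    return stat

stat = hosmer_lemeshow_stat(groups)
print(round(stat, 2))
```

With four groups, the statistic is conventionally referred to a chi-square distribution with g − 2 = 2 degrees of freedom (5 percent critical value 5.99); a larger statistic indicates miscalibration, i.e., observed stability departing from the expected proportions.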

Table 5. Results of Hosmer-Lemeshow tests for different expected and observed proportions of stability.

Likewise, the C indices for the EPC approach to GRADE were low, with values close to that expected by chance (i.e., a C index of 0.50). For definitions 1, 2, and 3, the C indices were 0.57 (95% CI, 0.50 to 0.67), 0.56 (95% CI, 0.47 to 0.66), and 0.58 (95% CI, 0.50 to 0.67), respectively. The C indices for definitions 1 and 3 reached statistical significance (their CIs did not cross 0.5). Taking the uncertainty of the confidence intervals into account, these results mean that in the worst case (the lower CI limits), the EPC approach to GRADE has no discriminatory ability to distinguish between effect estimates with a low or a high likelihood of remaining stable; in the best case (the upper CI limits), it can do so accurately in 67 percent of cases.
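The C index can be read as the proportion of usable pairs (one stable and one unstable estimate) in which the higher-graded estimate is the stable one, with tied grades counting as half-concordant. A small sketch with illustrative, made-up records:

```python
from itertools import combinations

# Higher rank = higher SOE grade
rank = {"insufficient": 0, "low": 1, "moderate": 2, "high": 3}

def c_index(records):
    """Concordance index over (grade, stayed_stable) records.

    Usable pairs contain one stable and one unstable estimate; the pair is
    concordant when the stable estimate carries the higher grade, and ties
    in grade contribute half a concordance.
    """
    concordant = ties = usable = 0
    for (g1, s1), (g2, s2) in combinations(records, 2):
        if s1 == s2:
            continue  # pair is unusable: both stable or both unstable
        usable += 1
        stable_grade = g1 if s1 else g2
        unstable_grade = g2 if s1 else g1
        if rank[stable_grade] > rank[unstable_grade]:
            concordant += 1
        elif rank[stable_grade] == rank[unstable_grade]:
            ties += 1
    return (concordant + 0.5 * ties) / usable

# Illustrative records, not the study's data
data = [("high", True), ("high", True), ("moderate", True), ("moderate", False),
        ("low", False), ("low", True), ("insufficient", False)]
print(round(c_index(data), 2))
```

A value of 0.50 means grades carry no information about which estimates will stay stable; the study's observed values of 0.56 to 0.58 sit only slightly above that chance level.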

The low overall predictive validity, however, is caused primarily by the discordance of expected and observed proportions of stable effect estimates for high and insufficient SOE. In a post-hoc sensitivity analysis, we chose proportions within the expected ranges (Table 1) that were closest to the observed proportions of stable effect estimates. Using expected proportions of 86 percent for high (lower end of expected range), 71 percent for moderate, 60 percent for low, and 33 percent for insufficient SOE (both upper end of expected range), we found that the EPC approach to GRADE achieved satisfactory calibration for definitions 1 and 3 (Table 5).
