NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Pignone M, Gaynes BN, Rushton JL, et al. Screening for Depression [Internet]. Rockville (MD): Agency for Healthcare Research and Quality (US); 2002 Apr. (Systematic Evidence Reviews, No. 6.)

  • This publication is provided for historical reference only and the information may be out of date.

Appendix C Grading System

Criteria for Grading the Internal Validity of Individual Studies

Introduction

The Methods Work Group for the U.S. Preventive Services Task Force (USPSTF) developed a set of criteria by which the quality of individual studies could be evaluated in terms of both internal validity and external validity. The USPSTF accepted the criteria, and the associated definitions of quality categories, that relate to internal validity at its September 1999 quarterly meeting.

This document describes the criteria relating to internal validity and the procedures that topic teams will follow for all updates and new assessments in making these judgments. The overall evaluation for each study is recorded in the Evidence Tables in Appendix D.

All topic teams will use initial "filters" to select studies for review that deal most directly with the question at issue and that are applicable to the population at issue. Thus, studies of any design that use outdated technology or that use technology that is not feasible for primary care practice may be filtered out before the abstraction stage, depending on the topic and the decisions of the topic team. The teams will justify such exclusion decisions if there could be reasonable disagreement about this step. The criteria below are meant for those studies that pass this initial filter.

Design-Specific Criteria and Quality Category Definitions

Presented below is a set of minimal criteria for each study design, followed by a general definition of 3 categories, "good," "fair," and "poor," based on those criteria. These specifications are not meant to be rigid rules but rather general guidelines; individual exceptions can be made when they are explicitly explained and justified. In general, a "good" study is one that meets all criteria well. A "fair" study is one that does not meet (or it is not clear that it meets) at least 1 criterion but has no known "fatal flaw." "Poor" studies have at least 1 fatal flaw.
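The general rule in the preceding paragraph can be sketched in Python. This is only an illustration under stated assumptions: the `Criterion` labels and the `rate_study` function are hypothetical names, not part of the USPSTF method, and the source stresses that these are guidelines open to justified exceptions rather than a mechanical formula.

```python
from enum import Enum


class Criterion(Enum):
    """A reviewer's judgment on one criterion (hypothetical labels)."""
    MET = "met"
    NOT_MET = "not met"
    UNCLEAR = "unclear"
    FATAL_FLAW = "fatal flaw"


def rate_study(judgments):
    """Map per-criterion judgments to "good", "fair", or "poor".

    `judgments` is a dict of criterion name -> Criterion. Deciding what
    counts as a fatal flaw is left to the reviewer; this function only
    combines judgments that have already been made.
    """
    if not judgments:
        raise ValueError("at least one criterion must be judged")
    values = list(judgments.values())
    if any(v is Criterion.FATAL_FLAW for v in values):
        return "poor"  # at least 1 fatal flaw
    if all(v is Criterion.MET for v in values):
        return "good"  # meets all criteria
    return "fair"      # some criterion not met or unclear, no fatal flaw
```

A topic team that makes an explicitly justified exception would simply override the returned category.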

Systematic Reviews

Criteria

  • Comprehensiveness of sources considered/search strategy used
  • Standard appraisal of included studies
  • Validity of conclusions
  • Recency and relevance are especially important for systematic reviews

Definition of ratings from above criteria

Good

Recent, relevant review with comprehensive sources and search strategies; explicit and relevant selection criteria; standard appraisal of included studies; and valid conclusions.

Fair

Recent, relevant review that is not clearly biased but lacks comprehensive sources and search strategies.

Poor

Outdated, irrelevant, or biased review without systematic search for studies, explicit selection criteria, or standard appraisal of studies.

Case-Control Studies

Criteria

  • Accurate ascertainment of cases
  • Nonbiased selection of cases/controls with exclusion criteria applied equally to both
  • Response rate
  • Diagnostic testing procedures applied equally to each group
  • Measurement of exposure accurate and applied equally to each group
  • Appropriate attention to potential confounding variables

Definition of ratings based on criteria above

Good

Appropriate ascertainment of cases and nonbiased selection of case and control participants; exclusion criteria applied equally to cases and controls; response rate equal to or greater than 80 percent; diagnostic procedures and measurements accurate and applied equally to cases and controls; and appropriate attention to confounding variables.

Fair

Recent, relevant, without major apparent selection or diagnostic work-up bias but with response rate less than 80 percent or attention to some but not all important confounding variables.

Poor

Major selection or diagnostic work-up biases, response rates less than 50 percent, or inattention to confounding variables.
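The response-rate and confounding thresholds in the three case-control definitions above can be read as a small decision rule. The Python sketch below is a hypothetical illustration: the function name, its parameters, and the "all"/"some"/"none" encoding of attention to confounders are assumptions. It also assumes the remaining criteria (accurate case ascertainment, equal application of exclusion criteria and measurements) have been satisfied.

```python
def rate_case_control(response_rate, major_bias, confounding):
    """Rate a case-control study from three of the listed criteria.

    response_rate: fraction between 0 and 1
    major_bias:    True if a major selection or diagnostic work-up
                   bias is present
    confounding:   "all", "some", or "none", describing how many
                   important confounders received attention
                   (hypothetical encoding)
    """
    if not 0.0 <= response_rate <= 1.0:
        raise ValueError("response_rate must be between 0 and 1")
    if confounding not in ("all", "some", "none"):
        raise ValueError("confounding must be 'all', 'some', or 'none'")

    # "Poor": major bias, response rate under 50 percent, or
    # inattention to confounding variables.
    if major_bias or response_rate < 0.50 or confounding == "none":
        return "poor"
    # "Good": response rate of at least 80 percent and appropriate
    # attention to confounders.
    if response_rate >= 0.80 and confounding == "all":
        return "good"
    # Otherwise "fair": response rate under 80 percent, or only some
    # important confounders addressed.
    return "fair"
```

For example, a study with a 65 percent response rate, no major bias, and full attention to confounders would come out as "fair" under this reading.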

Randomized Controlled Trials and Cohort Studies

Criteria

  • Initial assembly of comparable groups: for RCTs, adequate randomization, including concealment and whether potential confounders were distributed equally among groups; for cohort studies, consideration of potential confounders with either restriction or measurement for adjustment in the analysis, and consideration of inception cohorts
  • Maintenance of comparable groups (includes attrition, cross-overs, adherence, contamination)
  • Important differential loss to follow-up or overall high loss to follow-up
  • Measurements: equal, reliable, and valid (includes masking of outcome assessment)
  • Clear definition of interventions
  • All important outcomes considered
  • Analysis: adjustment for potential confounders for cohort studies, or intention to treat analysis for RCTs.

Definition of ratings based on above criteria

Good

Meets all criteria: Comparable groups are assembled initially and maintained throughout the study (follow-up at least 80 percent); reliable and valid measurement instruments are used and applied equally to the groups; interventions are spelled out clearly; all important outcomes are considered; and confounders receive appropriate attention in the analysis. In addition, for RCTs, intention to treat analysis is used.

Fair

Studies will be graded "fair" if any or all of the following problems occur, without the fatal flaws noted in the "poor" category below: Generally, comparable groups are assembled initially but some question remains whether some (although not major) differences occurred with follow-up; measurement instruments are acceptable (although not the best) and generally applied equally; some but not all important outcomes are considered; and some but not all potential confounders are accounted for. Intention to treat analysis is done for RCTs.

Poor

Studies will be graded "poor" if any of the following fatal flaws exists: Groups assembled initially are not close to being comparable or maintained throughout the study; unreliable or invalid measurement instruments are used or not applied at all equally among groups (including not masking outcome assessment); and key confounders are given little or no attention. For RCTs, intention to treat analysis is lacking.

Diagnostic Accuracy Studies

Criteria

  • Screening test relevant, available for primary care, adequately described
  • Study uses a credible reference standard, performed regardless of test results
  • Reference standard interpreted independently of screening test
  • Handles indeterminate results in a reasonable manner
  • Spectrum of patients included in study
  • Sample size
  • Administration of reliable screening test

Definition of ratings based on above criteria

Good

Evaluates relevant available screening test; uses a credible reference standard; interprets reference standard independently of screening test; assesses reliability of test; has few indeterminate results or handles them in a reasonable manner; includes a large number (more than 100) of broad-spectrum patients with and without disease.

Fair

Evaluates relevant available screening test; uses reasonable although not best standard; interprets reference standard independently of screening test; has moderate sample size (50 to 100 subjects) and a "medium" spectrum of patients.

Poor

Has fatal flaw such as: Uses inappropriate reference standard; screening test improperly administered; biased ascertainment of reference standard; very small sample size or very narrow selected spectrum of patients.
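The sample-size bands in the diagnostic accuracy definitions above can be written as a simple lookup. This Python sketch is an assumption-laden illustration: `sample_size_band` is a hypothetical helper, and because the source gives no numeric cutoff for a "very small" sample, the function returns `None` below 50 rather than guessing where "poor" begins. Sample size is also only one of several criteria, so the band alone does not determine a study's rating.

```python
def sample_size_band(n):
    """Return the rating band suggested by sample size alone.

    n: number of subjects in the diagnostic accuracy study.
    Returns "good" for more than 100 subjects, "fair" for 50 to 100,
    and None below 50, where the source says only that a "very small"
    sample is a fatal flaw without giving a number.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if n > 100:
        return "good"
    if n >= 50:
        return "fair"
    return None  # no numeric threshold stated; reviewer judgment needed
```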

Criteria for Grading Linkages in the Analytic Framework

Introduction

As noted in the previous document in this Appendix, the Methods Work Group for the U.S. Preventive Services Task Force (USPSTF) developed a set of criteria by which the quality of individual studies could be evaluated in terms of both internal validity and external validity. The Methods Work Group also developed definitions and criteria for judging the strength or quality of evidence for key questions -- ie, linkages in the analytic frameworks -- for the topics of systematic evidence reviews. These quality criteria were discussed at the May 1999 quarterly meeting and were essentially adopted for use by the Evidence-based Practice Centers in developing their first set of systematic evidence reviews. This document describes the criteria relating specifically to linkages in the analytic framework.1

Linkage Category Definitions

The rating scheme for grading the evidence for a linkage in the analytic framework rests on 3 classes of criteria: aggregate internal validity, aggregate external validity, and consistency or coherence. The Methods Work Group did not establish set formulae for arriving at any linkage score for these criteria sets. As with the criteria for quality of individual articles, they are intended to be applied as general guidelines, and the judgments are made implicitly. Judgments can be made about evidence of benefits and evidence of harms. In addition, a summative grade -- ie, an overall rating -- combining the evaluations of the 3 categories defined below can be given.

Also, as with the criteria for individual studies, these 3 categories can be labeled as "good," "fair," or "poor." That is, the linkages can be understood to be supported by good evidence, fair evidence, or poor evidence. The summative, overall rating can also range from good to poor.

Aggregate Internal Validity

This category refers to the overall extent to which data are valid for conditions addressed within studies. It would be rated according to quality grading information about individual studies.

Aggregate External Validity

This category concerns the generalizability of evidence to questions addressed by the linkage. This would include the concordance between populations, interventions and outcomes in the studies reviewed and those to which the linkage pertains. In short, this category reflects the applicability of the evidence to real-world conditions.

It is expected that differences between conditions examined in studies and those addressed by the linkages should be considered if they could potentially influence outcomes. These might include (but not necessarily be limited to): (a) biologic or pathologic characteristics; (b) incidence and prevalence of clinical conditions; (c) distribution of comorbid conditions that might affect outcomes; and (d) likelihood of acceptability and adherence on the part of patients or providers (or both).

Consistency

This category relates to the overall "coherence" of the body of evidence relating to the linkage. Specifically, it includes the number of studies, the homogeneity of those studies (in terms of clinical conditions, populations, settings, and the like), the level of precision of findings in the studies, and the direction of results. In addition, it can include dose-response relationships.

Footnotes

1. The USPSTF is developing a separate set of criteria for rating its recommendations about an entire preventive service, including policies for appropriate extrapolation to populations or settings not reflected in the reviewed literature, but because the SERs do not contain USPSTF recommendations, those ways of grading recommendations are not dealt with here.
