
Institute for Quality and Efficiency in Health Care. General Methods [Internet]. Version 4.2. Cologne, Germany: Institute for Quality and Efficiency in Health Care (IQWiG); 2015 Apr 22.


8 Assessment of information

In research the term “bias” means a systematic deviation between research results and the “truth” [473]. For example, this may refer to an overestimation (or underestimation) of a treatment effect.

A main objective in the benefit assessment of medical services is to estimate the actual effect of therapies and interventions as reliably and unbiasedly as possible. In order to minimize bias in the benefit assessment of medical services, different approaches are adopted internationally; these include using scientifically robust methods, ensuring wide participation in the relevant studies, as well as avoiding conflicts of interest [105]. All these methods also form the legal basis of the Institute's work.

8.1. Quality assessment of individual studies

8.1.1. Criteria for study inclusion

The problem often arises that studies relevant to a benefit assessment do not completely fulfil the inclusion criteria for the patient population and/or the test and comparator intervention defined in the systematic review. In this case the Institute usually proceeds according to the following criteria:

For the inclusion criterion with regard to the study population, it suffices if at least 80% of the patients included in the study fulfil this criterion. Corresponding subgroup analyses are drawn upon if they are available in such studies. Studies in which the inclusion criterion for the study population is fulfilled in less than 80% of the patients included in the study are only included in the analysis if corresponding subgroup analyses are available, or if it has been demonstrated with sufficient plausibility or has been proven that the findings obtained from this study are applicable to the target population of the systematic review (see Section 3.3.1 for applicability).

Studies are also included in which at least 80% of patients fulfil the inclusion criterion regarding the test intervention (intervention group of the study) and at least 80% fulfil the inclusion criterion regarding the comparator intervention (comparator group of the study). If 1 of the 2 criteria is violated in a study, it is excluded from the benefit assessment.
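As a purely illustrative sketch of how these thresholds could be checked (the function and data fields are hypothetical and not part of the Institute's procedures; the alternative route of demonstrating applicability to the target population is omitted for brevity):

```python
# Minimal sketch (hypothetical helper) of the 80% inclusion-criteria check
# described above; thresholds and field names are illustrative assumptions.

def meets_inclusion_thresholds(study, threshold=0.80):
    """Return True if a study satisfies the 80% rules for population,
    test intervention, and comparator intervention."""
    population_ok = (
        study["n_population_criterion_met"] / study["n_total"] >= threshold
        or study["has_relevant_subgroup_analysis"]
    )
    test_ok = (
        study["n_test_criterion_met"] / study["n_test_group"] >= threshold
    )
    comparator_ok = (
        study["n_comparator_criterion_met"] / study["n_comparator_group"] >= threshold
    )
    # Violation of either the test or the comparator criterion excludes the study.
    return population_ok and test_ok and comparator_ok


example_study = {
    "n_total": 500, "n_population_criterion_met": 430,
    "has_relevant_subgroup_analysis": False,
    "n_test_group": 250, "n_test_criterion_met": 240,
    "n_comparator_group": 250, "n_comparator_criterion_met": 205,
}
print(meets_inclusion_thresholds(example_study))  # True (all shares >= 80%)
```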

8.1.2. Relationship between study type and research question

Only the most relevant study designs that play a role in benefit assessments in medical research (depending on the research question posed) are summarized here.

It is primarily the inclusion of a control group that is called for in the benefit assessment of interventions. In a design with dependent samples without a control group, proof of the effect of an intervention cannot usually be inferred from a pure “before-after” comparison. Exceptions include diseases with a deterministic (or practically deterministic) course (e.g. ketoacidotic diabetic coma; see Section 3.2.2). Randomization and blinding are quality criteria that increase the evidential value of controlled studies. Parallel group studies [442], cross-over studies [314], and cluster randomized studies [155] are common designs for clinical trials. If interim analyses are planned, the use of appropriate sequential designs must be considered [590].

Case reports or case series often provide initial information on a topic. These are susceptible to all kinds of bias, so that, depending on the research question, only limited reliable evidence can be inferred from this type of study. The prevalence of diseases can be estimated from population-based cross-sectional studies. Other fundamental and classical study types in epidemiology are case-control studies [59] to investigate the association between exposures and the occurrence of rare diseases, as well as cohort studies [60] to investigate the effect of an exposure over time. Cohort studies designed for this purpose are prospective, although retrospective cohort studies are also conducted in which past exposure is recorded (this type of study is frequently found in occupational or pharmacological epidemiology). In principle, prospective designs are preferable to retrospective designs. However, case-control studies, for example, are frequently the only feasible way of obtaining information on associations between exposures and rare diseases. Newer study designs in modern epidemiology contain elements of both case-control and cohort studies and can no longer be clearly classified as retrospective or prospective [317].

Diagnostic and screening studies may have very different aims, so that the assessment depends on the choice of an appropriate design (see Sections 3.5 and 3.6).

8.1.3. Ranking of different study types/evidence levels

Different approaches exist within the framework of systematic reviews or guideline development for allocating specific evidence levels to particular study types [237,242]. These levels can be used to create a ranking with regard to the validity of evidence from different study types. However, no system of evidence assessment currently exists that is generally accepted and universally applicable to all systematic reviews [318,588]. Due to the complexity of the appraisal of studies, no conclusive judgement on quality can be inferred from the hierarchy of evidence [24,599]. In general, the Institute follows the rough hierarchy of study types, which is widely accepted and is also largely consistent with the evidence classification of the G-BA [211], and has been incorporated in the regulation on the benefit assessment of drugs according to §35a SGB V [80]. At least for the evaluation of intervention effects, the highest evidence level is allocated to RCTs and systematic reviews of RCTs. In some classifications, individual RCTs are further graded into those of higher or lower quality (see Section 3.1.4).

However, at the latest in the classification of non-randomized studies with regard to their risk of bias, the study design alone can no longer provide sufficient orientation [234,261,576], even if the basic distinction between comparative and non-comparative studies seems meaningful. As described in Section 3.8, in the classification of non-randomized studies, besides other design aspects the Institute will primarily evaluate the control for potential confounders. However, this grading refers to the risk of bias (see Section 8.1.4) and not to the evidence level of the study.

8.1.4. Aspects of the assessment of the risk of bias

One main aspect of the interpretation of study results is the assessment of the risk of bias (see qualitative uncertainty of results, Section 3.1.4). In this context, the research question, the study type and design, and the conduct of the study play a role, as well as the availability of information. The risk of bias is substantially affected by the study quality; however, its assessment is not equivalent to the quality assessment of a study. For example, individual outcomes may also be considerably biased in a high-quality study. Other studies, however, may provide high certainty of results for specific outcomes in individual cases, despite being of low quality. As a rule, the Institute will therefore estimate the extent of the risk of bias in a problem-orientated manner for all relevant results (both for the study and the specific outcomes).

In principle, a recognized standardized concept should be followed in a study, from planning through conduct and data analysis to reporting. This includes a study protocol describing all the important methods and procedures. For (randomized) clinical trials, the usual standards are defined by the basic principles of good clinical practice (GCP) [299,331]; for epidemiological studies, they are defined by guidelines and recommendations to ensure good epidemiological practice (GEP) [132]. In this context, a key criterion to avoid bias is whether the study was actually analysed in the way planned. This cannot usually be reliably concluded from the relevant publications. However, a section on sample size planning may at least provide indications in this regard. In addition, a comparison with the study protocol (possibly previously published) or with the corresponding publication on the study design is useful.

The following important documents were developed to improve the quality of publications:

the CONSORT statement on RCTs [496] and the corresponding explanatory document [396]

a proposal for an extension of the CONSORT statement for randomized studies on non-drug interventions [55] and the corresponding explanatory document [54]

the CONSORT statement on cluster-randomized trials [93]

the CONSORT statement on the documentation of adverse events [302]

the CONSORT statement on non-inferiority and equivalence studies [441]

the CONSORT statement on pragmatic studies [604]

the CONSORT PRO extension for patient-reported outcomes [91]

the PRISMA statement on meta-analyses of randomized trials [397] and the corresponding explanatory document [357]

the TREND statement on non-randomized intervention trials [128]

the STROBE statement for observational studies in epidemiology [579] and the corresponding explanatory document [570]

the MOOSE checklist for meta-analysis of observational studies in epidemiology [534]

the STARD statement on diagnostic studies [52] and the corresponding explanatory document [53]

the ISOQOL reporting standards for patient-reported outcomes [75]

If a publication fails to conform to these standards, this may be an indicator of an increased risk of bias of the results of the relevant study. Additional key publications on this issue describe fundamental aspects concerning the risk-of-bias assessment [165,236,264].

Key aspects of the Institute's risk-of-bias assessment of the results of RCTs comprise

adequate concealment, i.e. the unforeseeability and concealment of allocation to groups (e.g. by external randomization in trials that cannot be blinded)

blinded outcome assessment in trials where blinding of physicians and patients is not possible

appropriate application of the “intention-to-treat” (ITT) principle

There must be a more cautious interpretation of the results of unblinded trials, or of trials where unblinding (possibly) occurred, compared with the interpretation of blinded studies. Randomization and the choice of appropriate outcome variables are important instruments to prevent bias in studies where a blinding of the intervention was not possible. In studies that cannot be blinded, it is crucial to ensure adequate concealment of the allocation of patients to the groups to be compared. It is also necessary that the outcome variable is independent of the (non-blinded) treating staff or assessed in a blinded manner independent of the treating staff (blinded assessment of outcomes). If a blinded assessment of outcome measures is not possible, a preferably objective outcome should be chosen which can be influenced as little as possible (with regard to its dimension and the stringency of its recording) by the (non-blinded) person assessing it.

In the production of reports, standardized assessment forms are generally used to assess the risk of bias of study results. As a rule, for controlled studies on the benefit assessment of interventions, the following items, both across outcomes and specific to individual outcomes, are considered in particular:

Items across outcomes:

appropriate generation of a randomization sequence (in randomized studies)

allocation concealment (in randomized studies)

temporal parallelism of the intervention groups (in non-randomized studies)

comparability of intervention groups and appropriate consideration of prognostically relevant factors (in non-randomized studies)

blinding of patients and treating staff/staff responsible for follow-up treatment

reporting of all relevant outcomes independent of results

Outcome-specific items:

blinding of outcome assessors

appropriate implementation of the ITT principle

reporting of individual outcomes independent of results

On the basis of these aspects, in randomized studies the risk of bias is summarized and classified as “high” or “low”. A low risk of bias is present if it can be excluded with great probability that the results are relevantly biased. Relevant bias is understood to be a change in the basic message of the results if the bias were to be corrected.

In the assessment of an outcome, the risk of bias across outcomes is initially classified as “high” or “low”. If classified as “high”, the risk of bias for the outcome is also usually classified as “high”. Apart from that, the outcome-specific items are taken into account.
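The decision logic described above can be summarized in the following minimal sketch (hypothetical function and parameter names; in practice the classification is a reviewer judgement based on the standardized assessment forms, not an automated rule):

```python
# Illustrative sketch of the outcome-level risk-of-bias classification
# described above; in practice this is a reviewer judgement, not a formula.

def outcome_risk_of_bias(cross_outcome_high: bool,
                         outcome_specific_concerns: bool) -> str:
    """Classify the risk of bias of a result for a single outcome."""
    if cross_outcome_high:
        # A "high" cross-outcome classification usually carries over.
        return "high"
    # Otherwise the outcome-specific items (blinding of outcome assessors,
    # ITT implementation, outcome-specific reporting) decide.
    return "high" if outcome_specific_concerns else "low"


print(outcome_risk_of_bias(cross_outcome_high=False,
                           outcome_specific_concerns=True))  # high
```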

The classification as “high” of the risk of bias of the result for an outcome does not lead to exclusion from the benefit assessment. This classification rather serves the discussion of heterogeneous study results and affects the certainty of the conclusion.

No summarizing risk-of-bias assessment is usually performed for non-randomized comparative studies, as their results generally carry a high risk of bias due to the lack of randomization. The Institute specifically deviates from this procedure in assessments of the potential of new examination and treatment methods (see Section 3.8).

If a project of the Institute involves the assessment of older studies that do not satisfy current quality standards because they were planned and conducted at a time when these standards did not exist, then the Institute will present the disadvantages and deficiencies of these studies and discuss possible consequences. A different handling of these older studies compared with the handling of newer studies that have similar quality deficits is however only necessary if this is clearly justifiable from the research question posed or other circumstances of the assessment.

The assessment of formal criteria provides essential information on the risk of bias of the results of studies. However, the Institute always conducts a risk-of-bias assessment that goes beyond purely formal aspects in order, for example, to present errors and inconsistencies in publications, and to assess their relevance in the interpretation of results.

8.1.5. Interpretation of composite outcomes

A “composite outcome” comprises a group of events defined by the investigators (e.g. myocardial infarctions, strokes, cardiovascular deaths). In this context the individual events in this group often differ in their severity and relevance for patients and physicians (e.g. hospital admissions and cardiovascular deaths). Therefore, when interpreting composite outcomes one needs to be aware of the consequences thereby involved [111,189,202]. The following explanations describe the aspects to be considered in the interpretation of results. However, they specifically do not refer to a (possibly conclusive) assessment of benefit and harm by means of composite outcomes, if, for example, the potential harm from an intervention (e.g. increase in severe bleeding events) is included in an outcome together with the benefit (e.g. decrease in the rate of myocardial infarctions).

A precondition for consideration of a composite outcome is that the individual components of the composite outcome all represent patient-relevant outcomes defined in the report plan. In this context surrogate endpoints can only be included if they are specifically accepted by the Institute as valid (see Section 3.1.2). The results for every individual event included in a composite outcome should also be reported separately. The components should be of similar severity; this does not mean that they must be of identical relevance. For example, the outcome “mortality” can be combined with “myocardial infarction” or “stroke”, but not with “silent myocardial infarction” or “hospital admission”.

If a composite outcome fulfils the preconditions stated above, then the following aspects need to be considered in the interpretation of conclusions on benefit and harm:

Does the effect of the intervention on the individual components of the composite outcome usually take the same direction?

Was a relevant outcome that would have been suited for inclusion in the composite outcome omitted or excluded without a comprehensible and acceptable justification?

Was the composite outcome defined a priori or introduced post hoc?

Insofar as the available data and data structures allow, sensitivity analyses may be performed by comparing the exclusion versus the inclusion of individual components.

If the relevant preconditions are fulfilled, individual outcomes may be determined and calculated from a composite outcome within the framework of a benefit assessment.

8.1.6. Interpretation of subgroup analyses

In the methodological literature, subgroup analyses are a matter of controversy [22,429]. The interpretation of results of subgroup analyses at a study level is complicated mainly by 3 factors:

No characteristic of proof: Subgroup analyses are rarely planned a priori and are rarely a component of the study protocol (or its amendments). If subgroup analyses with regard to more or less arbitrary subgroup-forming characteristics are conducted post hoc, the results cannot be regarded as a methodologically correct testing of a hypothesis.

Multiple testing: If several subgroups are analysed, results in a subgroup may well reach statistical significance, despite actually being random.

Lack of power: The sample size of a subgroup is often too small to enable the detection of moderate differences (by means of inferential statistics), so that even if effects actually exist, significant results cannot be expected. The situation is different if an adequate power for the subgroup analysis was already considered in the sample size calculation and a correspondingly larger sample size was planned [67].

The results of subgroup analyses should be considered in the assessment, taking the above 3 issues into account; they should not dominate the result of the primary analysis, all the more so if the primary study objective was not achieved. An exception from this rule may apply if social law implications (see below) necessitate such analyses. Moreover, subgroup analyses are not interpretable if the subgroup-forming characteristic was defined after initiation of treatment (after randomization), e.g. in responder analyses. These aspects also play a role in the conduct and interpretation of subgroup analyses within the framework of meta-analyses (see Section 8.3.8).

The statistical demonstration of different effects between various subgroups should be conducted by means of an appropriate homogeneity or interaction test. The finding that a statistically significant effect was observed in one subgroup, but not in another, cannot be interpreted (by means of inferential statistics) as the existence of a subgroup effect.
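For the simple case of two subgroups with effect estimates on an appropriate scale (e.g. log odds ratios) and their standard errors, such an interaction test can be illustrated as a z-test for the difference of the subgroup effects; the following sketch is a generic textbook illustration with hypothetical numbers, not the Institute's prescribed implementation:

```python
# Sketch of a simple interaction (homogeneity) test between two subgroups,
# based on effect estimates (e.g. log odds ratios) and their standard errors.
from math import sqrt
from scipy.stats import norm

def interaction_test(effect_1, se_1, effect_2, se_2):
    """Two-sided z-test for a difference between two subgroup effects."""
    diff = effect_1 - effect_2
    se_diff = sqrt(se_1 ** 2 + se_2 ** 2)
    z = diff / se_diff
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value

# Hypothetical example: log odds ratios of -0.50 and -0.10
z, p = interaction_test(-0.50, 0.15, -0.10, 0.20)
print(f"z = {z:.2f}, p = {p:.3f}")
```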

Analyses of subgroups defined a priori represent the gold standard for subgroup analyses, where stratified randomization by means of subgroups and appropriate statistical methods for data analysis (homogeneity test, interaction test) are applied [114].

Despite the limitations specified above, for some research questions subgroup analyses may represent the best scientific evidence available in the foreseeable future in order to assess effects in subgroups [200], since factors such as ethical considerations may argue against the reproduction of findings of subgroup analyses in a validation study. Rothwell [458] presents an overview of reasons for conducting subgroup analyses. Sun et al. [536] identified criteria to assess the credibility of subgroup analyses.

Possible heterogeneity of an effect in different, clearly distinguishable patient populations is an important reason for conducting subgroup analyses [335,458]. If a-priori information is available on a possible effect modifier (e.g. age, pathology), it is in fact essential to investigate possible heterogeneity in advance with regard to the effect in the various patient groups. If such heterogeneity exists, then the estimated total effect across all patients cannot be interpreted meaningfully [335]. It is therefore important that information on a possible heterogeneity of patient groups is considered appropriately in the study design. It may even be necessary to conduct several studies [228].

Within the framework of systematic reviews, the analysis of heterogeneity between individual studies (and therefore, if applicable, subgroup analyses) is a scientific necessity (see Section 8.3.8), but also a necessity from the perspective of social law, as according to §139a (2) SGB V, the Institute is obliged to consider characteristics specific to age, gender, and life circumstances. In addition, according to the official rationale for the SHI Modernization Act, the Institute is to elaborate in which patient groups a new drug is expected to lead to a relevant improvement in treatment success, with the aim of providing these patients with access to this new drug [134]. A corresponding objective can also be found in §35a SGB V regarding the assessment of the benefit of drugs with new active ingredients [136]. In this assessment, patient groups should be identified in whom these drugs show a therapeutically relevant added benefit.

According to social law, a further necessity for subgroup analyses may arise due to the approval status of drugs. On the one hand, this may be the consequence of a decision by regulatory authorities, which, after balancing the efficacy and risks of a drug, may determine that it will only be approved for part of the patient population investigated in the approval studies. These considerations may also be based on subgroup analyses conducted post hoc. On the other hand, studies conducted after approval may include patient groups for whom the drug is not approved in Germany; the more approvals differ internationally, the more this applies. In such cases, subgroup analyses reflecting the approval status of a drug may need to be used, independently of whether these analyses were planned a priori or conducted post hoc.

8.1.7. Assessment of data consistency

To assess the evidential value of study results, the Institute will review the consistency of data with regard to their plausibility and completeness. Implausible data are not only produced by incorrect reporting of results (typing, formatting, or calculation errors), but also by the insufficient or incorrect description of the methodology, or even by forged or invented data [9]. Inconsistencies may exist within a publication, and also between publications on the same study.

One problem with many publications is the reporting of incomplete information in the methods and results sections. In particular, the reporting of lost-to-follow-up patients, withdrawals, etc., as well as the way these patients were considered in the analyses, are often not transparent.

It is therefore necessary to expose potential inconsistencies in the data. For this purpose, the Institute reviews, for example, calculation steps taken, and compares data presented in text, tables, and graphs. In practice, a common problem in survival-time analyses arises from inconsistencies between the data on lost-to-follow-up patients and those on patients at risk in the survival curve graphs. For certain outcomes (e.g. total mortality), the number of lost-to-follow-up patients can be calculated if the Kaplan-Meier estimates are compared with the patients at risk at a point in time before the minimum follow-up time. Statistical techniques may be useful in exposing forged and invented data [9].
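The following sketch illustrates this consistency check for total mortality with hypothetical numbers, under the simplifying assumption that before the minimum follow-up time patients leave the risk set only through death or loss to follow-up and that losses are few (so that the Kaplan-Meier estimate approximately equals the simple survival proportion):

```python
# Rough consistency check (hypothetical numbers): for total mortality and a
# time point before the minimum follow-up, the Kaplan-Meier estimate implies
# a number of survivors; any shortfall in the reported "at risk" count
# suggests patients lost to follow-up. Approximation assumes losses are the
# only source of censoring before minimum follow-up and are few in number.

def implied_lost_to_follow_up(n_randomized, km_estimate, n_at_risk):
    implied_survivors = n_randomized * km_estimate
    return implied_survivors - n_at_risk

# Example: 400 randomized, KM survival 0.90 at 12 months, 348 reported at risk
print(implied_lost_to_follow_up(400, 0.90, 348))  # ~12 patients unaccounted for
```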

If relevant inconsistencies are found in the reporting of results, the Institute's aim is to clarify these inconsistencies and/or obtain any missing information by contacting authors, for example, or requesting the complete clinical study report and further study documentation. However, it should be considered that firstly, enquiries to authors often remain unanswered, especially concerning older publications, and that secondly, authors' responses may produce further inconsistencies. In the individual case, a weighing-up of the effort involved and the benefit of such enquiries is therefore meaningful and necessary. If inconsistencies cannot be resolved, the potential impact of these inconsistencies on effect sizes (magnitude of bias), uncertainty of results (increase in error probability), and precision (width of the confidence intervals) will be assessed by the Institute. For this purpose, sensitivity analyses may be conducted. If it is possible that inconsistencies may have a relevant impact on the results, this will be stated and the results will be interpreted very cautiously.

8.2. Consideration of systematic reviews

Systematic reviews are publications that summarize and assess the results of primary studies in a systematic, reproducible, and transparent way. This also applies to HTA reports, which normally aim to answer a clinical and/or patient-relevant question. HTA reports also often seek to answer additional questions of interest to contracting agencies and health policy decision makers [156,353,435]. There is no need to differentiate between systematic reviews and HTA reports for the purposes of this section. Therefore, the term “systematic review” also includes HTA reports.

8.2.1. Classification of systematic reviews

Relying on individual scientific studies can be misleading. Looking at one or only a few studies in isolation from other similar studies on the same question can make treatments appear more or less useful than they actually are [1]. High quality systematic reviews aim to overcome this form of bias by identifying, assessing and summarizing the evidence systematically rather than selectively [156,165,216,435].

Systematic reviews identify, assess and summarize the evidence from one or several study types that can provide the best answer to a specific and clearly formulated question. Systematic and explicit methods are used to identify, select and critically assess the relevant studies for the question of interest. If studies are identified, these data are systematically extracted and analysed. Systematic reviews are non-experimental studies whose methodology must aim to minimize systematic errors (bias) on every level of the review process [1,165,264].

For systematic reviews of the effects of medical interventions, RCTs provide the most reliable answers. However, for other questions such as aetiology, prognosis or the qualitative description of patients' experiences, the appropriate evidence base for a systematic review will consist of other primary study types [216]. Systematic reviews of diagnostic and screening tests also show some methodological differences compared with reviews of treatment interventions [122].

In the production of the Institute's reports, systematic reviews are primarily used to identify potentially relevant (primary) studies. However, an IQWiG report can be based partially or even solely on systematic reviews (see Section 8.2.2). Health information produced by the Institute for patients and consumers is to a large part based on systematic reviews. This includes systematic reviews of treatments, and reviews addressing other questions such as aetiology, adverse effects and syntheses of qualitative research (see Section 6.3.3).

The minimal prerequisite for a systematic review on the effects of treatments to be used by the Institute is that it has only minimal methodological flaws according to the Oxman and Guyatt index [309,428,430] or the AMSTAR instrument [505-507]. In addition to considering the strength of evidence investigated in systematic reviews, the Institute will also consider the relevance and applicability of the evidence. This includes investigating the question as to whether the results have been consistent among different populations and subgroups as well as in different healthcare contexts. The following factors are usually considered: the population of the participants in the included studies (including gender and baseline disease risk); the healthcare context (including the healthcare settings and the medical service providers); and the applicability and likely acceptance of the intervention in the form in which it was assessed [47,119].

8.2.2. Benefit assessment on the basis of systematic reviews

A benefit assessment on the basis of systematic reviews can provide a resource-saving and reliable evidence base for recommendations to the G-BA or the Federal Ministry of Health, provided that specific preconditions have been fulfilled [112,348]. In order to use systematic reviews in a benefit assessment these reviews must be of sufficiently high quality, that is, they must

show only a minimum risk of bias

present the evidence base in a complete, transparent, and reproducible manner

and thus allow clear conclusions to be drawn [23,428,594]. In addition, it is an essential prerequisite that the searches conducted in the systematic reviews do not contradict the Institute's methodology and that it is possible to transfer the results to the research question of the Institute's report, taking the defined inclusion and exclusion criteria into account.

The methodology applied must provide sufficient certainty that a new benefit assessment based on primary literature would not reach different conclusions from one based on systematic reviews. For example, this is usually not the case if a relevant amount of previously unpublished data is to be expected.

A. Research questions

In principle, this method is suited for all research questions insofar as the criteria named above have been fulfilled. The following points should be given particular consideration in the development of the research question:

definition of the population of interest

definition of the test intervention and comparator intervention of interest

definition of all relevant outcomes

if appropriate, specification of the health care setting or region affected (e.g. Germany, Europe)

The research question defined in this way also forms the basis for the specification of the inclusion and exclusion criteria to be applied in the benefit assessment, and subsequently for the specification of the relevance of the content and methods of the publications identified. On the basis of the research question, it is also decided which type of primary study the systematic reviews must be based on. Depending on the research question, it is possible that questions concerning certain parts of a commission are answered by means of systematic reviews, whereas primary studies are considered for other parts.

B. Minimum number of relevant systematic reviews

All systematic reviews that are of sufficient quality and relevant to the topic are considered. In order to be able to assess the consistency of results, at least 2 high-quality publications (produced independently of each other) should as a rule be available as the foundation of a report based on systematic reviews. If only one high-quality publication is available and can be considered, then it is necessary to justify the conduct of an assessment based only on this one systematic review.

C. Quality assessment of publications, including minimum requirements

The assessment of the general quality of systematic reviews is performed with Oxman and Guyatt's validated quality index for systematic reviews [427,428,430] or with the AMSTAR instrument [505-507]. According to Oxman and Guyatt's index, systematic reviews are regarded to be of sufficient quality if they have been awarded at least 5 of 7 possible points in the overall assessment, which is performed by 2 reviewers independently of one another. No such threshold is defined for the AMSTAR instrument; if appropriate, one should therefore be defined beforehand. In addition, as a rule, the sponsors of systematic reviews, as well as authors' conflicts of interests, are documented and discussed. Depending on the requirements of the project, the particular index criteria can be supplemented by additional items (e.g. completeness of the search, search for unpublished studies, for example in registries, or additional aspects regarding systematic reviews of diagnostic studies).

D. Results

For each research question, the results of a benefit assessment based on systematic reviews are summarized in tables, where possible. If inconsistent results on the same outcome are evident in several publications, possible explanations for this heterogeneity are described [310].

If the compilation of systematic reviews on a topic indicates that a new benefit assessment on the basis of primary studies could produce different results, then such an assessment will be performed.

E. Conclusion/recommendations

Reports based on systematic reviews summarize the results of the underlying systematic reviews and, if necessary, they are supplemented by a summary of up-to-date primary studies (or primary studies on questions not covered by the systematic reviews). Independent conclusions are then drawn from these materials.

The recommendations made on the basis of systematic reviews are not founded on a summary of the recommendations or conclusions of the underlying systematic reviews. In HTA reports, they are often formulated against the background of the specific socio-political and economic setting of a particular health care system, and are therefore rarely transferable to the health care setting in Germany.

8.2.3. Consideration of published meta-analyses

Following international EBM standards, the Institute's assessments are normally based on a systematic search for relevant primary studies, which is specific to the research question posed. If it is indicated and possible, results from individual studies identified are summarized and evaluated by means of meta-analyses. However, the Institute usually has access only to aggregated data from primary studies, which are extracted from the corresponding publication or the clinical study report provided. Situations exist where meta-analyses conducted on the basis of individual patient data (IPD) from relevant studies have a higher value (see Section 8.3.8). This is especially the case if, in addition to the effect caused solely by the intervention, the evaluation of other factors possibly influencing the intervention effect is also of interest (interaction between intervention effect and covariables). In this context, meta-analyses including IPD generally provide greater certainty of results, i.e. more precise results not affected by ecological bias, when compared with meta-regressions based on aggregated data [514]. In individual cases, these analyses may lead to more precise conclusions, particularly if heterogeneous results exist that can possibly be ascribed to different patient characteristics. However, one can only assume a higher validity of meta-analyses based on IPD if such analyses are actually targeted towards the research question of the Institute's assessment and also show a high certainty of results. The prerequisite for the assessment of the certainty of results of such analyses is maximum transparency; this refers both to the planning and to the conduct of analyses. Generally valid aspects that are relevant for the conduct of meta-analyses are outlined, for example, in a document published by EMA [172]. In its benefit assessments, the Institute considers published meta-analyses based on IPD if they address (sub)questions in the Institute's reports that cannot be answered with sufficient certainty by meta-analyses based on aggregated data. In addition, high certainty of results for the particular analysis is required.

8.3. Specific statistical aspects

8.3.1. Description of effects and risks

The description of intervention or exposure effects needs to be clearly linked to an explicit outcome variable. Consideration of an alternative outcome variable also alters the description and size of a possible effect. The choice of an appropriate effect measure depends in principle on the measurement scale of the outcome variable in question. For continuous variables, effects can usually be described using mean values and differences in mean values (where necessary, after appropriate weighting). For categorical outcome variables, the usual effect and risk measures of 2×2 tables apply [36]. Chapter 9 of the Cochrane Handbook for Systematic Reviews of Interventions [124] provides a well-structured summary of the advantages and disadvantages of typical effect measures. Agresti [6,7] describes the specific aspects to be considered for ordinal data.

It is essential to describe the degree of statistical uncertainty for every effect estimate. For this purpose, the calculation of the standard error and the presentation of a confidence interval are methods frequently applied. Whenever possible, the Institute will state appropriate confidence intervals for effect estimates, including information on whether one- or two-sided confidence limits apply, and on the confidence level chosen. In medical research, the two-sided 95% confidence level is typically applied; in some situations, 90% or 99% levels are used. Altman et al. [13] give an overview of the most common calculation methods for confidence intervals.
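As a generic illustration (with hypothetical event counts, using standard textbook formulas rather than an Institute-specific method), the following sketch computes the risk ratio and odds ratio from a 2×2 table together with two-sided 95% confidence intervals on the log scale:

```python
# Minimal sketch: risk ratio and odds ratio from a 2x2 table with Wald-type
# 95% confidence intervals on the log scale (hypothetical counts).
from math import exp, log, sqrt

def rr_or_with_ci(a, b, c, d, z=1.96):
    """a/b: events/non-events in intervention group; c/d: in control group."""
    rr = (a / (a + b)) / (c / (c + d))
    se_log_rr = sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    or_ = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    ci = lambda est, se: (exp(log(est) - z * se), exp(log(est) + z * se))
    return (rr, ci(rr, se_log_rr)), (or_, ci(or_, se_log_or))

(rr, rr_ci), (odds_ratio, or_ci) = rr_or_with_ci(30, 170, 50, 150)
print(f"RR = {rr:.2f} (95% CI {rr_ci[0]:.2f} to {rr_ci[1]:.2f})")
print(f"OR = {odds_ratio:.2f} (95% CI {or_ci[0]:.2f} to {or_ci[1]:.2f})")
```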

In order to comply with the confidence level, the application of exact methods for the interval estimation of effects and risks should be considered, depending on the particular data situation (e.g. very small samples) and the research question posed. Agresti [8] provides an up-to-date discussion on exact methods.

8.3.2. Evaluation of statistical significance

With the help of statistical significance tests it is possible to test hypotheses formulated a priori with control for type 1 error probability. The convention of speaking of a “statistically significant result” when the p-value is lower than the significance level of 0.05 (p<0.05) may often be meaningful. Depending on the research question posed and hypothesis formulated, a lower significance level may be required. Conversely, there are situations where a higher significance level is acceptable. The Institute will always explicitly justify such exceptions.

A range of aspects should be considered when interpreting p-values. It must be absolutely clear which research question and data situation the significance level refers to, and how the statistical hypothesis is formulated. In particular, it should be evident whether a one- or two-sided hypothesis applies [45] and whether the hypothesis tested is to be regarded as part of a multiple hypothesis testing problem [560]. Both aspects, whether a one- or two-sided hypothesis is to be formulated, and whether adjustments for multiple testing need to be made, are a matter of repeated controversy in scientific literature [185,327].

Regarding the hypothesis formulation, a two-sided test problem is traditionally assumed. Exceptions include non-inferiority studies. The formulation of a one-sided hypothesis problem is in principle always possible, but requires precise justification. In the case of a one-sided hypothesis formulation, the application of one-sided significance tests and the calculation of one-sided confidence limits are appropriate. For better comparability with two-sided statistical methods, some guidelines for clinical trials require that the typical significance level should be halved from 5% to 2.5% [298]. The Institute generally follows this approach. The Institute furthermore follows the central principle that the hypothesis formulation (one- or two-sided) and the significance level must be specified clearly a priori. In addition, the Institute will justify deviations from the usual specifications (one-sided instead of two-sided hypothesis formulation; significance level unequal to 5%, etc.) or consider the relevant explanations in the primary literature.

If the hypothesis investigated clearly forms part of a multiple hypothesis problem, appropriate adjustment for multiple testing is required if the type I error is to be controlled for the whole multiple hypothesis problem [40]. The problem of multiplicity cannot be solved completely in systematic reviews, but should at least be considered in the interpretation of results [37]. If meaningful and possible, the Institute will apply methods to adjust for multiple testing. In its benefit assessments (see Section 3.1), the Institute attempts to control type I errors separately for the conclusions on every single benefit outcome. A summarizing evaluation is not usually conducted in a quantitative manner, so that formal methods for adjustment for multiple testing cannot be applied here either.
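Where such an adjustment is applied, a standard procedure such as the Holm step-down correction may be used; the following sketch with hypothetical p-values is a generic illustration and not a statement of the Institute's preferred method:

```python
# Generic sketch of the Holm step-down adjustment for multiple testing
# (hypothetical p-values; controls the family-wise type I error rate).

def holm_adjust(p_values, alpha=0.05):
    """Return a list of (p, rejected) pairs in the original order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    rejected = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # stop at the first non-rejection
    return list(zip(p_values, rejected))

print(holm_adjust([0.004, 0.030, 0.020, 0.60]))
```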

The Institute does not evaluate a statistically non-significant finding as evidence of the absence of an effect (absence or equivalence) [12]. For the demonstration of equivalence, the Institute will apply appropriate methods for equivalence hypotheses.

In principle, Bayesian methods may be regarded as an alternative to statistical significance tests [523,524]. Depending on the research question posed, the Institute will, where necessary, also apply Bayesian methods (e.g. for indirect comparisons, see Section 8.3.9).

8.3.3. Evaluation of clinical relevance

The term “clinical relevance” refers to different concepts in the literature. On the one hand, at a group level, it may address the question as to whether a difference between 2 treatment alternatives for a patient-relevant outcome (e.g. serious adverse events) is large enough to recommend the general use of the better alternative. On the other hand, clinical relevance is understood to be the question as to whether a change (e.g. the observed difference of 1 point on a symptom scale) is relevant for individual patients. Insofar as the second concept leads to the inspection of group differences in the sense of a responder definition and corresponding responder analyses, both concepts are relevant for the Institute's assessments.

In general, the evaluation of the clinical relevance of group differences plays a particular role within the framework of systematic reviews and meta-analyses, as they often achieve the power to “statistically detect” the most minor effects [569]. In this context, in principle, the clinical relevance of an effect or risk cannot be derived from a p-value. Statistical significance is a statement of probability, which is not only influenced by the size of a possible effect but also by data variability and sample size. When interpreting the relevance of p-values, particularly the sample size of the underlying study needs to be taken into account [461]. In a small study, a very small p-value can only be expected if the effect is marked, whereas in a large study, highly significant results are not uncommon, even if the effect is extremely small [184,279]. Consequently, the clinical relevance of a study result can by no means be derived from a p-value.

Widely accepted methodological procedures for evaluating the clinical relevance of study results do not yet exist, regardless of which of the above-mentioned concepts are being addressed. For example, only a few guidelines contain information on the definition of relevant or irrelevant differences between groups [344,546]. Methodological manuals on the preparation of systematic reviews also generally provide no guidance or no clear guidance on the evaluation of clinical relevance at a system or individual level (e.g. the Cochrane Handbook [264]). However, various approaches exist for evaluating the clinical relevance of study results. For example, the observed difference (effect estimate and the corresponding confidence interval) can be assessed solely on the basis of medical expertise without using predefined thresholds. Alternatively, it can be required as a formal relevance criterion that the confidence interval must lie above a certain “irrelevance threshold” to exclude a clearly irrelevant effect with sufficient certainty. This then corresponds to the application of a statistical test with a shifting of the null hypothesis in order to statistically demonstrate clinically relevant effects [597]. A further proposal is to evaluate relevance solely on the basis of the effect estimate (compared to a “relevance threshold”), provided that there is a statistically significant difference between the intervention groups [323]. In contrast to the use of a statistical test with a shifting of the null hypothesis, the probability of a type 1 error cannot be controlled through the evaluation of relevance by means of the effect estimate. Moreover, this approach may be less efficient. Finally, a further option in the evaluation of relevance is to formulate a relevance criterion individually, e.g. in terms of a responder definition [324]. In this context there are also approaches in which the response criterion within a study differs between the investigated participants by defining individual therapy goals a priori [453].

Patient-relevant outcomes can also be recorded by means of (complex) scales. A prerequisite for the consideration of such outcomes is the use of validated or established instruments. In the assessment of patient-relevant outcomes that have been operationalized by using (complex) scales, in addition to evaluating the statistical significance of effects, it is particularly important to evaluate the relevance of the observed effects of the interventions under investigation. This is required because the complexity of the scales often makes a meaningful interpretation of minor differences difficult. It therefore concerns the issue as to whether the observed difference between 2 groups is at all tangible to patients. This evaluation of relevance can be made on the basis of differences in mean values as well as responder analyses [497]. A main problem in the evaluation of relevance is the fact that scale-specific relevance criteria are not defined or that appropriate analyses on the basis of such relevance criteria (e.g. responder analyses) are lacking [401]. Which approach can be chosen in the Institute's assessments depends on the availability of data from the primary studies.

In order to do justice to characteristics specific to scales and therapeutic indications, the Institute as a rule uses the following hierarchy for the evaluation of relevance, the corresponding steps being determined by the presence of different relevance criteria.

  1. If a justified irrelevance threshold for the group difference (mean difference) is available or deducible for the corresponding scale, this threshold is used for the evaluation of relevance. If the corresponding confidence interval for the observed effect lies completely above this irrelevance threshold, it is statistically ensured that the effect size does not lie within a range that is certainly irrelevant. The Institute judges this to be sufficient for demonstration of a relevant effect, as in this case the effects observed are normally realized clearly above the irrelevance threshold (and at least close to the relevance threshold). On the one hand, a validated or established irrelevance threshold is suitable for this criterion. On the other hand, an irrelevance threshold can be deduced from a validated, established or otherwise well-justified relevance threshold (e.g. from sample size estimations). One option is to use as the irrelevance threshold the lower confidence limit that would result from a study with sufficient power for the classical null hypothesis if the estimated effect corresponded exactly to the relevance threshold.
  2. If scale-specific justified irrelevance criteria are not available or deducible, responder analyses may be considered. It is required here that a validated or established response criterion was used in these analyses (e.g. in terms of an individual minimally important difference [MID]) [449]. If a statistically significant difference is shown in such an analysis in the proportions of responders between groups, this is seen as demonstrating a relevant effect (unless specific reasons contradict this), as the responder definition already includes a threshold of relevance.
  3. If neither scale-specific irrelevance thresholds nor responder analyses are available, a general statistical measure for evaluating relevance is drawn upon in the form of standardized mean differences (SMD expressed as Hedges' g). An irrelevance threshold of 0.2 is then used: If the confidence interval corresponding to the effect estimate lies completely above this irrelevance threshold, it is assumed that the effect size does not lie within a range that is certainly irrelevant. This is to ensure that the effect can be regarded at least as “small” with sufficient certainty [181].
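To illustrate step 3 with hypothetical group summaries, the following sketch computes Hedges' g with a two-sided 95% confidence interval and checks whether the interval lies entirely above the irrelevance threshold of 0.2 (standard approximate formulas, not the Institute's own implementation):

```python
# Sketch: Hedges' g with 95% CI and check against the 0.2 irrelevance
# threshold (hypothetical group summaries; standard approximate formulas).
from math import sqrt

def hedges_g_ci(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    pooled_sd = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / pooled_sd
    correction = 1 - 3 / (4 * (n1 + n2) - 9)      # small-sample correction
    g = d * correction
    se = sqrt((n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2)))
    return g, (g - z * se, g + z * se)

g, (lower, upper) = hedges_g_ci(53.0, 10.0, 150, 48.0, 10.5, 150)
print(f"g = {g:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print("not certainly irrelevant" if lower > 0.2 else "relevance not shown")
```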

8.3.4. Evaluation of subjective outcomes in open-label study designs

Various empirical studies have shown that in non-blinded RCTs investigating subjective outcomes, effects are biased on average in favour of the test intervention. These subjective outcomes include, for example, PROs, as well as outcomes for which the documentation and assessment strongly depend on the treating staff or outcome assessors. Wood et al. provide a summary of these studies [600]. According to these findings, such results potentially carry a high risk of bias. A generally accepted approach to this problem within the framework of systematic reviews does not exist. In this situation the Institute will normally infer neither proof of benefit nor proof of harm from statistically significant results.

One possibility to take the high risk of bias for subjective outcomes in open-label studies into account is the definition of an adjusted decision threshold. Only if the confidence interval of the group difference of interest shows a certain distance to the zero effect is the intervention effect regarded as so large that it cannot only be explained by bias. The usual procedure for applying an adjusted decision threshold is to test a shifted null hypothesis. This procedure has been applied for decades; among other things, it is required in the testing of equivalence and non-inferiority hypotheses [173]. The prospective determination of a specific threshold value is required in the application of adjusted decision thresholds. If applied, the Institute will justify the selection of a threshold value on a project-specific basis by means of empirical data from meta-epidemiological research [489,600].
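The principle of a shifted null hypothesis can be illustrated as follows for a relative risk; the threshold of 0.85 is purely hypothetical and would in practice require project-specific justification from meta-epidemiological data:

```python
# Sketch of an adjusted decision threshold via a shifted null hypothesis:
# instead of requiring the 95% CI of a relative risk to exclude 1, it must
# lie entirely below a prespecified threshold (here 0.85, a purely
# hypothetical value that would need project-specific justification).
from math import exp, log
from scipy.stats import norm

def shifted_null_test(log_rr, se_log_rr, threshold=0.85, alpha=0.05):
    """One-sided test of H0: RR >= threshold vs H1: RR < threshold."""
    z = (log_rr - log(threshold)) / se_log_rr
    p_one_sided = norm.cdf(z)
    upper_limit = exp(log_rr + norm.ppf(1 - alpha / 2) * se_log_rr)
    return p_one_sided, upper_limit < threshold

p, decision = shifted_null_test(log(0.70), 0.08)
print(f"one-sided p = {p:.4f}, CI entirely below threshold: {decision}")
```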

8.3.5. Demonstration of a difference

Various aspects need to be considered in the empirical demonstration that certain groups differ with regard to a certain characteristic. It should first be noted that the “demonstration” (of a difference) should not be understood as “proof” in a mathematical sense. With the help of empirical study data, statements can only be made by allowing for certain probabilities of error. By applying statistical methods, these probabilities of error can, however, be specifically controlled and minimized in order to “statistically demonstrate” a hypothesis. A typical method for such a statistical demonstration in medical research is the application of significance tests. This level of argumentation should be distinguished from the evaluation of the clinical relevance of a difference. In practice, the combination of both arguments provides an adequate description of a difference based on empirical data.

When applying a significance test to demonstrate a difference, the research question should be specified a priori, and the outcome variable, the effect measure, and the statistical hypothesis formulation should also be specified on the basis of this question. It is necessary to calculate the sample size required before the start of the study, so that the study is large enough for a difference to be detected. In simple situations, in addition to the above information, a statement on the clinically relevant difference should be provided, as well as an estimate of the variability of the outcome measure. For more complex designs or research questions, further details are required (e.g. correlation structure, recruitment scheme, estimate of drop-out numbers, etc.) [46,130].

Finally, the reporting of results should include the following details: the significance level for a statement; a confidence interval for the effect measure chosen (calculated with appropriate methods); descriptive information on further effect measures to explain different aspects of the results; as well as a discussion on the clinical relevance of the results, which should be based on the evaluation of patient-relevant outcomes.

8.3.6. Demonstration of equivalence

One of the most common serious errors in the interpretation of medical data is to rate the non-significant result of a traditional significance test as evidence that the null hypothesis is true [12]. To demonstrate “equivalence”, methods to test equivalence hypotheses need to be applied [313]. In this context, it is important to understand that demonstrating exact “equivalence” (e.g. that the difference in mean values between 2 groups is exactly zero) is not possible by means of statistical methods. In practice, it is not demonstration of exact equivalence that is required, but rather demonstration of a difference between 2 groups that is “at most irrelevant”. To achieve this objective, it must, of course, first be defined what an irrelevant difference is, i.e. an equivalence range must be specified.

To draw meaningful conclusions on equivalence, the research question and the resulting outcome variable, effect measure, and statistical hypothesis formulation need to be specified a priori (similar to the demonstration of a difference). In addition, in equivalence studies the equivalence range must be clearly defined. This range can be two-sided, resulting in an equivalence interval, or one-sided in terms of an “at most irrelevant difference” or “at most irrelevant inferiority”. The latter is referred to as a “non-inferiority hypothesis” [115,298,455].

As in superiority studies, it is also necessary to calculate the required sample size in equivalence studies before the start of the study. The appropriate method depends on the precise hypothesis, as well as on the analytical method chosen [454].

Specifically developed methods should be applied to analyse data from equivalence studies. The confidence interval approach is a frequently used technique. If the confidence interval calculated lies completely within the equivalence range defined a priori, then this will be classified as the demonstration of equivalence. To maintain the level of α = 0.05, it is sufficient to calculate a 90% confidence interval [313]. However, following the international approach, the Institute generally uses 95% confidence intervals.
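The confidence interval approach can be sketched as follows; the equivalence range of ±0.5 points on the outcome scale is a hypothetical example and would in practice have to be defined a priori:

```python
# Sketch of the confidence interval approach for demonstrating equivalence:
# the CI for the group difference must lie entirely within the equivalence
# range defined a priori (here a hypothetical range of -0.5 to +0.5).
from scipy.stats import norm

def equivalence_shown(mean_diff, se_diff, margin=0.5, level=0.95):
    z = norm.ppf(1 - (1 - level) / 2)
    lower, upper = mean_diff - z * se_diff, mean_diff + z * se_diff
    return (lower, upper), (-margin < lower and upper < margin)

ci, ok = equivalence_shown(0.10, 0.15)
print(f"95% CI {ci[0]:.2f} to {ci[1]:.2f}; equivalence shown: {ok}")
```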

Compared with superiority studies, equivalence studies show specific methodological problems. On the one hand, it is often difficult to provide meaningful definitions of equivalence ranges [344]; on the other hand, the usual study design criteria, such as randomization and blinding, no longer sufficiently protect from bias [502]. Even without knowledge of the treatment group, it is possible, for example, to shift the treatment differences to zero and hence in the direction of the desired alternative hypothesis. Moreover, the ITT principle should be applied carefully, as its inappropriate use may falsely indicate equivalence [313]. For this reason, particular caution is necessary in the evaluation of equivalence studies.

8.3.7. Adjustment principles and multi-factorial methods

Primarily in non-randomized studies, multi-factorial methods that make it possible to adjust for confounder effects play a key role [319]. Studies investigating several interventions are a further important field of application for these methods [387]. In the medical literature, the reporting of results obtained with multi-factorial methods is unfortunately often insufficient [38,404]. To be able to assess the quality of such an analysis, the description of essential aspects of the statistical model formation is necessary [245,462], as well as information on the quality of the model chosen (goodness of fit) [273]. The most relevant information for this purpose is usually

a clear description and a-priori specification of the outcome variables and all potential explanatory variables

information on the measurement scale and on the coding of all variables

information on the selection of variables and on any interactions

information on how the assumptions of the model were verified

information on the goodness of fit of the model

inclusion of a table with the most relevant results (parameter estimate, standard error, confidence interval) for all explanatory variables

Depending on the research question posed, this information is of varying relevance. If the aim is a good prediction of the outcome variable within the framework of a prognosis model, a high-quality model is more important than in a comparison of groups, where the main requirement is adjustment for important confounders.

Inadequate reporting of the results obtained with multi-factorial methods is especially critical if the (inadequately described) statistical modelling leads to a shift of effects to the “desired” range, which is not recognizable with mono-factorial methods. Detailed comments on the requirements for the use of multi-factorial methods can be found in various reviews and guidelines [27,39,319].

The Institute uses modern methods in its own regression analyses [244]. In this context, results of multi-factorial models that were obtained from a variable selection process should be interpreted with great caution. If such selection processes cannot be avoided when choosing a model, a form of backward elimination will be used, as this procedure is preferable to forward selection [244,535]. A well-informed and careful preselection of the candidate predictor variables is essential in this regard [126]. If required, modern methods such as the lasso technique will also be applied [552]. For the modelling of continuous covariates, the Institute will, if necessary, draw upon flexible modelling approaches (e.g. regression using fractional polynomials [463,488]) to enable the appropriate description of non-monotonic associations.
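As an illustration of penalised variable selection, the following sketch (Python, simulated hypothetical data) applies the lasso technique mentioned above, with the penalty chosen by cross-validation; it is not a description of the Institute's specific implementation.

# Illustrative sketch of lasso-based variable selection on simulated data;
# all variable names and parameters are hypothetical.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.normal(size=(n, p))
beta = np.array([1.5, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta + rng.normal(scale=1.0, size=n)

X_std = StandardScaler().fit_transform(X)         # lasso requires comparable scales
model = LassoCV(cv=5).fit(X_std, y)               # penalty chosen by cross-validation
selected = np.flatnonzero(model.coef_ != 0)       # covariates retained in the model
print(selected, np.round(model.coef_, 2))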

8.3.8. Meta-analyses

A. General comments

Terms used in the literature, such as “literature review”, “systematic review”, “meta-analysis”, “pooled analysis”, or “research synthesis”, are often defined differently and not clearly distinguished [165]. The Institute uses the following terms and definitions:

A “non-systematic review” is the assessment and reporting of study results on a defined topic, without a sufficiently systematic and reproducible method for identifying relevant research results on this topic. A quantitative summary of data from several studies is referred to as a “pooled analysis”. Due to the lack of a systematic approach and the inherent subjective component, reviews and analyses not based on a systematic literature search are extremely prone to bias.

A “systematic review” is based on a comprehensive, systematic approach and assessment of studies, which is applied to minimize potential sources of bias. A systematic review may, but does not necessarily have to, contain a quantitative summary of study results.

A “meta-analysis” is a statistical summary of the results of several studies within the framework of a systematic review. In most cases this analysis is based on aggregated study data from publications. An overall effect is calculated from the effect sizes measured in individual studies, taking sample sizes and variances into account.

More efficient analysis procedures are possible if IPD are available from the studies considered. An “IPD meta-analysis” is the analysis of data on the patient level within the framework of a general statistical model of fixed or random effects, in which the study is considered as an effect and not as an observation unit.

The Institute sees a “prospective meta-analysis” as a statistical summary (planned a priori) of the results of several prospective studies that were jointly planned. However, if other studies are available on the particular research question, these must also be considered in the analysis in order to preserve the character of a systematic review.

The usual presentation of the results of a meta-analysis is made by means of forest plots, in which the effect estimates of the individual studies and the overall effect (including confidence intervals) are presented graphically [355]. On the one hand, models with a fixed effect are applied, which provide weighted mean values of the effect sizes (e.g. weighting by the inverse of the variance). On the other hand, random-effects models are frequently chosen, in which an estimate of the variance between the individual studies (heterogeneity) is considered. The question as to which model should be applied in which situation has long been a matter of controversy [168,503,574]. If information is available that the effects of the individual studies are homogeneous, a meta-analysis assuming a fixed effect is sufficient. However, such information will often not be available, so that in order to evaluate studies in their totality, an assumption of random effects is useful [504]. Moreover, it should be noted that the confidence intervals calculated from a fixed-effect model may show a substantially lower coverage probability with regard to the expected overall effect than confidence intervals from a random-effects model, even if only minor heterogeneity exists [64]. The Institute therefore primarily uses random-effects models and only switches to models with a fixed effect in well-founded exceptional cases. In this context it should be noted that, if the data situation is homogeneous, meta-analytical results from models with random and fixed effects show at best marginal differences. As described in the following text, the Institute will only perform a meta-analytical summary of strongly heterogeneous study results if the reasons for this heterogeneity are plausible and still justify such a summary.
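The following sketch (Python, hypothetical log odds ratios and standard errors) illustrates standard inverse-variance pooling with a fixed effect and with random effects using the DerSimonian-Laird estimate of the between-study variance; it is intended only to make the two model classes concrete and does not reproduce the Institute's specific model choice.

# Sketch of fixed-effect and random-effects inverse-variance pooling;
# the study data below are hypothetical.
import numpy as np
from scipy import stats

yi = np.array([-0.35, -0.10, -0.55, 0.05])    # study effect estimates (log OR)
sei = np.array([0.20, 0.15, 0.25, 0.30])      # their standard errors

def pool(yi, sei, tau2=0.0):
    w = 1.0 / (sei**2 + tau2)
    est = np.sum(w * yi) / np.sum(w)
    se = np.sqrt(1.0 / np.sum(w))
    ci = est + np.array([-1, 1]) * stats.norm.ppf(0.975) * se
    return est, se, ci

# fixed effect
fe_est, fe_se, fe_ci = pool(yi, sei)

# DerSimonian-Laird estimate of the between-study variance
w = 1.0 / sei**2
q = np.sum(w * (yi - np.sum(w * yi) / np.sum(w))**2)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (q - (len(yi) - 1)) / c)

# random effects
re_est, re_se, re_ci = pool(yi, sei, tau2)
print(fe_ci, re_ci, tau2)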

B. Heterogeneity

Before a meta-analysis is conducted, it must first be considered whether pooling of the studies investigated is in fact meaningful, as the studies must be comparable with regard to the research question posed. In addition, even in the case of comparability, the studies to be summarized will often show heterogeneous effects [266]. In this situation it is necessary to assess the heterogeneity of study results [215]. The existence of heterogeneity can be tested statistically; however, these tests usually have very low power. Consequently, it is recommended that a significance level between 0.1 and 0.2 be chosen for these tests [307,330]. However, it is also important to quantify the extent of heterogeneity. For this purpose, specific statistical measures are available, such as the I2 measure [265]. For this measure, rough classifications of heterogeneity have been proposed, for example into the categories “might not be important” (0 to 40%), “moderate” (30 to 60%), “substantial” (50 to 90%) and “considerable” (75 to 100%) heterogeneity [124]. If the heterogeneity of the studies is too large, statistical pooling of the study results may not be meaningful [124]. The specification of when heterogeneity is “too large” depends on the context. Pooling of data is usually dispensed with if the heterogeneity test yields a p-value of less than 0.2. In this context, the location of the effects also plays a role. If the individual studies show a clear effect in the same direction, then pooling heterogeneous results by means of a random-effects model can also lead to a conclusion on the benefit of an intervention. However, in this situation a positive conclusion on the benefit of an intervention may possibly be drawn even without quantitative pooling of the data (see Section 3.1.4). In the other situations the Institute will not conduct a meta-analysis. Such a decision should be based not only on statistical measures but also on reasons of content, and it must be presented in a comprehensible way. In this context, the choice of the effect measure also plays a role: the choice of one measure may lead to great study heterogeneity, whereas the choice of another may not. For binary data, relative effect measures are frequently more stable than absolute ones, as they do not depend so heavily on the baseline risk [205]. In such cases, the data analysis should be conducted with a relative effect measure; for the descriptive presentation of the data, however, absolute measures for specific baseline risks may possibly be derived from the relative ones.
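As an illustration of the heterogeneity statistics referred to above, the following sketch computes Cochran's Q with its p-value and the I2 measure for the same hypothetical study data used in the pooling example above.

# Sketch of the usual heterogeneity statistics for a meta-analysis;
# study data are the hypothetical values from the previous example.
import numpy as np
from scipy import stats

yi = np.array([-0.35, -0.10, -0.55, 0.05])
sei = np.array([0.20, 0.15, 0.25, 0.30])

w = 1.0 / sei**2
pooled = np.sum(w * yi) / np.sum(w)
q = np.sum(w * (yi - pooled)**2)
df = len(yi) - 1
p_het = stats.chi2.sf(q, df)                      # tested at a liberal level of 0.1-0.2
i2 = max(0.0, (q - df) / q) * 100                 # e.g. >50 % roughly "substantial"
print(round(q, 2), round(p_het, 3), round(i2, 1))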

In the case of great heterogeneity of the studies, it is necessary to investigate potential causes. Factors that could explain the heterogeneity of effect sizes may possibly be detected by means of meta-regression [547,566]. In a meta-regression, the statistical association between the effect sizes of individual studies and the study characteristics is investigated, so that study characteristics can possibly be identified that explain the different effect sizes, i.e. the heterogeneity. However, when interpreting results, it is important that the limitations of such analyses are taken into account. Even if a meta-regression is based on randomized studies, only evidence of an observed association can be inferred from this analysis, not a causal relationship [547]. Meta-regressions that attempt to show an association between the different effect sizes and the average patient characteristics in individual studies are especially difficult to interpret. These analyses are subject to the same limitations as the results of ecological studies in epidemiology [224]. Due to the high risk of bias, which in analyses based on aggregate data cannot be balanced by adjustment, definite conclusions are only possible on the basis of IPD [438,514,547] (see also Section 8.2.3).

The Institute uses prediction intervals to display heterogeneity within the framework of a meta-analysis with random effects [230,262,451]. In contrast to the confidence interval, which quantifies the precision of an estimated effect, the 95% prediction interval covers the true effect of a single (new) study with a probability of 95%. In this context it is important to note that a prediction interval cannot be used to assess the statistical significance of an effect. The Institute follows the proposal by Guddat et al. [230] to insert the prediction interval – clearly distinguishable from the confidence interval – in the form of a rectangle in a forest plot. The use of meta-analyses with random effects and related prediction intervals in the event of very few studies (e.g. less than 5) is critically discussed in the literature, as potential heterogeneity can only be estimated very imprecisely [262]. The Institute generally presents prediction intervals in forest plots of meta-analyses with random effects if at least 4 studies are available and if the graphic display of heterogeneity is important. This is particularly the case if, due to great heterogeneity, no pooled effect is presented.

Prediction intervals are therefore used in forest plots particularly if, due to great heterogeneity, no overall effect can be estimated and displayed. In these heterogeneous situations, the prediction interval is a valuable aid in evaluating whether the study effects point in the same direction and, if so, whether they do so moderately or clearly (see Section 3.1.4).
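The following sketch illustrates the calculation of a 95% prediction interval for a random-effects meta-analysis along the lines of the proposals cited above; the input values are illustrative and would in practice be taken from the random-effects estimation (pooled effect, its standard error, and the between-study variance).

# Sketch of a 95 % prediction interval for a random-effects meta-analysis;
# requires at least 3 studies (k - 2 degrees of freedom). Input values are
# illustrative, of the order of the random-effects sketch above.
import numpy as np
from scipy import stats

def prediction_interval(re_est, re_se, tau2, k, level=0.95):
    t_crit = stats.t.ppf(1 - (1 - level) / 2, df=k - 2)
    half = t_crit * np.sqrt(tau2 + re_se**2)
    return re_est - half, re_est + half

print(prediction_interval(-0.23, 0.12, 0.01, k=4))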

C. Subgroup analyses within the framework of meta-analyses

In addition to the general aspects requiring consideration in the interpretation of subgroup analyses (see Section 8.1.6), there are specific aspects that play a role in subgroup analyses within the framework of meta-analyses. Whereas subgroup analyses conducted post hoc at the study level should generally be viewed critically, a systematic review still depends on the results of such study-level analyses if the review is supposed to investigate precisely these subgroups. In analogy to the approach of not pooling studies with too great heterogeneity by means of meta-analyses, results of subgroups should not be summarized to a common effect estimate if the subgroups differ too strongly from each other. Within the framework of meta-analyses, the Institute usually interprets the results of a heterogeneity or interaction test regarding important subgroups as follows: a significant result at the level of α = 0.05 is classified as proof of different effects in the subgroups; a significant result at the level of α = 0.20 is classified as an indication of different effects. If the data provide at least an indication of different effects in the subgroups, then the individual subgroup results are reported in addition to the overall effect. If the data provide proof of different effects in the subgroups, then the results for all subgroups are not pooled to a common effect estimate. In the case of more than 2 subgroups, pairwise statistical tests are conducted, if possible, to detect whether subgroup effects exist. Pairs that are not statistically significant at the level of α = 0.20 are then summarized in a group. The results of the remaining groups are reported separately and separate conclusions on the benefit of the intervention for these groups are inferred [518].
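A simple heterogeneity (interaction) test between subgroup estimates can be sketched as follows (Python, two hypothetical subgroup estimates); the resulting p-value is interpreted as described above (proof at α = 0.05, indication at α = 0.20).

# Sketch of a Q-based interaction test between two hypothetical subgroup
# estimates (pooled log odds ratios with standard errors).
import numpy as np
from scipy import stats

est = np.array([-0.40, -0.05])    # pooled effect per subgroup (hypothetical)
se = np.array([0.15, 0.18])

w = 1.0 / se**2
overall = np.sum(w * est) / np.sum(w)
q_between = np.sum(w * (est - overall)**2)
p = stats.chi2.sf(q_between, df=len(est) - 1)     # p < 0.05 proof, p < 0.20 indication
print(round(q_between, 2), round(p, 3))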

D. Small number of events

A common problem of meta-analyses using binary data is the existence of so-called “zero cells”, i.e. cases where not a single event was observed in an intervention group of a study. The Institute follows the usual approach here, i.e. in the event of zero cells, the correction value of 0.5 is added to each cell frequency of the corresponding fourfold table [124]. This approach is appropriate as long as not too many zero cells occur. In the case of a low overall number of events, it may be necessary to use other methods. In the case of very rare events the Peto odds-ratio method can be applied; this does not require a correction term in the case of zero cells [56,124].
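The following sketch illustrates both options for sparse binary data described above with a hypothetical fourfold table: the correction value of 0.5 added to each cell, and the Peto odds ratio, which requires no correction term.

# Sketch of handling a zero cell in a hypothetical 2x2 (fourfold) table:
# (a) 0.5 continuity correction, (b) Peto odds ratio (observed minus expected).
import numpy as np

a, n1 = 0, 120      # events / patients, intervention group (zero cell)
b, n2 = 6, 118      # events / patients, control group

# (a) odds ratio with 0.5 added to each cell of the fourfold table
ac, bc = a + 0.5, b + 0.5
cc, dc = (n1 - a) + 0.5, (n2 - b) + 0.5
or_corr = (ac * dc) / (bc * cc)

# (b) Peto odds ratio based on observed minus expected events
N, r = n1 + n2, a + b
O, E = a, n1 * r / N
V = n1 * n2 * r * (N - r) / (N**2 * (N - 1))
or_peto = np.exp((O - E) / V)
print(round(or_corr, 2), round(or_peto, 2))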

If studies exist in which no event is observed in either study arm (so-called “double-zero studies”), then in practice these studies are often excluded from the meta-analytic calculation. This procedure should be avoided if too many double-zero studies exist. Several methods are available to avoid the exclusion of double-zero studies. The absolute risk difference may possibly be used as an effect measure which, especially in the case of very rare events, often does not lead to the heterogeneity that otherwise usually occurs. A logistic regression with random effects represents an approach so far rarely applied in practice [562]. Newer approaches such as exact methods [551] or the application of the arcsine difference [464] represent interesting alternatives, but have not yet been investigated sufficiently. Depending on the particular data situation, the Institute will select an appropriate method and, if applicable, examine the robustness of results by means of sensitivity analyses.
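As a purely illustrative sketch of one of the alternatives mentioned, the arcsine difference below remains defined even for a hypothetical double-zero study, with the usual large-sample variance 1/(4n1) + 1/(4n2); this is not an endorsement of the measure beyond what is stated above.

# Sketch of the arcsine difference as an effect measure that stays defined
# for a hypothetical double-zero study (no events in either arm).
import numpy as np

def arcsine_difference(e1, n1, e2, n2):
    asd = np.arcsin(np.sqrt(e1 / n1)) - np.arcsin(np.sqrt(e2 / n2))
    var = 1.0 / (4 * n1) + 1.0 / (4 * n2)         # large-sample variance
    return asd, var

print(arcsine_difference(0, 150, 0, 145))         # effect 0, finite variance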

E. Meta-analyses of diagnostic studies

The results of studies on diagnostic accuracy can also be statistically pooled by means of meta-analytic techniques [140,306]. However, as explained in Section 3.5, studies investigating only diagnostic accuracy are mostly of subordinate relevance in the evaluation of diagnostic tests, so that meta-analyses of studies on diagnostic accuracy are likewise of limited relevance.

The same basic principles apply to a meta-analysis of studies on diagnostic accuracy as to meta-analyses of therapy studies [140,447]. Here too, it is necessary to conduct a systematic review of the literature, assess the methodological quality of the primary studies, conduct sensitivity analyses, and examine the potential influence of publication bias.

In practice, in most cases heterogeneity can be expected in meta-analyses of diagnostic studies; therefore it is usually advisable here to apply random-effects models [140]. Such a meta-analytical pooling of studies on diagnostic accuracy can be performed by means of separate models for sensitivity and specificity. However, if a summarizing receiver operating characteristic (ROC) curve and/or a two-dimensional estimate for sensitivity and specificity are of interest, newer bivariate meta-analyses with random effects show advantages [241,448]. These methods also enable consideration of explanatory variables [240]. Results are presented graphically either via the separate display of sensitivities and specificities in the form of modified forest plots or via a two-dimensional illustration of estimates for sensitivity and specificity. In analogy to the confidence and prediction intervals in meta-analyses of therapy studies, confidence and prediction regions can be presented in the ROC area in bivariate meta-analyses of diagnostic studies.

F. Cumulative meta-analyses

For some time it has been increasingly discussed whether, in the case of repeated updates of systematic reviews, one should calculate and present meta-analyses included in these reviews as cumulative meta-analyses with correction for multiple testing [49,65,66,418,548,589]. As a standard the Institute applies the usual type of meta-analyses and normally does not draw upon methods for cumulative meta-analyses.

However, in the conceivable case that the Institute is commissioned to update a systematic review at regular intervals until a decision can be made on the basis of a statistically significant result, the Institute will consider applying methods for cumulative meta-analyses with correction for multiple testing.

8.3.9. Indirect comparisons

“Methods for indirect comparisons” are understood to be both techniques for a simple indirect comparison of 2 interventions as well as techniques in which direct and indirect evidence are combined. The latter are called “mixed treatment comparison (MTC) meta-analysis” [368-370], “multiple treatment meta-analysis” (MTM) [90], or “network meta-analysis” [372,476].

These methods represent an important further development of the usual meta-analytic techniques. However, there are still several unsolved methodological problems, so that currently the routine application of these methods within the framework of benefit assessments is not advisable [26,208,477,521,537]. For this reason, in its benefit assessments of interventions, the Institute primarily uses direct comparative studies (placebo-controlled studies as well as head-to-head comparisons); this means that conclusions for benefit assessments are usually inferred only from the results of direct comparative studies.

In certain situations, as, for example, in assessments of the benefit of drugs with new active ingredients [136], as well as in health economic evaluations (HEEs, see below), it can however be necessary to consider indirect comparisons and infer conclusions from them for the benefit assessment, taking a lower certainty of results into account.

For the HEE of interventions, conjoint quantitative comparisons of multiple (i.e. more than 2) interventions are usually required. Limiting the study pool to direct head-to-head comparisons would mean limiting the HEE to a single pairwise comparison or even making it totally impossible. In order to enable an HEE of multiple interventions, the Institute can also consider indirect comparisons to assess cost-effectiveness ratios [284], taking into account the lower certainty of results (compared with the approach of a pure benefit assessment).

However, appropriate methods for indirect comparisons need to be applied. The Institute disapproves of the use of non-adjusted indirect comparisons (i.e. the naive comparison of single study arms); it accepts only adjusted indirect comparisons. These particularly include the approach by Bucher et al. [76], as well as the MTC meta-analysis methods mentioned above. Besides the assumptions of pairwise meta-analyses, which must also be fulfilled here, MTC meta-analyses additionally require sufficient consistency between the effects estimated from the direct and the indirect evidence. The latter is a critical point, as MTC meta-analyses provide valid results only if the consistency assumption is fulfilled. Even though techniques to examine inconsistencies are being developed [142,369], many open methodological questions in this area still exist. It is therefore necessary to describe the model applied completely, together with any remaining unclear issues [537]. In addition, an essential condition for the consideration of an indirect comparison is that it is targeted towards the overall research question of interest and not only towards selective components such as individual outcomes.
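The following sketch illustrates the adjusted indirect comparison according to Bucher et al. for two interventions A and B via a common comparator C; the log odds ratios and standard errors are hypothetical.

# Sketch of an adjusted indirect comparison (Bucher et al.):
# indirect A vs B estimate via a common comparator C, hypothetical data.
import numpy as np
from scipy import stats

d_ac, se_ac = -0.30, 0.12       # A vs C (direct evidence, log OR)
d_bc, se_bc = -0.10, 0.15       # B vs C (direct evidence, log OR)

d_ab = d_ac - d_bc              # indirect estimate A vs B
se_ab = np.sqrt(se_ac**2 + se_bc**2)
ci = d_ab + np.array([-1, 1]) * stats.norm.ppf(0.975) * se_ab
print(round(d_ab, 2), np.round(ci, 2))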

8.3.10. Handling of unpublished or partially published data

In the quality assessment of publications, the problem frequently arises in practice that essential data or information is partially or entirely missing. This mainly concerns “grey literature” and abstracts, but also full-text publications. Moreover, it is possible that studies have not (yet) been published at the time of the Institute's technology assessment.

It is the Institute's aim to conduct an assessment on the basis of a data set that is as complete as possible. If relevant information is missing, the Institute therefore tries to complete the missing data, among other things by contacting the authors of publications or the study sponsors (see Sections 3.2.1 and 7.1.5). However, depending on the type of product prepared, requests for unpublished information may be restricted due to time limits.

A common problem is that important data required for the conduct of a meta-analysis (e.g. variances of effect estimates) are lacking. However, in many cases, missing data can be calculated or at least estimated from the data available [141,275,432]. If possible, the Institute will apply such procedures.
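As an example of such a reconstruction, the following sketch derives a missing standard error of a log hazard ratio from a reported 95% confidence interval (hypothetical values), which can then be used in an inverse-variance meta-analysis.

# Sketch: reconstructing a missing standard error from a reported 95 % CI
# of a ratio measure (here a hypothetical hazard ratio).
import numpy as np
from scipy import stats

hr, ci_low, ci_high = 0.78, 0.62, 0.98
z = stats.norm.ppf(0.975)
se_log_hr = (np.log(ci_high) - np.log(ci_low)) / (2 * z)
print(round(np.log(hr), 3), round(se_log_hr, 3))  # usable for inverse-variance pooling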

If data are only partly available or if estimated values are used, the robustness of results will be analysed and discussed, if appropriate with the support of sensitivity analyses (e.g. by presenting best-case and worst-case scenarios). However, a worst-case scenario can only serve here as proof of the robustness of a detected effect; if a worst-case scenario does not confirm a previously found effect, it cannot be concluded that this effect has not been demonstrated. In cases where relevant information is largely or completely lacking, a publication may not be assessable at all. In such cases, it will merely be noted that further data exist on a particular topic, but are not available for assessment.

8.3.11. Description of types of bias

Bias is the systematic deviation of the effect estimate (inferred from study data) from the true effect. Bias may be produced by a wide range of possible causes [99]. The following text describes only the most important types; a detailed overview of various types of bias in different situations is presented by Feinstein [183].

“Selection bias” is caused by a violation of the random principles for sampling procedures, i.e. in the allocation of patients to intervention groups. Particularly in the comparison of 2 groups, selection bias can lead to systematic differences between groups. If this leads to an unequal distribution of important confounders between groups, the results of a comparison are usually no longer interpretable. When comparing groups, randomization is the best method to avoid selection bias [263], as the groups formed do not differ systematically with regard to known as well as unknown confounders. However, structural equality can only be ensured if the sample sizes are sufficiently large. In small studies, despite randomization, relevant differences between groups can occur at random. When comparing groups with structural inequality, the effect of known confounders can be taken into account by applying multi-factorial methods. However, the problem remains of a systematic difference between the groups due to unknown or insufficiently investigated confounders.

Besides the comparability of groups with regard to potential prognostic factors, equality of treatment and equality of observation for all participants play a decisive role. “Performance bias” is bias caused by different types of care provided (apart from the intervention to be investigated). A violation of the equality of observation can lead to detection bias. Blinding is an effective protection against both performance and detection bias [316], which are summarized as “information bias” in epidemiology.

If not taken into account, protocol violations and study withdrawals can cause a systematic bias of study results, called “attrition bias”. To reduce the risk of attrition bias, in studies that aim to show superiority, the ITT principle can be applied, where all randomized study participants are analysed within the group to which they were randomly assigned, independently of protocol violations [316,338].

Missing values due to other causes present a similar problem. Missing values not arising from a random mechanism can also bias a result [365]. The possible causes and effects of missing values should therefore be discussed on a case-by-case basis and, if necessary, statistical methods should be applied to account for or compensate for bias. In this context, replacement methods (imputation methods) for missing values are only one class of the various methods available, none of which is regarded as generally accepted. For example, EMA recommends comparing various methods for handling missing values in sensitivity analyses [177].

When assessing screening programmes, it needs to be considered that earlier diagnosis of a disease often results only in an apparent increase in survival times, due to non-comparable starting points (“lead time bias”). Increased survival times may also appear to be indicated if a screening test preferably detects mild or slowly progressing early stages of a disease (“length bias”). The conduct of a randomized trial to assess the effectiveness of a screening test can protect against these bias mechanisms [195].

“Reporting bias” is caused by the selective reporting of only part of all relevant data and may lead to an overestimation of the benefit of an intervention in systematic reviews. If, depending on the study results, some analyses or outcomes are not reported within a publication, are reported in less detail, or are reported in a way deviating from what was originally planned, then “selective” or “outcome reporting bias” is present [97,160,263]. In contrast, “publication bias” describes the fact that studies finding a statistically significant negative difference or no statistically significant difference between the test intervention and control group are not published at all or are published later than studies with positive and statistically significant results [530]. The pooling of published results can therefore result in a systematic bias of the common effect estimate. Graphical methods such as the funnel plot [166] and statistical methods such as meta-regression can be used to identify and take account of publication bias. However, these methods can neither reliably confirm nor exclude the existence of publication bias, which underlines the importance of also searching for unpublished data. For example, unpublished information can be identified and obtained by means of trial registries or requests to manufacturers [347,373,436,529,530].
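A funnel plot of the kind mentioned above can be sketched as follows (Python with matplotlib, hypothetical study effects); marked asymmetry of the plot may point to, but cannot prove, publication bias.

# Sketch of a funnel plot: study effects plotted against their standard
# errors, with the fixed-effect pooled estimate as a reference line.
# All study data are hypothetical.
import numpy as np
import matplotlib.pyplot as plt

yi = np.array([-0.45, -0.30, -0.25, -0.60, -0.05, -0.35, -0.50, -0.15])
sei = np.array([0.35, 0.20, 0.15, 0.40, 0.10, 0.25, 0.45, 0.12])

pooled = np.sum(yi / sei**2) / np.sum(1 / sei**2)
plt.scatter(yi, sei)
plt.axvline(pooled, linestyle="--")
plt.gca().invert_yaxis()                          # small, imprecise studies at the bottom
plt.xlabel("log odds ratio")
plt.ylabel("standard error")
plt.show()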

In studies conducted to determine the accuracy of a diagnostic strategy (index test), results may be biased if the reference test does not correctly distinguish between healthy and sick participants (“misclassification bias”). If the reference test is only conducted in a non-random sample of participants receiving the index test (“partial verification bias”) or if the reference test applied depends on the result of the index test (“differential verification bias”), this may lead to biased estimates of diagnostic accuracy. Cases in which the index test itself is a component of the reference test may lead to overestimates of diagnostic accuracy (“incorporation bias”) [351].

“Spectrum bias” is a further type of bias mentioned in the international literature. This plays a role in studies where the sample for validation of a diagnostic test consists of persons who are already known to be sick and healthy volunteers as a control group [361]. The validation of a test in such studies often leads to estimates for sensitivity and specificity that are higher than they would be in a clinical situation where patients with a suspected disease are investigated [591]. However, the use of the term “bias” (in the sense of a systematic impairment of internal validity) in this connection is unfortunate, as the results of such studies may well be internally valid if the study is conducted appropriately [591]. Nonetheless, studies of the design described above may have features (particularly regarding the composition of samples) due to which they are not informative for clinical questions in terms of external validity.

As in intervention studies, in diagnostic studies it is necessary to completely consider all study participants (including those with unclear test results) in order to avoid systematic bias of results [351]. While numerous investigations are available on the relevance and handling of publication bias in connection with intervention studies, this problem has been far less researched for diagnostic accuracy studies [351].

A general problem in the estimation of effects is bias caused by measurement errors in the study data collected [95,100]. In practice, measurement errors can hardly be avoided, and it is known that even non-differential measurement errors can lead to a biased effect estimate. In a simple linear regression model with a classical measurement error in the explanatory variable, “dilution bias” occurs, i.e. the estimate is biased towards the null (zero effect). In other models and more complex situations, however, bias in any direction is possible. Depending on the research question, the magnitude of potential measurement errors should be discussed and, if required, methods applied to adjust for the bias they cause.
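The following small simulation (Python, hypothetical parameters) illustrates dilution bias: a classical, non-differential measurement error in the explanatory variable attenuates the estimated slope of a simple linear regression towards zero.

# Small simulation of dilution bias under a classical measurement error
# in the explanatory variable; all parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(42)
n, true_slope = 5000, 1.0
x = rng.normal(size=n)
y = true_slope * x + rng.normal(scale=0.5, size=n)
x_observed = x + rng.normal(scale=1.0, size=n)    # classical measurement error

slope_true_x = np.polyfit(x, y, 1)[0]
slope_error_x = np.polyfit(x_observed, y, 1)[0]   # attenuated towards zero (about 0.5 here)
print(round(slope_true_x, 2), round(slope_error_x, 2))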

8.4. Qualitative methods

8.4.1. Qualitative studies

Qualitative research methods are applied to explore and understand subjective experiences, individual actions, and the social world [146,243,376,405]. They can enable access to opinions and experiences of patients, relatives, and medical staff with respect to a certain disease or intervention.

The instruments of qualitative research include focus groups conducted with participants of a randomized controlled trial, for example. Qualitative data can also be collected by means of interviews, observations, and written documents, such as diaries.

Data collection is followed by an analysis, which mainly aims to identify and examine overlapping topics and concepts in the data collected. Among other things, qualitative methods can be used as an independent research method, in the preparation of or as a supplement to quantitative studies, within the framework of a triangulation or mixed-methods approach, or after the conduct of quantitative studies in order to explain processes or results. Qualitative research is seen as a method to promote the connection between evidence and practice [148].

Systematic synthesis of various qualitative studies investigating a common research question is also possible [25,337,395,549]. However, no generally accepted approach exists for the synthesis of qualitative studies and the combination of qualitative and quantitative data [148,149].

A. Qualitative studies in the production of health information

In the development of health information the Institute uses available qualitative research findings to identify (potential) information needs, as well as to investigate experiences with a certain disease or an intervention.

Relevant publications are then selected by means of prespecified inclusion and exclusion criteria, and the study quality is assessed by means of criteria defined beforehand. The results of the studies considered are extracted, organized by topic, and summarized in a descriptive manner for use in the development of health information. The Institute may also take this approach in the production of reports.

In recent years, various instruments for evaluating the quality of qualitative studies have been developed [117]. The Institute's main task in the assessment of qualitative studies is to determine whether the study design, study quality, and reliability are appropriate for the research question investigated. Compared with other research areas, there is weaker general consensus on the validity of criteria for the conduct, assessment, and synthesis of qualitative studies [146,149,243,405].

B. Qualitative studies in the production of reports

Different sources of information can support the integration of systematic reviews [147,356,545]. One possible source is research results from qualitative studies [243,356,406,545]. Qualitative studies seem to be establishing themselves in systematic reviews on the benefit assessment of medical services [146,147,406].

Qualitative research can provide information on the acceptability and suitability of interventions in clinical practice [25,146]. The results of qualitative research can be helpful in the interpretation of a systematic review [545] and may be used in the context of primary studies or systematic reviews on determining patient-relevant outcomes [146,148,337,405,406].

The Institute can use qualitative research findings to identify patient-relevant outcomes, and to present background information on patients' experiences and on the patient relevance of the intervention to be assessed. The Institute can also use these findings in the discussion and interpretation of results of a systematic review.

8.4.2. Consultation techniques

The processing of research questions and tasks commissioned to the Institute often requires the consultation of patients, patient representatives, and national and international experts. To do this the Institute uses various consultation techniques.

In the production of reports, the Institute uses these techniques to identify patient-relevant outcomes and to involve national and international experts, and also uses them in the Institute's formal consultation procedure. In the development of health information, consultation techniques serve to involve patients and patient representatives in the identification of information needs, the evaluation of health information, and during consultation.

The Institute uses the following consultation techniques:

key informant interviews [565], e.g. interviews with patient representatives to identify patient-relevant outcomes

group meetings and consultations [407,411,412], e.g. within the framework of scientific debates on the Institute's products

group interviews and focus groups [146,565], e.g. with patients with respect to the evaluation of health information

surveys and polling (including online polling and feedback mechanisms), e.g. to identify information needs of readers of www.gesundheitsinformation.de/www.informedhealthonline.org

If a deeper understanding of experiences and opinions is necessary, then the Institute should use the scientific findings obtained from qualitative research. The use of consultation techniques and the involvement of experts are associated with an additional use of resources. However, the involvement of patients in research processes enables the consideration of patient issues and needs as well as the orientation of research towards these issues and needs [424].

Footnotes

33. Preferred Reporting Items for Systematic Reviews and Meta-Analyses

34. Transparent Reporting of Evaluations with Non-randomized Designs

35. Strengthening the Reporting of Observational Studies in Epidemiology

36. Meta-analysis of Observational Studies in Epidemiology

37. International Society of Quality of Life Research

38. GKV-Modernisierungsgesetz, GMG

39. Assessment of Multiple Systematic Reviews

Copyright © 2015 by the Institute for Quality and Efficiency in Healthcare (IQWiG).
Bookshelf ID: NBK385789
