The SF-36 is a generic health assessment questionnaire that has been used in clinical trials to study the impact of chronic disease on health-related quality of life (HRQoL). The SF-36 consists of 36 items representing eight dimensions: physical functioning, role physical, bodily pain, general health, vitality, social functioning, role emotional, and mental health. Item response options are presented on a 3-point to 6-point Likert-like scale. Each item is scored on a 0 to 100 range, and item scores are averaged together to create the eight domain scores. The SF-36 also provides two component summaries, the physical component summary and the mental component summary, which are created by aggregating the eight domains according to a scoring algorithm. Therefore, the physical and mental component summaries and eight dimensions are each measured on a scale of 0 to 100, which are t-scores (mean of 50 and standard deviation of 10) that have been standardized to the US general population.26 Thus, a score of 50 on any scale would be at the average or norm of the general US population and a score 10 points lower (i.e., 40) would be one standard deviation below the norm.26 On any of the scales, an increase in score indicates improvement in health status. In general use of the SF-36 version 2.0 (SF-36v2), the User’s Manual proposed the following minimal important differences (MIDs): a change of 2 points on the physical component summary and 3 points on the mental component summary. The manual also proposes the following minimal mean group differences, in terms of t-score points, for SF-36v2 individual dimension scores: physical functioning, 3; role physical, 3; bodily pain, 3; general health, 2; vitality, 2; social functioning, 3; role emotional, 4; and mental health, 3. 
It should be noted that these MID values were determined as appropriate for groups with mean t-score ranges of 30 to 40; for higher t-score ranges, values may be higher.26 MID values do not represent patient-derived scores. The MIDs for the SF-36v2 are based on clinical and other non–patient-reported anchors.26
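The norm-based scoring and MID thresholds described above can be sketched as follows. This is a minimal illustration, not the licensed SF-36v2 scoring algorithm: the function names and the reference mean and standard deviation (`NORM_MEAN`, `NORM_SD`) are hypothetical placeholders, and only the MID values are taken from the text.

```python
from statistics import mean

# MIDs in t-score points, as quoted from the User's Manual in the text above.
MID = {
    "physical component summary": 2, "mental component summary": 3,
    "physical functioning": 3, "role physical": 3, "bodily pain": 3,
    "general health": 2, "vitality": 2, "social functioning": 3,
    "role emotional": 4, "mental health": 3,
}

# Hypothetical reference-population values, for illustration only.
NORM_MEAN = 80.0
NORM_SD = 20.0


def domain_score(item_scores):
    """Average 0-to-100 item scores into one 0-to-100 domain score."""
    return mean(item_scores)


def t_score(raw, norm_mean, norm_sd):
    """Rescale a raw score so the reference mean maps to 50 and one SD to 10 points."""
    return 50.0 + 10.0 * (raw - norm_mean) / norm_sd


def meets_mid(scale, change):
    """True if the absolute change in t-score points reaches the MID for that scale."""
    return abs(change) >= MID[scale]


raw = domain_score([100, 75, 50, 75])   # 75
t = t_score(raw, NORM_MEAN, NORM_SD)    # 47.5, a quarter of an SD below the norm
```

With these placeholder norms, a raw domain score of 75 maps to a t-score of 47.5; a difference of 2.5 points from the norm would reach the MID for vitality (2) but not for role emotional (4).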
Two versions of the SF-36 exist: the original and version 2.0 (SF-36v2 was made available in 1996).26 The SF-36v2 contains minor changes to the original survey, including changes to instructions (reduced ambiguity), questions and answers (better layout), item-level response choices (increased), and cultural and language comparability (increased), and the elimination of a response option from the items in the mental health and vitality dimensions.26
One study has investigated benchmarks for MIDs for 1-point lower SF-36 scores in populations with diabetes.37 SF-36 surveys of three general US patient populations were analyzed to derive statistical models using non–patient-reported anchors of two-year mortality, seven-year mortality, ability to work, hospitalization within six months from baseline, and loss of ability to work within six months from baseline. The authors accounted for certain variables, including age, number of comorbidities, education, marital status, and score levels as well as interactions and nonlinear effects in their analyses. The three surveys produced different outcome risks associated with 1-point changes in SF-36 dimension and component scores. The models were then applied to the diabetes subpopulations within each patient population to estimate the relative risk associated with each outcome and a 1-point hypothetical decrease in SF-36 scores. Different risks were associated with each population and each outcome. For example, using the Medicare Health Outcome Survey, 1-point lower dimension and component scores were associated with increased risks of two-year mortality ranging from 1.8% to 6.4%, while the Medical Outcomes Survey data generated increased risks of seven-year mortality ranging from 2.0% to 9.0%. One-point lower scores using Medical Outcomes Survey data were associated with a six-month increased risk of hospitalization ranging from not statistically significant to 3.7% and an increased risk of losing the ability to work within six months of baseline ranging from 2.8% to 6.9%.37 While MID benchmarks can be helpful in interpreting SF-36 scores in the absence of minimal clinically important differences (MCIDs), the magnitude of the increased risk, while statistically significant, can be difficult to interpret from a clinical and patient perspective. 
Furthermore, the 1-point score decrease associated with a small risk of hospitalization within a six-month time frame is difficult to interpret as clinically meaningful. Finally, the study failed to adjust for potentially important confounding variables relating to diabetes, including disease type (type 1 diabetes mellitus [T1DM] versus type 2 diabetes mellitus [T2DM]), disease duration, treatment type, glycemic control, lifestyle factors (such as smoking), and socioeconomic factors (such as income level).37 As such, the validity of these 1-point score difference benchmarks remains unclear.
Validation of the Short Form (36) Health Survey in Type 1 and Type 2 Diabetes
Validation of the SF-36 has been performed in a number of studies in T1DM and T2DM combined populations38–41 and in T2DM general populations in Germany (N = 144),42 the United Kingdom (N = 131),43 Pima Indian adults (N = 54),44 older Chinese adults (N = 182),45 and US veterans (N = 331; 98% male).46 All validation studies were performed in male and female adults; none assessed the SF-36 in T1DM patients exclusively. Validation tests in these populations are described in the sections below.
Reliability
Cronbach’s alpha correlation coefficients measure internal consistency and reliability, conveying how well an item relates to its hypothesized dimension. Alpha coefficients varied by study and population: some studies reported internal reliability of ≥ 0.7 (up to 0.94) for all dimensions,41,43 while others found lower reliability in some dimensions: social functioning,38,44 role emotional, role physical, vitality,42 and general health.39,42,45 Internal reliability discrepancies (dimensions with alpha lower than 0.7) may relate to the specific characteristics, health states, and socioeconomic or cultural traits of the population used to validate the instrument. No dimensions were found to have alpha coefficients ≥ 0.95, though some exceeded 0.9 (higher alpha coefficients may suggest redundancy).
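For readers unfamiliar with the statistic, the standard formula for Cronbach's alpha can be sketched as below. The function name and the response data are hypothetical; the 0.7 cut-off is the one used in the paragraph above, and none of the cited studies' data are reproduced here.

```python
from statistics import variance

ACCEPTABLE_ALPHA = 0.7  # threshold referred to in the text


def cronbach_alpha(items):
    """Cronbach's alpha for one dimension.

    `items` is a list of item columns; each column holds one response per
    respondent, and all columns have the same length.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Total score per respondent across the k items.
    totals = [sum(responses) for responses in zip(*items)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    item_var_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical responses from five people to a three-item dimension.
example_items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(example_items)
is_acceptable = alpha >= ACCEPTABLE_ALPHA
```

Alpha rises as items within a dimension covary more strongly, which is why very high values can indicate redundant items rather than better measurement.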
One US study of the adult population (18 to 60 years of age; 64% T1DM, 31% T2DM) measured test-retest reliability by comparing baseline to six-month surveys. All correlations were positive, but the coefficients differed by dimension: 0.902 for physical function; > 0.6 to 0.9 for social function, role physical, role emotional, mental health, vitality, and general health perception; and 0.433 for pain. To gauge whether health state was maintained between the two surveys, a diabetes-specific health status questionnaire was completed by each patient at both time points; its correlation coefficient was 0.827.40 Test-retest reliability was also measured in a German population of T2DM (approximately 70 patients, approximately 50% taking insulin, approximately 50% male) within one to three days of the original test. Measures of internal consistency at both time points were captured, but no correlations were calculated. Internal consistency ranged from 0.67 to 0.96 at baseline and from 0.61 to 0.89 at retest. Upon retest, some dimensions were more affected than others, including role emotional and role physical (both lower) and general health (higher).42
Responsiveness has been assessed in a single study of 331 US veterans (98% male, mean age of 63.5 years, 91% T2DM). The observational, prospective study of 25 diabetic complications, sampled at two time points over a mean interval of 3.1 years, was powered to detect a minimum difference of 5 points across all dimensions of the SF-36 and used Cohen’s effect size to evaluate responsiveness (effect size ranges were defined as “trivial” [< 0.20], “small” [≥ 0.20 to < 0.50], “moderate” [≥ 0.50 to < 0.80], or “large” [≥ 0.80]).46 Six of the SF-36 dimensions (general health, physical functioning, social functioning, role physical, bodily pain, and vitality) were found to be responsive when patients who developed two or more complications were compared with those who were stable or improved (effect size 0.31 to 0.66); an increase of more than one complication was associated with a loss of 4.1 points to 23.6 points on these six scales. Statistically significant changes in SF-36 dimension scores were related to any renal complication in five of these six dimensions (general health, physical functioning, social functioning, role physical, vitality) or to any neuropathy complication in four of them (general health, physical functioning, role physical, vitality).46
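The effect size bands quoted above can be applied as in the sketch below. The classification thresholds come from the text; the pooled-standard-deviation form of Cohen's d is an assumption, since the text does not state which variant the study used, and the function names and example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using a pooled SD (assumed variant)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (
        n_a + n_b - 2
    )
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def classify_effect_size(d):
    """Map the magnitude of an effect size to the bands defined in the text."""
    size = abs(d)
    if size < 0.20:
        return "trivial"
    if size < 0.50:
        return "small"
    if size < 0.80:
        return "moderate"
    return "large"


# Hypothetical dimension scores for two small groups of patients.
stable = [50, 52, 54]
new_complications = [44, 46, 48]
d = cohens_d(stable, new_complications)
label = classify_effect_size(d)
```

Under these bands, the reported range of 0.31 to 0.66 spans "small" to "moderate" responsiveness.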
Validity
Two cross-sectional studies conducted in Taiwan38 and mainland China45 primarily studying T2DM patients (mean ages of 63 years38 and 69 years45) evaluated the internal validity of the SF-36 by factor analysis (eigenvalues ≥ 1.0 and factor loadings ≥ 0.4 were significant). In one study, all dimensions loaded onto their hypothesized component summary (physical or mental).38 In the other study, factor analysis revealed appropriate loading except for general health on the mental component summary and role physical on both the mental component summary and the physical component summary. Item-dimension correlations ranged from 0.27 to 0.81 across all dimensions and summary scores; only the physical functioning dimension had a scaling success rate < 100% (physical functioning, 99%).45 In a large observational cohort study of chronic disease in the United States (T1DM and T2DM subgroup, N = 624), item-dimension correlations ranged from 0.62 to 0.76 in all but the general health (0.38 to 0.71) and physical functioning (0.52 to 0.82) dimensions.41 Scaling success rates from 280 tests, based on item correlation with hypothesized dimension exceeding that of all others by more than two standard errors, were 100% in all but general health (90%) and physical functioning (99%).41 General health was found to correlate with both the physical component summary and the mental component summary during SF-36 development.26
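The scaling success criterion described above (an item's correlation with its own dimension exceeding its correlation with every other dimension by more than two standard errors) can be illustrated as follows. This sketch assumes the common approximation of 1/√n for the standard error of a correlation; the text does not state how the standard error was computed, and the item labels and correlations below are hypothetical.

```python
from math import sqrt


def scaling_success_rate(own_corr, other_corrs, n):
    """Percentage of item-versus-other-dimension tests that count as successes.

    own_corr:    item -> correlation with its hypothesized dimension
    other_corrs: item -> list of correlations with each competing dimension
    n:           sample size, used for the assumed standard error 1/sqrt(n)
    """
    margin = 2 * (1 / sqrt(n))  # two standard errors (assumed approximation)
    results = [
        own_corr[item] - r > margin
        for item, correlations in other_corrs.items()
        for r in correlations
    ]
    return 100 * sum(results) / len(results)


# Hypothetical correlations for two items, each tested against two other dimensions.
own = {"item_1": 0.60, "item_2": 0.45}
others = {"item_1": [0.30, 0.45], "item_2": [0.40, 0.20]}
rate = scaling_success_rate(own, others, n=400)  # 3 of 4 tests succeed
```

A rate below 100%, as reported for general health and physical functioning, means at least one item did not separate clearly from a competing dimension.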
Inter-dimension correlations of the SF-36 in a T1DM and T2DM patient population ranged from 0.179 (mental health correlation with physical functioning) to 0.637 (role physical with pain),40 suggesting that different dimensions are measuring somewhat different constructs.
One challenge when validating a pre-established, generic HRQoL instrument for use in a specific disease population is in the identification of appropriate measures against which to test the instrument (construct validity) when no gold standard is available (criterion validity). A number of studies have assessed the association between glycated hemoglobin (A1C), a known surrogate marker in both forms of diabetes, and SF-36 dimensions, or have performed known-group comparisons based on A1C level stratification. These studies have established that there is no clear relationship between dimensions of the SF-36 and A1C levels, reporting unexpected, poor, or negligible correlations44,47 or an inability of the SF-36 to discriminate between known groups based on A1C levels.38,40 An initial study comparing physician assessment of patient health to the patient-reported SF-36 dimension scores reported unsatisfactory correlations (0.39 to 0.64).40 Construct validity testing was based on exploratory and a priori hypotheses. The SF-36 showed evidence of measuring effects of diabetic complications,43 treatment type, and changes following diabetes interventions,42,44 but it was also influenced by non-diabetic comorbidity43,44 and other non–diabetes-specific factors, such as age:43,44
Age: Physical functioning, role physical, social functioning, and mental health deteriorated in older age groups (Spearman rank correlation coefficients, −0.52 to −0.40; P < 0.005)44; physical functioning and role physical were impaired in older age groups (P < 0.05),43 but role emotional was impaired in younger age groups (P < 0.01).43
Sex: No statistically significant differences were found.43,44 Women had lower scores on multiple dimensions (P < 0.05).42
Education level: No correlation was found.44
Socioeconomic status and income: No statistically significant differences were found.43,44
Diabetes-related laboratory markers: No correlation was found.44
Diabetic complications: These were associated with lower dimension scores for social functioning, role emotional, vitality (P < 0.01), and role physical (P < 0.05)43; all dimensions of the SF-36 were lower with more than one late complication (P < 0.01).42
Non-diabetic comorbidities: These showed lower scores in physical functioning and role physical (both with P = 0.001), vitality and general health (P < 0.01), and mental health (P < 0.05).43
Comorbidities: These showed lower scores in physical functioning, role physical, role emotional, mental health, and social functioning (Spearman rank correlation coefficients, −0.42 to −0.32; P < 0.02).44
Diabetic treatment: Insulin was associated with lower scores than non-insulin treatment in physical functioning, role physical, social functioning, and general health (Spearman rank correlation coefficients, 0.31 to 0.40; P < 0.03),44 and in vitality and mental health (P < 0.01).42
Response to diabetes intervention (treatment, education, or both) showed statistically significant score changes for the following dimensions: Role physical, general health, vitality, and social functioning (P < 0.05 or less).42
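Several of the findings listed above are expressed as Spearman rank correlation coefficients. As a reminder of how that statistic is obtained, the sketch below ranks both variables (averaging tied ranks) and then computes a Pearson correlation on the ranks. The function names and the age and score values are hypothetical and are not taken from any of the cited studies.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ages and physical functioning scores for five patients.
ages = [45, 52, 60, 68, 75]
scores = [90, 85, 70, 72, 50]
rho = spearman(ages, scores)  # negative: scores tend to fall as age rises
```

A negative coefficient, as in the age findings, indicates that dimension scores tend to fall as the other variable rises.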
Validity of the SF-36 dimensions was also evaluated using diabetes-specific HRQoL measures. The Audit of Diabetes Dependent Quality of Life is a validated tool for measuring the impact of diabetes on general quality of life across 13 domains. SF-36 correlated better with this tool in patients without any other disease or comorbidity than in those with comorbidities (Spearman’s rank coefficients, 0.30 to 0.44) across five domains: social functioning, role physical, mental health, vitality, and general health (P < 0.05).43 Another study compared validation of the SF-36 with Diabetes-39, a five-dimension measure consisting of 39 items that probe diabetes-related HRQoL.38 The SF-36 performed better than Diabetes-39 on some dimensions and in the physical component summary for cardiovascular disease and cerebrovascular complications (Cohen’s effect sizes highest in the physical dimensions) and for the diabetic all-complication summary known-group comparison: the SF-36 effect size was 0.38 compared with 0.15 for the Diabetes-39 summary score. The Diabetes-39 had greater discriminative power than the SF-36 (based on C-statistic) for two-hour post-prandial glucose (0.7 versus 0.63; P < 0.05); the SF-36 generally performed better than the Diabetes-39 for complication-known groups. SF-36 dimensions performed better at a statistically significant level than Diabetes-39 subscales for cardiovascular disease and the all-complication–known groups;38 in the German T2DM population, SF-36 showed statistically significant multidimensional changes after diabetes intervention when the Diabetes-39 did not.42 Based on a priori hypotheses, known-groups comparisons of self-reported high blood pressure, heart problems, and measured depression levels showed significantly higher SF-36 dimension scores for patients without high blood pressure, heart problems, or moderate to high depressive levels (no effect sizes presented).45
Critical Appraisal
The SF-36 requires further and more comprehensive validation across the combined population of T1DM and T2DM patients, across different ethnic and cultural populations, and in T1DM patient groups. As SF-36 was developed as a generic instrument, it has been suggested that the tool be evaluated and possibly revalidated whenever a new study is undertaken in any diabetes population, as some items and dimensions of the SF-36 did not respond optimally during validation in various groups.38,45 Furthermore, in the CADTH Common Drug Review (CDR) search of the literature, few studies were identified that attempted to validate the test-retest reliability,40,42 responsiveness,46 or MCID of the SF-36 in the general diabetes population, in separate T1DM and T2DM populations, and in more specific diabetes subgroups. The SF-36 has shown evidence of measuring the effects of diabetic complications,42,43 but it is also influenced by non-diabetic comorbidity43,44 and other non–diabetes-specific factors, such as age.43,44 It does not demonstrate evidence of association with surrogate markers of disease severity,38,40,44,47 but does respond to treatment type and changes following diabetes interventions.42,44 The SF-36 and diabetes-specific instruments likely provide some degree of overlap, but also address different features of a patient’s overall HRQoL.38,45 Taken together, the evidence suggests that the SF-36 is not likely an appropriate stand-alone tool for the evaluation of all facets of HRQoL in patients with diabetes, but can provide useful insight when used in combination with the appropriate, complementary diabetes-specific treatment evaluation and HRQoL instruments. Comprehensive validation of the SF-36 in T1DM and T2DM is incomplete, and no MCID specifically in diabetes has been established.