Validity of Outcome Measures


Clinical Review Report: Levodopa/Carbidopa (Duodopa): (Abbvie Corporation): Indication: For the treatment of patients with advanced levodopa-responsive Parkinson’s disease who do not have satisfactory control of severe, debilitating motor fluctuations and hyper-/dyskinesia despite optimized treatment with available combinations of Parkinson’s medicinal products, and, for whom the benefits of this treatment may outweigh the risks associated with the insertion and long-term use of the percutaneous endoscopic gastrostomy-jejunostomy (PEG-J) tube required for administration [Internet]. Ottawa (ON): Canadian Agency for Drugs and Technologies in Health; 2018 Sep.


Appendix 5: Validity of Outcome Measures

Aim

To summarize the validity of the following outcome measures:

Parkinson disease home diary (PDHD)
39-Item Parkinson’s Disease Questionnaire (PDQ-39)
Clinical Global Impression – Improvement (CGI-I)
Unified Parkinson’s Disease Rating Scale (UPDRS)
EuroQol 5-Dimensions 3-Levels (EQ-5D-3L) questionnaire
Zarit Burden Interview (ZBI)

Findings

Table 24: Validity and Minimal Important Differences of Outcome Measures

Instrument	Type	Evidence of Validity	MCID	References
Parkinson Disease Home Diary	The PDHD is a PD diary for patients experiencing motor fluctuations and dyskinesia. It records the amount of “on” and “off” time over 24 hours. It consists of 5 categories: (1) asleep, (2) “off,” (3) “on” without dyskinesia, (4) “on” with non-troublesome dyskinesia, and (5) “on” with troublesome dyskinesia.	Construct validity and reliability evaluated in a single study	“Off” time: 1 hour (patients with motor fluctuations); 1 to 1.3 hours (APD)	⁶¹,⁶³,⁷⁷
Parkinson’s Disease Questionnaire-39	The PDQ-39 is a disease-specific HRQoL measure consisting of eight domains (mobility, activities of daily living, emotional well-being, stigma, social support, cognition, communication, and bodily discomfort) graded on a 5-point scale (0 = never; 4 = always).	Yes	Total score: −1.6; Mobility: −3.2; ADL: −4.4; Emotional well-being: −4.2; Stigma: −5.6; Social support: −11.4; Cognition: −1.8; Communication: −4.2; Pain: −2.1	⁶²,⁷⁸–¹⁰⁶
Clinical Global Impression	The CGI is a generic assessment of the clinician’s view of the patient’s global functioning, and consists of 3 components: Severity of Illness (CGI-S), Global Improvement (CGI-I), and Efficacy Index (CGI-E). The CGI-I subscale is graded on a 7-point scale (1 = very much improved; 7 = very much worse).	No information on the validity and reliability of the CGI-I subscale	Unknown in PD patients	¹⁰⁷,¹⁰⁸
Unified Parkinson’s Disease Rating Scale	Measure of disability and impairment in PD. Four parts: Part I (mentation, behaviour, and mood; four items, score 0 to 16); Part II (activities of daily living; 13 items, score 0 to 52 for each state); Part III (motor examination; 14 items, score 0 to 108); and Part IV (complications of therapy in past week; 11 items, score 0 to 23). Total score from 0 (best) to 199 (worst).	Yes	Total score: 3 to 8 points (EPD); 4.1 to 17.8 points (all stages). Part II: 0.5 to 3.0 points (EPD) and 1.8 to 2.3 points (APD). Part III: 2.0 to 6.2 points (EPD) and 5.2 to 6.5 points (APD)	⁶¹,⁶³,¹⁰⁹–¹²⁶
EuroQol 5-Dimensions 3-Levels	The EQ-5D-3L is a generic, preference-based, HRQoL measure consisting of 5 dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has 3 levels representing no problems (1), some problems (2), and extreme problems (3).	Yes	Index score range: 0.10 to 0.11	⁶⁴,¹²⁷–¹³⁰
Zarit Burden Interview	The ZBI is a generic self-reported measure to assess caregiver burden. It consists of 22 items measured on a 5-point Likert scale. Scores of individual items range from 0 (never) to 4 (nearly always); total score ranges between 0 and 88, with a higher score representing more burden.	Yes, limited information on reliability	Unknown in caregivers of patients with PD	³⁵,¹³¹–¹³⁵

ADL = activities of daily living; APD = advanced Parkinson disease; CGI-E = Clinical Global Impression – Efficacy Index; CGI-I = Clinical Global Impression – Improvement; CGI-S = Clinical Global Impression – Severity of Illness; EPD = early Parkinson disease; EQ-5D-3L = EuroQol 5-Dimensions 3-Levels questionnaire; HRQoL = health-related quality of life; MCID = minimal clinically important difference; PD = Parkinson disease; PDHD = Parkinson disease home diary; PDQ-39 = 39-item Parkinson’s Disease Questionnaire; ZBI = Zarit Burden Interview.

Evidence from validation studies is summarized for all instruments according to the following metrics and depending on information availability: comprehensiveness (i.e., how well the measure captures the areas of health-related quality of life [HRQoL] relevant to patients with Parkinson disease [PD]); feasibility (i.e., duration and ease of administration in different settings); validity (i.e., content, construct [convergent, discriminant], and criterion [concurrent, predictive] validity); reliability (internal consistency; i.e., inter-item correlations); reproducibility (i.e., test–retest [inter/intra-rater] reliability); responsiveness (sensitivity to detecting meaningful changes over time); floor and ceiling effects (i.e., the extent to which respondents score at the bottom or top of a scale); and scaling assumptions (i.e., correctly grouping items into scales and summing them to produce a score with or without weighting or standardizing).

Interpretation of the reliability and validity metrics was based on the following criteria:

Inter/intra-rater reliability/agreement (kappa statistics or intraclass correlation coefficient [ICC]): < 0 to 0.2 = poor; 0.21 to 0.4 = fair; 0.41 to 0.6 = moderate; 0.61 to 0.8 = substantial; 0.81 to 1.00 = almost perfect agreement¹³⁶
Internal consistency (Cronbach’s alpha) and test–retest reliability: ≥ 0.7 is considered acceptable¹³⁷
Validity; i.e., between-scale comparison (correlation coefficient, r): ≤ 0.3 = weak; > 0.3 to ≤ 0.5 = moderate; > 0.5 = strong¹³⁸
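
The criteria above can be sketched as a small set of lookup functions. This is an illustrative sketch only; the function names are hypothetical, values that fall in the gaps between the published bands (e.g., 0.205) are assigned to the lower band, and correlations are classified by absolute value. Both of those choices are assumptions not stated in the criteria.

```python
def interpret_agreement(value: float) -> str:
    """Label a kappa or ICC value using the agreement bands listed above.

    Assumption: values between bands (e.g., 0.205) go to the lower band.
    """
    if value <= 0.2:
        return "poor"
    if value <= 0.4:
        return "fair"
    if value <= 0.6:
        return "moderate"
    if value <= 0.8:
        return "substantial"
    return "almost perfect"


def is_acceptable_reliability(value: float) -> bool:
    """Cronbach's alpha or test-retest coefficient: >= 0.7 is acceptable."""
    return value >= 0.7


def interpret_correlation(r: float) -> str:
    """Label a between-scale correlation.

    Assumption: the sign is ignored and only the magnitude is classified.
    """
    magnitude = abs(r)
    if magnitude <= 0.3:
        return "weak"
    if magnitude <= 0.5:
        return "moderate"
    return "strong"


print(interpret_agreement(0.71))        # substantial
print(is_acceptable_reliability(0.51))  # False
print(interpret_correlation(-0.22))     # weak
```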

The Parkinson Disease Home Diary

The PDHD is a self- or caregiver-administered diary that patients with PD experiencing motor fluctuations and dyskinesia can fill out during their participation in a clinical trial. This diary aims to assess the amount of “on” and “off” time that patients experience in a 24-hour period.⁷⁷ The PDHD consists of five distinct states: (1) asleep; (2) “off”; (3) “on” without dyskinesia; (4) “on” with non-troublesome dyskinesia; and (5) “on” with troublesome dyskinesia.⁷⁷

The PDHD has been validated in only a single study, conducted among 302 patients from 10 countries with idiopathic PD who were experiencing motor fluctuations and dyskinesia and were capable of accurately completing the diaries.⁷⁷ The diary was shown to be both feasible and simple to use (with an 83% completion rate without duplication or error);⁷⁷ however, errors and non-compliance were more prevalent after three days of use. The adaptability of the PDHD for non-English speakers was further demonstrated after the diary was translated from English into study participants’ native languages. The PDHD displayed substantial reliability (mean ICC: 0.71; mean correlation between any two days: 0.74); test–retest reliability improved as the number of diary days increased (Cronbach’s alpha > 0.8 for comparisons between two or more diary days). Finally, a moderate correlation was observed between visual analogue scale (VAS) measures and the corresponding PDHD measures (Pearson correlation coefficients ranged from 0.36 to 0.57); thus, acceptable construct (concurrent) validity was shown.⁷⁷ The major limitation of the study is that the PDHD was compared against only one patient-reported functional measure (VAS), without other external measures of function and disability.

Minimal Clinically Important Difference

Hauser et al. analyzed data from a placebo-controlled, randomized controlled trial (RCT) that included patients with motor fluctuations (N = 472); the total duration was six months. Using the Clinical Global Impression – Improvement (CGI-I) as an anchor, they reported a minimal clinically important difference (MCID) of a one-hour reduction in self-rated “off” time.⁶¹ Using data from another double-dummy, placebo-controlled RCT among advanced PD patients (N = 517, 18 weeks), Hauser et al. estimated the MCID for “off” time with the CGI-I and the Patient Global Impression – Improvement (PGI-I) as anchors. Among advanced PD patients, the MCID for “off” time ranged from 1 hour to 1.3 hours.⁶³

39-Item Parkinson’s Disease Questionnaire

The PDQ-39 is one of the most commonly used PD-specific HRQoL measures. Its measurement properties have been studied extensively and it has been recommended for use by the Movement Disorder Society (MDS).⁸⁶ The PDQ-39 is a self-administered questionnaire consisting of 39 items that measure eight domains of health: mobility (10 items), activities of daily living (ADL; six items), emotional well-being (six items), stigma (four items), social support (three items), cognition (four items), communication (three items), and bodily discomfort (three items).⁷⁸ Each item is graded on a 5-point Likert scale (0 = never; 4 = always); the items are then added to generate the respective domain scores. Each domain is coded on a scale of 0 (no problem at all) to 100 (maximum level of a problem). Further, an overall single summary index (the PD summary index [PDQSI]) representing the global HRQoL can be created by averaging the eight subscale scores. The PDQSI is also coded on a scale ranging from 0 to 100, with higher scores indicating worse quality of life (QoL).⁷⁸,⁸²
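
The scoring described above can be illustrated with a short sketch. It assumes that each domain score is a linear rescaling of the summed item responses (sum divided by the maximum possible sum, times 100) and that all items are answered; handling of missing responses is not covered. The function and variable names are hypothetical, and the respondent data are invented for illustration.

```python
# Number of items per domain, as listed in the text above.
PDQ39_DOMAIN_ITEMS = {
    "mobility": 10,
    "activities_of_daily_living": 6,
    "emotional_well_being": 6,
    "stigma": 4,
    "social_support": 3,
    "cognition": 4,
    "communication": 3,
    "bodily_discomfort": 3,
}

MAX_ITEM_SCORE = 4  # each item is graded 0 (never) to 4 (always)


def domain_score(item_responses: list[int]) -> float:
    """Rescale summed item responses to 0 (no problem) .. 100 (maximum problem).

    Assumption: linear rescaling, with every item answered.
    """
    if any(r < 0 or r > MAX_ITEM_SCORE for r in item_responses):
        raise ValueError("item responses must be between 0 and 4")
    return 100.0 * sum(item_responses) / (MAX_ITEM_SCORE * len(item_responses))


def summary_index(responses: dict[str, list[int]]) -> float:
    """PDQSI: unweighted mean of the eight domain scores."""
    for domain, n_items in PDQ39_DOMAIN_ITEMS.items():
        if len(responses[domain]) != n_items:
            raise ValueError(f"{domain} expects {n_items} items")
    scores = [domain_score(responses[d]) for d in PDQ39_DOMAIN_ITEMS]
    return sum(scores) / len(scores)


# Hypothetical respondent who answers 2 on every item.
example = {d: [2] * n for d, n in PDQ39_DOMAIN_ITEMS.items()}
print(sum(PDQ39_DOMAIN_ITEMS.values()))  # 39 items in total
print(summary_index(example))            # 50.0
```

Because the index is an unweighted mean of domain scores, a three-item domain such as social support contributes as much to the PDQSI as the ten-item mobility domain.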

The psychometric properties of the domain and index scores of the PDQ-39 have been extensively evaluated in many studies across different geographic locations and with different languages, including Chinese, Estonian, French, Japanese, Korean, Portuguese, Persian, Spanish, and Swedish.⁸⁷–¹⁰⁶ Only evidence for the English version of the scale is summarized here.

One study by Damiano et al.⁸⁰ assessed the comprehensiveness of the PDQ-39 in a clinical trial setting based on a literature review and through consultation with clinicians and patients. The authors found that the PDQ-39 measures 10 of the 12 areas of HRQoL identified as relevant to PD patients; the two areas not covered were self-image and sexual function.⁸⁰

The Damiano et al. study reported that the PDQ-39 has a short administration time (estimated to be less than 30 minutes) and can be uniformly administered by patients, interviewers, or caregivers.⁸⁰ One study by Jenkinson et al. validated the PDQ-39 in a cross-cultural study across five countries (including Canada and the US).⁸³ As with the previous study, a high completion rate (> 82%) and a low percentage of missing scores (< 5%) were reported for both domain and index scores.⁸³ Additionally, assessments of the validity of the PDQ-39 have been conducted in different settings, including clinic-based, community-based, and longitudinal samples, making the interpretation more generalizable.⁸⁰

The UK-based research group that developed the scale assessed the internal consistency of the PDQ-39 domain scores and the PDQSI and found it to be acceptable (Cronbach’s alpha > 0.7 and > 0.8, respectively), indicating that the items performed well enough together to form a composite score. The test–retest reliability (range: 0.68 to 0.94) was high.⁷⁸,⁸² A US study adapted the British version into a US version and found high test–retest reliability (range: 0.86 to 0.95) as well as corroborating psychometric properties, with Cronbach’s alpha > 0.7 for all but one of the domains (social support, alpha = 0.51).⁷⁹ Similarly, adequate internal consistency was reported by Damiano et al.⁸⁴ (Cronbach’s alpha ≥ 0.7 across domains and 0.85 for the PDQSI), with the exception of social support (Cronbach’s alpha = 0.57). Findings from the cross-national validation study were similar, with generally adequate internal consistency for all domains (Cronbach’s alpha ≥ 0.7) except social support.
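
For readers unfamiliar with the statistic, the sketch below computes Cronbach’s alpha using the standard formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores), where k is the number of items. It is not taken from any of the cited studies; the function name and the response data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores: list[list[float]]) -> float:
    """Cronbach's alpha, where item_scores[i][j] is respondent i's score on item j.

    Uses sample variances (n - 1 denominator) for items and for total scores.
    """
    n_items = len(item_scores[0])
    if n_items < 2 or len(item_scores) < 2:
        raise ValueError("need at least two items and two respondents")
    item_variances = [
        variance([row[j] for row in item_scores]) for j in range(n_items)
    ]
    total_variance = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: four respondents, three items scored 0 to 4.
responses = [
    [0, 1, 0],
    [1, 1, 2],
    [2, 3, 2],
    [4, 3, 4],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2), "acceptable" if alpha >= 0.7 else "below threshold")
```

A domain with few items, such as the three-item social support domain, tends to yield a lower alpha for the same inter-item correlations, which may partly explain why that domain falls below 0.7 in several studies.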

The developers of the PDQ-39 documented the construct (specifically convergent) validity of the individual domain scores of the scale in comparison with other patient-reported measures of ill health, namely the Columbia Rating Scale and the Hoehn and Yahr (HY) score. While moderate to strong correlations were found between the scales for dimensions measuring physical aspects of health status (mobility and ADL; Spearman’s correlation, r > 0.5), psychosocial aspects had weak correlations (emotions, stigma, and social support; r < 0.3).⁸³ In contrast, correlations between related domain scores of the PDQ-39 and the Short Form (36) Health Survey (SF-36) were strong (−0.8 ≤ r ≤ −0.66).⁸¹ The US-based study reported similar findings, with strong correlations between related domain scores of the PDQ-39 and SF-36 (−0.88 ≤ r ≤ −0.59), with the exception of the subscale measuring social support (r = −0.22).⁷⁹ In addition, the PDQ-39 generally had strong correlations with five measures of symptom severity (tremor, stiffness, slowness, freezing, and jerking) as measured by related scales of the SF-36 (0.21 ≤ r ≤ 0.74).⁷⁹ Concurrent validity in the English version of the PDQ-39 was assessed only by Harrison et al., who compared the performance of the PDQ-39 with other established measures of disease severity, depression, and anxiety.⁸⁵ Domains of the PDQ-39 that were related to the Beck depression inventory scores, Barthel index, and the Royal Postgraduate Medical School severity scale had moderate to strong correlations (r ranged from 0.3 to 0.73).⁸⁵

The US-based study assessed the discriminative ability of the PDQ-39 by measuring the scale’s ability to discriminate between the stages of PD. Respondents consistently indicated a significantly higher score for each domain of the PDQ-39 with progressive worsening of five measures of symptom severity (tremor, stiffness, slowness, freezing, and jerking).⁷⁹ The discriminative ability was further demonstrated by Damiano et al., who found that higher (poorer) PDQ-39 domain and index scores were associated with more severe HY stages and dyskinesia as well as the presence of comorbidities.⁸⁴

The developers of the PDQ-39 reported moderate responsiveness for two of its domains (standardized mean change over time: 0.55 and 0.43 for mobility and ADL, respectively); responsiveness for the other six domains was low.⁸⁰ Harrison et al. assessed the comparative responsiveness of the PDQ-39 and other established measures of mood and motor function (the General Health Questionnaire-28 and the Office of Population and Census Surveys disability instrument) in a UK population.⁸⁵ Results from their study showed superior responsiveness of the PDQ-39 and its subscales to change over time (except for the domains involving emotion and bodily discomfort).

Damiano et al. evaluated floor and ceiling effects in patients with varying degrees of PD severity using self-completed and telephone interview versions of the PDQ-39. Both modes of administration generally showed low floor and ceiling effects across different domains (range: 0.0% to 6.1% for floor effects and 1.5% to 31.3% for ceiling effects); these were essentially eliminated by the index score. However, the stigma and social support subscales had noticeably high ceiling effects, indicating that a high proportion of study participants had maximum scores for these two domains.⁸⁴ These findings were consistent with the cross-national validation study,⁸³ where generally low floor and ceiling effects were seen across different domains (< 15% and < 5%, respectively). However, the stigma and social support domains had large floor effects (> 20% and > 50%, respectively), indicating that a substantial proportion of the study participants scored at the floor (i.e., zero); the floor effect was virtually eliminated by the index score.

The only study examining the scaling assumption in the English version of the PDQ-39 was the cross-national study by Jenkinson et al.⁸³ The authors reported a higher-order factor analysis to create a single index score, the PDQSI. The index score had eigenvalues greater than 1, and explained > 50% of the variance, supporting the scaling assumptions.

Minimal Clinically Important Difference

One study by the original research group that developed the PDQ-39 scale investigated the minimal important difference (MID) for the index score and across different domains. A postal survey of randomly selected patients from 13 local branches of the UK Parkinson’s Disease Society was conducted; the response rate was 53% (N = 728). No information on PD severity or anchoring was provided. Findings from the study showed a varying mean MID for different domains: mobility (−3.2), ADL (−4.4), emotional well-being (−4.2), stigma (−5.6), social support (−11.4), cognition (−1.8), communication (−4.2), pain (−2.1), and overall score (−1.6).⁶²

Clinical Global Impression – Improvement

The CGI scale is a generic measure that provides a brief, overall assessment of the clinician’s view of the patient’s global functioning, which is used to compare before- and after-treatment changes. It consists of three components: Severity of Illness (CGI-S), Global Improvement (CGI-I), and the Efficacy Index (CGI-E). Scores on the Severity of Illness subscale range from 1 (“not ill at all”) to 7 (“among the most extremely ill”). The Global Improvement subscale is also rated on a 7-point scale where 1 = very much improved since the initiation of treatment; 2 = much improved; 3 = minimally improved; 4 = no change from baseline (the initiation of treatment); 5 = minimally worse; 6 = much worse; and 7 = very much worse since the initiation of treatment. Both the severity and improvement subscales are companion one-item measures that generally track with each other such that improvement in one follows the other. However, the anchors for scoring are different; the CGI-I measures changes from baseline/pre-treatment, whereas the CGI-S measures changes from the prior measure (which can be in the preceding week). Consequently, the two scores can occasionally be dissociated if there is a substantial lapse between the measures.¹⁰⁸,¹³⁹

No validation study was found for the CGI-I specifically; however, one international cross-sectional study by Martínez-Martín et al. evaluated the validity of the CGI-S among PD patients.¹⁰⁷ In order to assess construct validity, the CGI-S subscale was correlated with the HY stage, the Clinical Impression of Severity Index for Parkinson Disease (CISI-PD), and the Patient Global Impression-Severity (PGI-S). Correlations were high between the CGI-S and both the HY stage and the CISI-PD (r > 0.80), but relatively moderate between the CGI-S and the PGI-S (r = 0.61). In addition, the concordance between the CGI-S and the HY stage as well as the CISI-PD was high across all stages of PD (78.5% and 84.3%, respectively) but moderate between the CGI-S and the PGI-S (61%). Concordance between tests for the severity levels of the four scales was further demonstrated by a moderate Kendall’s coefficient of concordance and generalized kappa statistic (0.67 and 0.52, respectively). These findings are consistent with real-world scenarios, where ratings using patient-administered instruments are generally higher than clinician- or investigator-administered ones, accounting for the rather modest concordance indices.¹⁰⁷

Minimal Clinically Important Difference

No information on the MCID of the CGI-I in patients with PD was found.

Unified Parkinson’s Disease Rating Scale

The UPDRS is the standard instrument for measuring parkinsonian signs and symptoms. The scale is composed of four parts: Part I (mentation, behaviour, and mood; four items), Part II (ADL; 13 items), Part III (motor examination; 14 items), and Part IV (complications of therapy and symptoms including dyskinesia and “off” state; 11 items). Individual items in parts I to III are scored on a 5-point scale (0 to 4), with higher scores indicating worse symptoms, while Part IV includes a number of items scored either numerically (like parts I to III) or using zero or one (no or yes, respectively). The full scale takes 10 minutes to 20 minutes to administer, and the total score ranges from 0 (no disability) to 199 (worst disability). The ranges of scores for the subscales are: 1) Mentation, Behaviour and Mood (0 to 16); 2) ADL (0 to 52); 3) Motor Examination (0 to 108); and 4) Complications of Therapy (0 to 23).¹⁴⁰ In addition, a number of subscales that assess specific signs of PD (including but not limited to tremor, rigidity, bradykinesia, and postural instability/gait disturbance) have been derived from combining certain items within the scale.¹¹⁸
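
The stated score ranges can be checked with simple arithmetic, as in the sketch below. The maxima for parts I and II follow from the item counts and the 0-to-4 item scoring given above; the maxima for parts III and IV are taken as stated in the text rather than derived, because they do not follow from a simple item count. Variable names are hypothetical.

```python
ITEM_MAX = 4  # items in parts I to III are scored 0 to 4

part_maxima = {
    "I": 4 * ITEM_MAX,    # four items scored 0 to 4
    "II": 13 * ITEM_MAX,  # 13 items scored 0 to 4
    "III": 108,           # taken as stated in the text
    "IV": 23,             # taken as stated (mix of 0-to-4 and yes/no items)
}

total_max = sum(part_maxima.values())
print(part_maxima["I"], part_maxima["II"], total_max)  # 16 52 199
```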

The subscales, four sections, and overall score of the UPDRS have been thoroughly assessed in different languages, including Hebrew, Hungarian, Italian, Japanese, and Spanish, with clinic- and population-based samples of varying PD severity.¹²¹–¹²⁶ Despite the scale’s comprehensive coverage of motor symptoms and wide utilization, the MDS has noted a few weaknesses, including limited non-motor screening items, “flaws and ambiguities” for some items, and inadequate instructions for raters.¹¹⁷ The MDS commissioned a revision of the original scale in 2007, termed the MDS-sponsored UPDRS revision. The new version demonstrated improved psychometric properties in different settings in addition to large-scale comparisons with the original version.¹¹¹ However, the study included in this review used the original UPDRS; therefore, the psychometric properties of the original version are summarized here. Because the literature on the validity of the UPDRS in different settings across the world is abundant, only evidence for the English version of the original UPDRS and its subsections is presented here.

Full-Scale Assessment

Four studies assessed the measurement properties of the original UPDRS in patients with varying degrees of PD severity, including an early validation study by the Cooperative Multicentric Group,¹¹⁹ one multi-centre RCT that included Canada and the US,¹¹⁸ a systematic review of 11 measures of PD,¹²⁰ and a large, multi-centre, cross-sectional study.¹¹³,¹¹⁵

The Cooperative Multicentric Group reported that the administration time for the UPDRS was brief (10 minutes to 20 minutes).¹¹⁹ The multi-centre study showed a low percentage of missing data across the four subscales (< 10%), indicating a high completion rate.¹¹³,¹¹⁵ Together, these results demonstrate that the scale can be easily administered within a short period in clinical trial and population settings.

Across all studies, internal consistency was found to be adequate for the full scale as well as its subscales (0.79 ≤ Cronbach’s alpha ≤ 0.96).¹¹³,¹¹⁵,¹¹⁹,¹²⁰ Studies assessing inter-rater reliability found moderate to high inter-rater agreement for most items (0.50 < kappa < 0.90) and fair agreement for a few (0.40 < kappa < 0.50), with total scores highly correlated among raters (r = 0.98).¹¹⁹,¹²⁰ Additionally, a stable factor structure and high internal consistency were shown across “off” and “on” states for the UPDRS Part III.¹²⁰ Only one multi-centre trial (including Canada and the US) evaluated the test–retest/intra-rater reliability of the UPDRS as measured by neurologists among patients with early-stage PD (EPD). It reported excellent test–retest reliability for the total score (ICC 0.92) and substantial-to-excellent reliability for its subsections (ICC range: 0.74 to 0.90).¹¹⁸ The individual items within the subsections had a varying degree of intra-rater reliability (weighted kappa range: 0.49 to 0.75); the lower scores were likely due to including patients with relatively mild symptoms.¹¹⁸

A panel of 13 international experts independently rated the relevance of the scales and items of the UPDRS to assess content validity, with endorsement by at least 75% of the experts needed to establish satisfactory content validity.¹¹³ With the exception of the UPDRS Part III (83.3%), none of the subscales met this standard (endorsement rate: 40% to 50%). A similarly low proportion of items in UPDRS parts I, II, and IV achieved adequate content validity compared with UPDRS Part III.¹¹³ The systematic review by Ramaker et al. identified several key symptoms of motor impairment and disabilities affecting daily life and reported that these features were represented in the UPDRS motor examination and the ADL subscale, albeit unequally and without any weighting.¹²⁰

Criterion validity was assessed by one study, and a strong correlation was found between the UPDRS and HY scale (r = 0.71).¹¹⁹ The same study assessed construct validity by comparing the UPDRS with other known measures of disability and functional impairment (convergent validity). Between-scale correlations were strong for the Intermediate Scale for Assessment of PD and the Schwab and England Scale (SES) (r > 0.80 for both), as well as for the Mini-Mental State Examination and the Hamilton Scale for Depression (r = 0.53 and 0.64, respectively).¹¹⁹ Authors of the large multi-centre study also evaluated the construct validity of the UPDRS by examining its relationship with the HY score and SES. All four subscales showed moderate to strong correlations with both measures (0.4 < |r| < 0.75).¹¹³,¹¹⁵ Findings from the systematic review corroborated these results.¹²⁰

The subscales’ discriminative ability was supported by a linearly increasing trend in all subscale scores with the progression in HY stage.¹¹³,¹¹⁵ For all subscales, observed and possible score ranges coincided, with no patients at the upper end of the UPDRS parts I, II, and IV. Large floor effects were seen for UPDRS parts I and IV (> 20%), and all scales reflected small ceiling effects (< 1%).¹¹³,¹¹⁵ This was likely due to the predominant inclusion of patients with milder symptoms, resulting in skewed data since patients in severe stages unable to participate in research studies were not adequately represented.¹¹³,¹¹⁵ The scale, and in particular parts I and IV, may not adequately distinguish patients with milder symptoms if they score low on these subscales.

The multi-centre cross-sectional study reported that the item-total corrected correlations were satisfactory overall for all subscales (Part I, 0.57 to 0.66; Part II, 0.30 to 0.80; Part III, 0.25 to 0.74; and Part IV, 0.26 to 0.76); however, a few items within the ADL, motor, and complications-from-treatment subscales had lower correlations.¹¹³,¹¹⁵ Similarly, in the earlier study, strong item-total correlations were found for most items (0.60 < r < 0.81) and low correlations for others (0.22 < r < 0.50), with lower consistency for items related to depression, motivation/initiative, and tremor.¹¹⁹ Multi-trait scaling analysis showed a high scaling success rate for all four subscales (> 90%). Factor analysis in both studies showed all UPDRS items together explained almost 60% of the variance (58.5%).¹¹³,¹¹⁵,¹¹⁹

Part I (Mentation, Behaviour, and Mood)

A US-based study assessed the sensitivity and specificity as well as criterion (concurrent) validity by comparing related items of UPDRS Part I with criterion tests for dementia, psychosis, and depression; namely, the Telephone Interview for Cognitive Status, psychiatric assessment for psychosis, and the Geriatric Depression Scale. Overall, results for concurrent validity showed moderate to strong correlations with criterion measures (0.38 ≤ r ≤ 0.66). The discriminatory power of the test was fair (area under the receiver operating characteristic curve [AUROC] ~0.70). However, low or modest ranges of sensitivity (0.19 to 0.61) and specificity (0.48 to 0.87) were found for the individual items on the subscale, even with an optimal cut-off point.¹¹²

Part II (Activities of Daily Living)

No study was found that reported the validity of the English version of UPDRS Part II specifically; however, one study conducted in Canada and the US assessed the investigator–patient (inter-rater) reproducibility of UPDRS parts I and II using clinical trial data. For assessments done at baseline and 12 months, substantial clinician–patient agreement was found for both UPDRS Part I and Part II (concordance correlation range: 0.6 to 0.7 and 0.78 to 0.81, respectively). Most items on the subscales achieved moderate to substantial agreement (0.40 < kappa ≤ 0.80), with the exception of a few items whose lower agreement was attributed to limited response variability or their subjective nature.¹⁰⁹

Part III (Motor Examination)

One study by Metman et al. investigated the intra-rater reliability of the UPDRS Part III by clinicians in patients with advanced PD. In both “on” and “off” states, clinicians had excellent agreement in total UPDRS Part III score (ICC ~0.90).¹¹⁶

Part IV (Complications of Therapy and Symptoms)

No evidence specifically validating the UPDRS Part IV (English version) was available.

Minimal Clinically Important Difference

Schrag et al. estimated MCIDs for the UPDRS ADL, motor, and total scores retrospectively among patients with EPD using data from two independent, active-controlled RCTs (N = 603 total).¹¹⁴ Using the CGI-I as an anchor, the minimal change representing the MCID following six months of anti-PD treatment was determined. They reported a mean change of 5 and 8 points on the UPDRS motor and total scores, respectively, as the cut-off, corresponding to HY stages I to III. For the UPDRS ADL score, a mean change of 2 and 3 points was found to be appropriate for HY stages I/I.5 to II and II.5/III, respectively.¹¹⁴

Three other studies provided estimates of MCIDs for subscales and total score among patients with varying stages of PD. One cross-sectional study (N = 653, representing all PD stages) used both distribution and anchor-based (based on SES, HY stage, and Short Form (SF)-12) approaches and reported MCIDs of 2.3 points for the UPDRS motor score and 4.1 points for the total score.¹¹⁰ Hauser et al. analyzed data from two separate, placebo-controlled RCTs that included EPD patients (N = 404) and patients with motor fluctuations (N = 472); total duration was six months. Using the CGI-I as an anchor, they reported an MCID of 0.5 to 0.7 points for the ADL score, 2 to 2.4 points for the motor score, and 3 to 3.5 points for the total score (I to III).⁶¹ Using data from two other double-dummy, placebo-controlled RCTs among EPD patients (N = 539, 33 weeks) and APD patients (N = 517, 18 weeks), Hauser et al. estimated the MCIDs for UPDRS parts II and III using the CGI-I and PGI-I as anchors. Among EPD patients, the range of MCIDs for UPDRS parts II and III was 1.8 to 2 points and 6.1 to 6.2 points, respectively. Among APD patients, the range of MCIDs for UPDRS parts II and III was 1.8 to 2.3 points and 5.2 to 6.5 points, respectively.⁶³

EuroQol 5-Dimensions 3-Levels

The EQ-5D-3L is a generic, preference-based HRQoL instrument that has been applied to a wide range of health conditions and treatments, including PD.¹²⁷,¹²⁸ The first of two parts of the EQ-5D-3L is a descriptive system that classifies respondents (aged ≥ 12 years) into one of 243 distinct health states. The descriptive system consists of the following five dimensions: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each dimension has three possible levels (1, 2, or 3) representing “no problems,” “some problems,” and “extreme problems,” respectively. Respondents are asked to choose one level that reflects their own health state for each of the five dimensions. A scoring function can be used to assign a value (EQ-5D-3L index score) to self-reported health states from a set of population-based preference weights.¹²⁷,¹²⁸ The second part is a 20 cm visual analogue scale (EQ VAS) that has end points labelled 0 and 100, anchored at “worst imaginable health state” and “best imaginable health state,” respectively. Respondents are asked to rate their own health by drawing a line from an anchor box to the point on the EQ VAS that best represents their own health on that day. The EQ-5D-3L produces three types of data for each respondent:

a profile indicating the extent of problems on each of the five dimensions represented by a five-digit descriptor, such as 11121, 33211, etc.
a population preference-weighted health index score based on the descriptive system
a self-reported assessment of health status based on the EQ VAS.
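
The descriptive system above can be illustrated with a minimal sketch. It assumes the five-digit profile lists the dimensions in the order given in the text; the names `DIMENSIONS`, `LEVELS`, `parse_profile`, and `ALL_STATES` are hypothetical and are not part of any EuroQol software.

```python
from itertools import product

# Dimension order is assumed to follow the order given in the text.
DIMENSIONS = (
    "mobility",
    "self-care",
    "usual activities",
    "pain/discomfort",
    "anxiety/depression",
)
LEVELS = {1: "no problems", 2: "some problems", 3: "extreme problems"}


def parse_profile(profile: str) -> dict:
    """Map a five-digit descriptor such as '11121' to a level label per dimension."""
    if len(profile) != len(DIMENSIONS) or any(ch not in "123" for ch in profile):
        raise ValueError("profile must be five digits, each 1, 2, or 3")
    return {dim: LEVELS[int(ch)] for dim, ch in zip(DIMENSIONS, profile)}


# Three levels on each of five dimensions give 3**5 = 243 distinct health states.
ALL_STATES = ["".join(map(str, combo)) for combo in product((1, 2, 3), repeat=5)]
```

For example, `parse_profile("11121")` reports “some problems” for pain/discomfort and “no problems” for the other four dimensions.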

The EQ-5D-3L index score is generated by applying a multi-attribute utility function to the descriptive system. Different utility functions are available that reflect the preferences of specific populations (e.g., US or UK). The lowest possible overall score (corresponding to extreme problems on all five dimensions) varies depending on the utility function that is applied to the descriptive system (e.g., −0.59 for the UK algorithm and −0.109 for the US algorithm). Scores less than 0 represent health states that are valued by society as being worse than dead, while scores of 0 and 1.00 are assigned to the health states “dead” and “perfect health,” respectively. A five-level version of the EQ-5D (the EQ-5D-5L) is now available and is also commonly used.¹²⁷^,¹²⁸
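
As a rough illustration of how a utility function turns a profile into an index score, the sketch below subtracts a decrement for each reported problem from 1.00. The decrements are invented for illustration only and do not correspond to the UK, US, or any other published value set; real value sets are population-specific and may include additional terms. The names `HYPOTHETICAL_DECREMENT` and `illustrative_index` are likewise hypothetical.

```python
# HYPOTHETICAL decrements per level, invented for illustration only.
# They are NOT taken from any published EQ-5D-3L value set.
HYPOTHETICAL_DECREMENT = {1: 0.0, 2: 0.05, 3: 0.30}


def illustrative_index(profile: str) -> float:
    """Return 1.00 minus the summed hypothetical decrements for a five-digit profile."""
    if len(profile) != 5 or any(ch not in "123" for ch in profile):
        raise ValueError("profile must be five digits, each 1, 2, or 3")
    return 1.0 - sum(HYPOTHETICAL_DECREMENT[int(ch)] for ch in profile)
```

With these made-up weights, the state 11111 scores 1.00 and the worst state, 33333, scores below 0, which mirrors how some health states can be valued as worse than dead.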

The EQ-5D has been extensively validated across countries around the world and in various conditions; however, evidence of its validity in patients with PD is relatively sparse. A systematic review by Xin et al. assessed the construct validity (convergent validity and discriminative ability) and responsiveness of the instrument in patients with PD.⁶⁴ Results from six studies showed that the correlations between the EQ-5D-3L index score and the PDQ-8 summary score, HY staging, and UPDRS total score were strong (r = −0.75), moderate (−0.53 < r < −0.32), and moderate to strong (0.39 < |r| < 0.72), respectively. Five studies provided adequate information for the assessment of the discriminative ability of the EQ-5D-3L; four showed satisfactory ability of the index score to distinguish patients based on the presence of apathy, dyskinesia, “wearing-off” period, and sweating disturbances.⁶⁴ The remaining study found the EQ-5D-3L was adequate in differentiating clinically different groups based on PD severity as well as various motor and non-motor symptoms; however, the discrimination was more evident between mild and severe cases of PD and less so between adjacent stages.⁶⁴

The responsiveness of the EQ-5D-3L was reported in 12 studies in the aforementioned systematic review.⁶⁴ Six studies showed a statistically significant change in the EQ-5D index score over time, which was consistent with other established scales used as reference measures, including the UPDRS Part II, PDQ-39, PDQ-8, HY score, and Hospital Anxiety and Depression Scale (HADS). In the remaining six studies, the aforementioned measures did not show a consistent pattern of increase or decrease with the progression of disease.⁶⁴

Minimal Clinically Important Difference

Information regarding an MCID for the EQ-5D among PD patients is scarce. The aforementioned systematic review reported an estimated MCID of 0.10 (range: 0.04 to 0.17) and 0.11 (range: 0.08 to 0.14) based on the UPDRS and PDQ-39 score, respectively; however, this was obtained from a conference abstract.⁶⁴ Other reported MCIDs for the EQ-5D-3L range from 0.03 to 0.07.¹⁴¹

Zarit Burden Interview

The ZBI is a commonly used, generic, self-reported measure to assess caregiver burden resulting from providing continuous and long-term physical, psychosocial, and financial support to patients with a disabling condition. The scale originated as a 29-item questionnaire but was later revised into a 22-item scale (ZBI-22), the same version used in study 001/002.¹³⁴ In addition, multiple shorter versions of ZBI are available, ranging from as short as one item to 18 items.¹³³ The 22-item scale is available in at least 47 languages, including English, French, German, Hindi, Japanese, Korean, Mandarin, Portuguese, Russian, and Spanish.¹³⁵ The ZBI-22 measures the perceived burden of caregivers for each of the 22 items on a 5-point Likert scale, ranging from 0 (never) to 4 (nearly always). Scores are then summed to create a total score that can range between 0 and 88, with higher scores indicating more burden.¹³²^,¹³³ The ZBI score is interpreted using the following cut-offs:

0 to 20 (little or no burden)
21 to 40 (mild to moderate burden)
41 to 60 (moderate to severe burden)
61 to 88 (severe burden).
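
The scoring rule described above can be sketched as follows. The function names are hypothetical, and a total of exactly 21 is assigned to the mild to moderate band, which is an assumption of this sketch.

```python
def zbi22_total(item_scores: list) -> int:
    """Sum 22 item scores, each rated from 0 (never) to 4 (nearly always)."""
    if len(item_scores) != 22:
        raise ValueError("the ZBI-22 has exactly 22 items")
    if any(score not in (0, 1, 2, 3, 4) for score in item_scores):
        raise ValueError("each item must be an integer from 0 to 4")
    return sum(item_scores)


def zbi22_category(total: int) -> str:
    """Classify a ZBI-22 total score (0 to 88) using the listed cut-offs."""
    if not 0 <= total <= 88:
        raise ValueError("total must be between 0 and 88")
    if total <= 20:  # assumption: 21 belongs to the next band
        return "little or no burden"
    if total <= 40:
        return "mild to moderate burden"
    if total <= 60:
        return "moderate to severe burden"
    return "severe burden"
```

For example, a caregiver answering 2 on every item would have a total of 44, which falls in the moderate to severe band.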

Even though a number of studies examined the validity, reliability, and reference values of the ZBI-22 scale, only two cross-sectional studies involved caregivers of PD patients. However, neither used the English version. Hagell et al. assessed the psychometric properties of the Swedish version of the ZBI-22 among family caregivers of patients with PD.¹³¹ Another multi-centre Spanish study conducted a validity assessment on several clinician and self-administered scales, including the ZBI-22.³⁵

In both studies, the scaling assumptions were assessed from item-total correlations to determine the appropriateness of summing item scores into a total score. The item-total correlations ranged from 0.42 to 0.80 in the Swedish study¹³¹ and from 0.31 to 0.78 in the Spanish study,³⁵ indicating that scaling assumptions were met for most items. Factor analysis identified five factors that accounted for 80% of the variance. These factors were the most important predictors of caregiver burden, including the psychological well-being of the caregivers themselves, clinical aspects of disease, patients’ mood, and the HRQoL of patients and caregivers alike.³⁵ Targeting was assessed to determine how well the scale scores accord with the range of burden in the sample. Overall, targeting was satisfactory, as demonstrated by a floor/ceiling effect of < 2% and skewness ranging from 0.27 to 0.67.³⁵^,¹³¹

Internal consistency was satisfactory in both studies, with Cronbach’s alpha 0.95 and 0.93 in the Swedish and Spanish studies, respectively.³⁵^,¹³¹
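
For readers unfamiliar with these statistics, the sketch below shows one common way to compute Cronbach’s alpha and an item-total correlation from a table of responses (one row per respondent, one column per item). It assumes the “corrected” form of the item-total correlation, in which each item is correlated with the sum of the remaining items; the source does not state which form the two studies used. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(rows: list) -> float:
    """alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def pearson(x: list, y: list) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def corrected_item_total(rows: list, item: int) -> float:
    """Correlate one item with the sum of all other items (assumed 'corrected' form)."""
    item_scores = [row[item] for row in rows]
    rest_scores = [sum(row) - row[item] for row in rows]
    return pearson(item_scores, rest_scores)
```

Values of alpha close to 1, such as the 0.95 and 0.93 reported above, indicate that the items vary together closely across respondents.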

Construct validity was assessed in both studies. Hagell et al. reported moderate to strong correlations (range 0.36 to 0.69) between the ZBI-22 and a number of scales and questionnaires for measuring similar aspects of health, including the SF-36 version 1, the sleep section of the Nottingham Health Survey, patients’ PD duration, and the PD Activities of Daily Living Scale.¹³¹ The Spanish multi-centre study also reported moderate to strong correlations between the ZBI-22 and the global and physical components of several HRQoL measures, including the Barthel Index, CGI-S, EQ-5D, HY score, Scales for Outcomes in Parkinson’s Disease-Motor ADL, and the SF-36 (0.25 < |r| < 0.60), and relatively strong correlations with a number of mental health components of the SF-36 and HADS (r > 0.60).³⁵ Further validity was demonstrated when caregivers with a disease of their own reported, as expected, higher ZBI-22 scores than those without any concomitant disease, and caregivers’ time and strain were associated with the scale score.³⁵^,¹³¹ Finally, the discriminative performance of the ZBI-22 was assessed using the area under the receiver operating characteristic curve (AUROC), which represents the ability to accurately classify people with and without burden according to chosen cut-points. The 11 chosen cut-points for the ZBI-22 showed high discriminative ability, with an AUROC of 0.98 and a similarly high Youden index of 0.96.¹³¹ In the Spanish study, the ZBI-22 also showed superior discriminative validity, accurately registering higher scores at more advanced PD stages.³⁵
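
The two discrimination statistics mentioned above can be sketched as follows. The sketch assumes that each caregiver has a scale score and a reference classification (burdened or not) from some external criterion; how that criterion was defined in the cited study is not described here. The AUROC is computed as the proportion of burdened/non-burdened pairs in which the burdened caregiver has the higher score (ties count one half), and the Youden index at a cut-point is sensitivity + specificity − 1. The function names and the example data are hypothetical.

```python
def auroc(scores: list, burdened: list) -> float:
    """Probability that a burdened caregiver scores higher than a non-burdened one."""
    positives = [s for s, b in zip(scores, burdened) if b]
    negatives = [s for s, b in zip(scores, burdened) if not b]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def youden_index(scores: list, burdened: list, cut: float) -> float:
    """Sensitivity + specificity - 1 when scores at or above `cut` are called burdened."""
    tp = sum(1 for s, b in zip(scores, burdened) if b and s >= cut)
    fn = sum(1 for s, b in zip(scores, burdened) if b and s < cut)
    tn = sum(1 for s, b in zip(scores, burdened) if not b and s < cut)
    fp = sum(1 for s, b in zip(scores, burdened) if not b and s >= cut)
    return tp / (tp + fn) + tn / (tn + fp) - 1


# Made-up example data, for illustration only.
example_scores = [10, 18, 32, 30, 45, 60]
example_burdened = [False, False, False, True, True, True]
```

An AUROC of 0.5 corresponds to chance-level classification and 1.0 to perfect separation; a Youden index of 1.0 means the cut-point yields no false positives and no false negatives.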

Minimal Clinically Important Difference

No information on MCIDs for the ZBI-22 was found among caregivers of patients with PD.

Conclusion

The PDHD was validated and found reliable in just a single study.
The PDQ-39 is an MDS-recommended, self-administered, multi-dimensional, PD-specific HRQoL instrument that has been thoroughly validated. Overall, internal consistency and test–retest reliability (inter- and intra-rater) were high. Construct and discriminant validity, as well as responsiveness, were relatively higher for domains measuring physical and functional aspects of PD than for those measuring psychosocial symptoms.
The CGI and its subscales have limited evidence of validity; no studies specifically validating CGI-I were found. The CGI-S showed high concordance with other clinician-administered instruments, but relatively lower concordance with patient-administered instruments. An MCID for CGI-I was also not found.
The UPDRS and its four subscales have been validated extensively across different populations and are recommended by the MDS. The scale is easy to administer and had excellent inter-rater and intra-rater agreement when administered by different health professionals. Evidence of validity is satisfactory overall. Correlations between the UPDRS and a number of established measures of disability and functional impairment are high. However, some items in Part I are redundant or have suboptimal internal consistency, low sensitivity, and low specificity. Notably, validation metrics are not commonly measured in both “on” and “off” states and data from severe cases are disproportionately represented.
The EQ-5D-3L is a well-validated measure of generic HRQoL that correlates strongly with physical attributes of standard measures of functional disabilities, but relatively weakly with psychosocial attributes. The scale showed varying responsiveness to clinical changes and limited sensitivity to detect changes in milder PD cases.
The ZBI-22 has been validated for caregivers of patients with various conditions, including PD. Studies showed that scaling assumptions were generally met, discriminative validity was adequate, and internal consistency and correlations with the mental health components of established HRQoL measures were high, but correlations with the global and physical components of these measures were moderate. The physical and psychological well-being of caregivers, clinical severity of disease, patients’ and caregivers’ mood, and HRQoL all affected caregiver burden. No information was found regarding test–retest reliability, responsiveness, or MCID.

The copyright and other intellectual property rights in this document are owned by CADTH and its licensors. These rights are protected by the Canadian Copyright Act and other national and international laws and agreements. Users are permitted to make copies of this document for non-commercial purposes only, provided it is not modified when reproduced and appropriate credit is given to CADTH and its licensors.

Except where otherwise noted, this work is distributed under the terms of a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International licence (CC BY-NC-ND), a copy of which is available at http://creativecommons.org/licenses/by-nc-nd/4.0/

Bookshelf ID: NBK539554
