SPECIAL REPORT

Clinical Outcomes Assessment in Multiple Sclerosis

Richard Rudick, MD,* Jack Antel, MD,† Christian Confavreux, MD,‡ Gary Cutter, PhD,§ George Ellison, MD,¶ Jill Fischer, PhD,* Fred Lublin, MD,** Aaron Miller, MD,†† John Petkau, PhD,‡‡ Stephen Rao, PhD,§§ Stephen Reingold, PhD,¶¶ Karl Syndulko, PhD,*** Alan Thompson, MD,††† Joy Wallenberg, MD,‡‡‡ Brian Weinshenker, MD,§§§ and Ernest Willoughby, MD¶¶¶

This article represents initial deliberation of an international task force appointed by the US National Multiple Sclerosis Society to develop recommendations for optimal clinical assessment tools for multiple sclerosis clinical trials. Presented within this article are the key issues identified by the task force during its initial year of deliberation. These include the precise purpose for a clinical assessment tool, the clinical dimensions to be measured in a multidimensional outcome measure, desirable attributes of an optimal clinical outcome measure, the complexities of multidimensional outcome measures, the relative merits of categorical clinical ratings and quantitative functional assessments, and a number of other important design issues that relate to the use of a multidimensional outcome measure. An action plan for analysis of existing data is summarized, as are the plans for more detailed recommendations from the task force.

Rudick R, Antel J, Confavreux C, Cutter G, Ellison G, Fischer J, Lublin F, Miller A, Petkau J, Rao S, Reingold S, Syndulko K, Thompson A, Wallenberg J, Weinshenker B, Willoughby E. Clinical outcomes assessment in multiple sclerosis. Ann Neurol 1996;40:469-479

Assessing the impact of experimental intervention for multiple sclerosis (MS) requires clinical outcome assessment tools [1], but precise clinical measurement in MS patients is difficult for many reasons. The clinical manifestations vary widely in different patients, and vary within a given patient over time.
Furthermore, the clinical course is not usually characterized by steady worsening, but rather by variable episodes of clinical worsening followed by improvement, by long periods of stability, or by phases of steadily progressive clinical deterioration. This is complicated by the fact that neurological impairment and disability are inherently difficult to quantify. Thus, precise and universally accepted assessment tools for use in MS clinical trials have been difficult to develop. In response to these difficulties, the National Multiple Sclerosis Society (NMSS) sponsored an international workshop titled “Outcomes Assessment in Multiple Sclerosis Clinical Trials: A Critical Analysis” in Charleston, South Carolina, in February 1994. Among other deliberations, participants at the workshop identified desirable attributes for clinical measurements; endorsed a multidimensional assessment measure that contains the multiple, relatively independent clinical dimensions of MS, including cognitive function; and agreed that no existing clinical scale was optimal [2].

From the *Mellen Center, Cleveland Clinic Foundation, Cleveland, OH; †Montreal Neurological Institute, Montreal, Quebec, Canada; ‡Hôpital de l'Antiquaille, Lyon, France; §AMC Cancer Institute, Denver, CO; ¶University of California at Los Angeles Medical Center, Los Angeles, CA; **Jefferson Medical College, Philadelphia, PA; ††Maimonides Medical Center, Brooklyn, NY; ‡‡University of British Columbia, Vancouver, British Columbia, Canada; §§Medical College of Wisconsin, Milwaukee, WI; ¶¶National Multiple Sclerosis Society, New York, NY; ***VA Medical Center West Los Angeles, Los Angeles, CA; †††Institute of Neurology, Queen Square, London, United Kingdom; ‡‡‡Berlex Laboratories, Richmond, CA; §§§Mayo Clinic, Rochester, MN; and ¶¶¶Auckland Hospital, Auckland, New Zealand.

The report from the Charleston meeting [2] stated:
There is a clear need for development of new assessment systems, probably based upon the best aspects of the EDSS scales [Kurtzke Expanded Disability Status Scale]. Any new system must be multidimensional and quantitative. Preferentially, its scoring should be automated to speed the process and to improve consistency from assessment to assessment, between raters and among centers. It should have adequate evaluation of cognition for which there are many validated, though not currently practical, systems.

The authors constitute the Clinical Outcomes Assessment Task Force, under the Advisory Committee on Clinical Trials of New Agents in Multiple Sclerosis of the United States National Multiple Sclerosis Society.

Received Mar 13, 1996. Accepted for publication Mar 13, 1996.

Address correspondence to Dr Richard Rudick, Mellen Center, Cleveland Clinic Foundation, Cleveland, OH.

Copyright © 1996 by the American Neurological Association

The Multiple Sclerosis Clinical Outcomes Assessment Task Force (herein referred to as the Task Force) was convened by the NMSS Advisory Committee on Clinical Trials of New Agents in Multiple Sclerosis as a direct result of the Charleston workshop, to develop recommendations for optimal clinical assessment measures for use in future MS clinical trials. The Task Force consists of 16 members representing five countries, with expertise in neurology, psychology, biostatistics, epidemiology, and drug development. The Task Force will develop recommendations for endorsement by the NMSS Advisory Committee on Clinical Trials. The recommendations will be then forwarded to the Medical Advisory Board of the NMSS for their approval. Subsequently, these guidelines will be distributed as a resource for investigators and clinical trial sponsors seeking optimal design strategies for MS clinical trials.
The recommendations will also be forwarded to the International Federation of Multiple Sclerosis Societies for their consideration and further dissemination. A similar process resulted in guidelines for clinical trials in motor neuron disease [3]. The purpose of this article is to provide a summary of the Task Force deliberations to date as background for its recommendations, which will follow.

470 Annals of Neurology Vol 40 No 3 September 1996

Goal of the Task Force

At its initial meeting (Chicago, October 14-15, 1994) the Task Force agreed that the purpose for any clinical outcome assessment tool must be clear in advance of developing the outcome measure. The group concluded that in its activity, it would strive to recommend a clinical outcome assessment tool for future MS clinical trials. The purpose of this clinical outcome assessment tool will be to reflect the impact of an intervention on the progression of disease. Therefore, the outcome measure must be useful for demonstrating clinical change due to MS. The outcome measure will be multidimensional to reflect the principal ways MS affects an individual, have high reliability and validity, be sensitive to change over short time intervals to permit demonstration of a therapeutic effect, and must be both practical and cost-effective. [Italics are authors’ emphasis.]

Rationale for Task Force Goal

The Task Force goal was adopted after considerable discussion regarding the component phrases, those italicized in the previous quote. In the sections that follow, details explaining the Task Force goal are provided.

1. “The Task Force will recommend a clinical outcome assessment tool for future MS clinical trials. The purpose of this clinical outcome assessment tool will be to reflect the impact of an intervention on the progression of disease.”

The most pressing need recognized by the Task Force was for an assessment tool to facilitate progress in developing effective MS disease therapies. The Task Force recognized that no individual assessment tool would be optimal for all clinical purposes. For example, a clinical outcome assessment tool for controlled clinical trials probably would not be optimal for classifying individual patients by disease type, for routine clinical care, for specialized purposes like testing symptomatic therapies (e.g., drugs for bladder dysfunction), or for economic or quality of life studies. These were considered important purposes for clinical outcome measures, but the Task Force decided to focus on developing an outcome measure for clinical trials focused on interventions designed to modify the disease course. The Task Force agreed to focus on developing a measure for disease progression, defined as sustained neurological or neuropsychological deterioration with resulting disability. We recognized that an outcome assessment measure tailored to this purpose may not be optimal for assessing the characteristics of relapses. Assessment measures for detecting and grading relapses may comprise a subsequent goal of the Task Force.

2. “. . . the outcome measure must be useful for demonstrating clinical change due to MS.”

Sensitivity to disease progression, also called sensitivity or responsiveness [4, 5], was considered a key attribute for a clinical assessment tool used in MS clinical trials. If the clinical measure does not change during the course of the trial in the control group, it cannot be useful as an outcome measure because it will fail to demonstrate a difference in the active treatment arm, unless there is marked improvement in the treated group. The need for adequately sensitive or responsive measures is considered one of the greatest challenges, because clinical trials are currently conducted over a 2- to 3-year time frame while populations of MS patients experience clear clinical deterioration over time frames longer than 10 years. The required sample size for a clinical trial is affected not only by the trial duration, but also by the responsiveness of the clinical outcome measure. Currently, using a placebo control group and traditional clinical outcome measures such as the EDSS, it has been possible to demonstrate statistically significant therapeutic effects with approximately 150 patients per treatment arm and a study duration of 2 to 3 years. As trials become more complex, with multiple treatment arms, sample sizes will have to increase significantly, adding to the complexity and cost of studies. More sensitive outcome measures could reduce the required sample size or length of study, thereby conserving resources.

The need for increased outcome measure sensitivity also dramatically escalates with partially effective therapies, as illustrated by the following. Assume one wishes to determine the effect of an intervention compared to a placebo. Further, assume that 40% of the placebo group experiences a significant worsening in 3 years compared to only 24% of the active treatment group. Using a two-tailed test of significance with α = 0.05 and 1 - β = 0.80 (power = 80%), the study would require 132 subjects per group, or a total of 264 subjects. Once a treatment that accomplishes this goal (e.g., clinical worsening is reduced from 40% to 24%) is identified, a trial is designed to test an alternative therapy against the partially effective new treatment. The study is designed with the same assumptions, only the new treatment is required to reduce the worsening by 40% from the previous partially effective result of 24% (i.e., to 14.4%). The new study would require 260 patients per group or a total of 520 patients, which represents a doubling of the sample size. This doubling of the sample size would require either many more clinical centers or longer duration of enrollment to answer the question of treatment effectiveness.
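As a rough check on the arithmetic in this example, the sketch below recomputes the per-group sample sizes. The article does not say which formula was used, so the arcsine (Cohen's h) normal approximation here is an assumption, chosen because it gives values close to the 132 and 260 quoted above. The function name `n_per_group` is illustrative only.

```python
from math import asin, sqrt
from statistics import NormalDist


def n_per_group(p_control, p_treated, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-tailed comparison of two proportions.

    Assumption: arcsine (Cohen's h) normal approximation; the source article
    does not state which sample size formula it used.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power
    # Cohen's h: difference between the arcsine-transformed proportions
    h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treated))
    return 2 * (z_alpha + z_beta) ** 2 / h ** 2


# First trial: placebo 40% worsening versus 24% on treatment
first_trial = n_per_group(0.40, 0.24)
# Second trial: the 24% comparator versus a further 40% relative reduction (14.4%)
second_trial = n_per_group(0.24, 0.24 * (1 - 0.40))

print(f"first trial:  about {first_trial:.1f} per group")
print(f"second trial: about {second_trial:.1f} per group")
print(f"ratio: {second_trial / first_trial:.2f}")
```

The same relative reduction applied to a lower baseline rate yields a smaller absolute difference, which is why the requirement roughly doubles under this approximation.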
Thus, as partially effective therapies are identified, the required sample size or length of follow-up increases in subsequent trials using the same outcome measure. This example illustrates a problem that has already arisen with the advent of partially effective interventions [6-8]. Assessment measures that change more quickly (i.e., are more sensitive) might allow smaller sample sizes or shorter studies for subsequent clinical trials attempting to improve on earlier results.

3. “The outcome measure will be multidimensional to reflect the principal ways MS affects an individual. . . .”

Clinical heterogeneity typifies MS patient populations. Individual MS patients manifest impairments in visual, sensory, pyramidal, cerebellar, brainstem, and forebrain association pathways with consequent functional problems with vision, ambulation, sensory function, leg and arm weakness and coordination, bowel and bladder control, and cognitive function. This clinical heterogeneity prompted earlier MS investigators to propose multidimensional outcome measures [9]. The Task Force endorsed the approach of including the main clinical dimensions of MS in a primary multidimensional outcome measure, although it is recognized that not every clinical manifestation could or should be included in a clinical outcome measure. The choice, then, is to determine which dimensions to include.

The Task Force reviewed two separate methods for determining the principal clinical dimensions of MS that should be measured. The first method, factor analysis, is a statistical method that groups variables or tests by their relatedness to other variables or tests [10]. The second method involves the written and verbal reports of patients and professionals experienced with the disease [11]. Given the sources of data that are available in well-characterized MS populations, both methods converge in identifying similar clinical dimensions.
Factor analysis has been applied to a number of data sets from MS clinical trials [12, 13] that include subjects with relapsing-remitting as well as chronic progressive MS. Factors identified by these analyses are remarkably consistent across data sets. The following dimensions of MS have been identified by this method: (1) leg dysfunction; (2) arm dysfunction; (3) sensory dysfunction (superficial touch, position sense, and possibly vibration threshold); (4) visual dysfunction; (5) mental or cognitive dysfunction; and (6) bowel, bladder, and sexual dysfunction. The analyses also demonstrated that these six dimensions are highly correlated with neuropsychological testing, elements from the standard neurological examination, the Kurtzke EDSS and functional system scores (FSSs), quantitative motor testing, and patient self-report measures. For example, the pyramidal FSS from the EDSS was most closely associated with the leg dysfunction dimension, and the cerebellar and brainstem FSSs were most closely associated with the arm dysfunction dimension.

Clinicians on the Task Force endorsed the validity of these six clinical dimensions identified by factor analyses. Specifically, neurologist members of the Task Force who see large numbers of MS patients in their practices generally concurred that the six areas were reasonable clinical dimensions on which to focus measurements. Additionally, the neurologists on the Task Force recommended that gait and mobility testing be included as a separate factor from leg function.

The Charleston meeting participants had endorsed the inclusion of neuropsychological assessment as a component of the clinical outcome measure.
The Task Force reviewed existing studies of cognitive function in MS patients and agreed that cognitive assessments can be practical to administer, are sensitive to change in MS patients, and correlate with changes in a relevant nonclinical measure, cranial magnetic resonance imaging (MRI), in properly controlled trials. A community-based sample of MS patients tested longitudinally indicated that some individuals experience progressive neuropsychological impairment over a 3-year time frame and that this had demonstrated criterion validity; that is, worsening on neuropsychological tests was correlated with increasing MRI forebrain disease burden [14]. Furthermore, analysis from recent clinical trials demonstrated treatment effects on measures of complex attention [15]. The Task Force therefore concluded that tests of neuropsychological functioning should be considered for inclusion in a clinical outcome measure.

While the Task Force supports the use of a clinical outcome measure that will optimally assess the above-noted principal MS clinical dimensions, it has not made a final determination about whether or not to include measures for all six dimensions or specifically which measurement tools should be used for each dimension.
This will depend on the availability of simple and efficient methodology for reliably quantifying the dimension (e.g., this may be very difficult for bowel, bladder, and sexual dysfunction, and remains to be demonstrated for neuropsychological dysfunction), the relationship between change among the variables (e.g., if two variables always change together, including both is redundant), the sensitivity to detect change (e.g., if a particular variable does not change appreciably in the time course of a controlled clinical trial, it will not be useful as part of a multivariate or composite outcome measure), and the relative frequency of occurrence of that dimension in a particular sample of MS patients (e.g., if only a small subset of patients can be expected to show a particular sign or performance change, then it may not be useful or cost-effective in detecting statistically significant change in the whole sample). Coming to a consensus about measures for the relevant clinical dimensions is an ongoing Task Force activity.

Special Report: Rudick et al: Clinical Outcomes Assessment in MS 471

4. “The outcome measure will . . . have high reliability and validity, be sensitive to change over short time intervals to permit demonstration of a therapeutic effect, and must be both practical and cost-effective.”

Desirable attributes for MS clinical outcome measures were discussed at the Charleston consensus workshop [2]. These attributes (Table 1) significantly influence the effectiveness of outcome measures when applied in the clinical setting [16].

Table 1. Desirable Attributes of Clinical Outcome Measuresᵃ for MS Trials

Attribute | Explanation
Performance attributesᵇ |
Level of measurementᶜ | The score should be quantitative to the extent possible, and the distance between points on the scale should be known.
Reliability | The score should have a high intrarater and interrater reliability, or for self-report measures should have high test-retest reproducibility.
Sensitivity | The measure should be sensitive to clinical change over a relatively short time interval.
Validity | The clinical outcome measure should have demonstrable validity as discussed below.
Practical advantages |
Easy to administer | The test measure should be easy and quick to administer.
Acceptable to patients and health care professionals | The measurement technique should be consistent with comfort, safety, and compliance.
Resource efficient | The test measure should conserve time and resources.

ᵃA “measure” is a set of rules designed to assign numbers to relevant phenomena (e.g., leg or visual function). In MS, demographic measures (e.g., age, gender) are straightforward and do not require complex rules; “disease measures” are used to measure constructs (our hypotheses about the ways the MS disease process affects the individual, e.g., sensory dysfunction, ataxia). Disease measures are more complicated and require a more elaborate set of rules.
ᵇAn attribute is a characteristic of a measure.
ᶜScores from clinical measures are used to place an individual along a scale. Scales can be nominal, ordinal, or interval scales, which have varying characteristics, as discussed in the text.

LEVEL OF MEASUREMENT. Measures are used to order individual scores within scales [17]. Measures can be grouped within nominal, ordinal, interval, or ratio scales. Nominal scales group individual cases without rational quantitative relationships among the categories (e.g., males or females, African Americans or Whites). In ordinal scales, or so-called ordered classifications, scores represent groupings of some underlying measurement scale. The Kurtzke EDSS is an example of an ordinal scale. Ordinal scales are used under two different circumstances. First, the ordinal scale can represent a grouping of scores derived using a continuous scale (e.g., 75-100% of normal function is grouped as normal; 50-75%, as minimal impairment; 25-50%, as moderate; and 0-25%, as severe). Second, the ordinal scale can be used when the phenomenon in question cannot be measured using a continuous scale. In that setting, the ordered classification represents an attempt to approximate the continuous scale by a cruder scale that is the best one can do at the time. The quantitative distance between steps on ordinal scales in this circumstance may be unknown, and it is generally inappropriate to use arithmetic operations and parametric statistical tests to analyze change on ordinal scales. In interval or continuous scales, scores are ordered with an indication of how far apart the objects are from one another, and for ratio scales scores are also assigned with respect to the distance from an absolute zero. Examples of interval or continuous scales are timed tests of neurological function [18]. Interval scales offer the potential advantage that arithmetic operations, such as taking differences between the end and beginning of the trial, are meaningful. This makes statistical analysis of change straightforward. The Task Force recognized the advantages of interval scales, when they are available to measure the dimensions of interest.

RELIABILITY. This refers to the reproducibility of an outcome measure. The usefulness of an outcome measure is directly related to its reliability, as change over time from disease progression or improvement can be obscured by variability derived from the outcome measure itself. Standard methods can be used to assess reliability, including repeated measurements made by the same rater in the same session or over short time intervals such as successive days (intrarater or test-retest reliability), and repeated measurements made on the same subject by different raters (interrater reliability).

VALIDITY. Validity is defined as measuring what one intends to measure. Various types of validity have been defined and discussed in detail elsewhere (e.g., see [19, 20] for particularly illuminating discussions). Criterion validity refers to cross-validation of the outcome measure with another relevant measure, such as MRI. Predictive validity refers to the ability of a measure to predict future clinical status. For example, significant change on a quantitative test of upper extremity function observed during a 12-month period may not be of obvious clinical significance, but may be shown to have predictive validity by demonstrating a relationship between short-term change and inability to use the arm for feeding 5 years later. In this context, the outcome measure (e.g., quantitative assessment of upper extremity function) may predict the subsequent behavior on a criterion variable (e.g., ability to dress or feed oneself). Predictive validity is particularly important when making short-term measures of a slow chronic process. The Task Force recommends demonstrating the predictive validity for outcome measures when the clinical relevance of a short-term change is not obvious [1].

PRACTICAL ADVANTAGES. Costs relate to personnel, equipment, space, and time requirements. Optimally, administration time for the clinical outcome assessment measures should be brief. The clinical assessment measure should also be acceptable to both neurologists and patients. Testing should be comfortable and safe for the patient. Any instrumentation must be highly reliable, easy for training and administration, and usable over a wide range of patient disability status [21].

Complexities and Challenges in Achieving the Goal

Several challenges and complexities in improving MS outcome measures have been considered by the Task Force.
These include the difficulty in precisely quantifying neurological function; the need to evaluate different existing outcome measures for their potential value in a new outcome assessment measure; and the complexities of a multivariate outcome measure, which is required to simultaneously measure multiple clinical dimensions.

1. Quantifying Neurological Function

The World Health Organization [22] distinguished impairment from disability and handicap. According to this classification, impairment is caused by the underlying disease process and results in abnormalities evident on the neurological examination. Functional consequences of these impairments resulting in problems with activities of daily living are termed disabilities. These can be quantified by standardized timed tests of neurological function. The vocational, social, or role limitation resulting from the interaction between disability and the environment are termed handicaps, which can be measured by quality of life methodologies. The Task Force advocated measuring impairment with categorical clinical ratings, and disability using quantitative functional assessments. Quality of life scales, while considered extremely important, were not thought appropriate to directly quantify the neurological effects of the disease process.

The relative merits of categorical clinical ratings and timed functional assessment in an MS clinical outcome measure are unclear. It was recognized, however, that timed functional assessments may offer significant advantages over clinical rating scales because function can be measured reliably and expressed on an interval scale. In contrast, the neurological examination is commonly expressed as an ordinal scale with nonlinear and indistinct boundaries between steps [23, 24].

2. Comparing Alternative Assessment Tools

There is a clear need to compare reliability, sensitivity, predictive validity, and cost-effectiveness of quantitative tests of neurological function with available clinical ratings in MS patients to guide development of an optimal clinical outcome measure. However, comparison of measures that use different metrics is difficult. A variety of statistical procedures have been proposed [4, 25-32], but only a few have been applied to data from MS patients. These techniques can be used to assess relative sensitivity to disease progression (change over time in placebo-treated patients) and to treatment effect (differential change over time related to treatment).

One of the first methods applied in comparative assessment of MS measures was signal-to-noise ratio (SNR) analysis [12, 33]. The SNR for linear change is a ratio of the standard deviation for the linear orthogonal contrast (from a repeated-measures analysis of variance) of a measure across all time points to the average standard deviation for third-order and higher orthogonal contrasts. It provides a quantitative index of how strongly the change over time for a given measure approximates a linear function and how steep the slope is. The higher the linear SNR, the steeper the slope and the better the fit to a straight line. The expectation is that a good candidate assessment measure should have a SNR higher than 2.0. Syndulko and colleagues [33] showed that quantitative functional assessments had favorable SNRs compared with clinical ratings of neurological impairment in the placebo group from a multicenter cyclosporine clinical trial.

A second candidate statistical procedure is effect size [30, 34, 35].
The value of effect size is to provide an absolute scale along which to rank sensitivity of different outcome assessment measures to disease progression or treatment effects using data from clinical trials. An effect size is a standardized estimate of the magnitude of a statistically determined experimental effect, such as change over time in a group or the difference between two groups [30]. The effect size provides a standardized unit of measurement for comparing the sizes of changes for different outcome measures within a study or between studies. It is used in addition to the statistical p value to help interpret and compare the meaningfulness of the changes found in a study. It is also used to calculate the sample size required for a proposed study to show statistically significant changes between or within groups. Cohen [35] presented a wide variety of effect size measures, their use for sample size determinations, and their interpretation or meaningfulness. By using an effect size measure calculated from the results of a repeated-measures analysis of variance, called the f index by Cohen, quantitative functional assessments in the multicenter cyclosporine clinical trial were found to have larger effect sizes than were categorical clinical ratings [33]. Confirmation of this finding would mean that smaller sample sizes would be required in clinical trials using quantitative functional assessments as the primary outcome assessment as opposed to clinical rating scales.

3. Complexities of a Multivariate Outcome Measure

In a multidimensional outcome measure, the score reflects more than one clinical dimension.
At least in the lower part of the range, the EDSS represents a multidimensional outcome measure that uses the neurological examination to make clinical ratings of the different dimensions of MS (e.g., the functional systems), which are then combined into a single rating [36]. Such derived measures, often called composites, have been used widely in medicine to characterize complex clinical phenomena [37]. Roberts [38] provided a succinct discussion of the advantages and disadvantages of such composite measures.

A multivariate outcome measure, on the other hand, is a collection of individual component measures, each grading a particular aspect of the disease (e.g., ambulation, arm function, vision, cognitive function, bladder function, and sensory function) and each retaining its identity as an individual component. Each component could itself be a composite measure based on a number of measures of that same clinical dimension (e.g., a “cognitive” score could represent the sum of three individual test scores). If it is not clear how the components corresponding to the different dimensions should be combined into a composite outcome measure, then statistical approaches are required for dealing with such a multivariate outcome.

Dealing with a multivariate outcome entails numerous statistical complexities that were addressed by a member of the Task Force [39]. Suppose there are k such dimensions with corresponding component measures that are continuous variables. Assume that the principal parameter of interest is change (i.e., the difference between the final and baseline visits) and that a treatment arm is to be compared with a placebo arm. Two quite different simple statistical approaches can then be described. With the most common approach, the component for each dimension is analyzed separately.
The assessment of the treatment effect for each dimension is based on the z score resulting from the difference between the mean change scores on the corresponding component measure in the treated and placebo groups. But the level of type I error for the comparison on each dimension is adjusted for the number of dimensions, from α to α/k (Bonferroni adjustment), to ensure that the overall type I error for all k comparisons is no more than the target level of α. The main difficulty with this approach is the need to increase the sample size to achieve the more conservative α/k significance level at a given level of power and the fact that the overall type I error will be somewhat less than the target value of α, thereby resulting in less overall power than might otherwise be possible. Consider the example provided earlier for a comparison that reduces the worsening from 40% in the placebo group to 24% in the intervention group. If we used measures for six clinical dimensions of MS, we would then use a revised type I error of 0.05/6 ≈ 0.008. The effect of this lowered α = 0.008 corrected for the multiple comparisons is to increase the required sample size from 132 to 205, an increase of about 55%.

An alternative simple approach (based on Hotelling’s T² statistic) deals with the components corresponding to all dimensions simultaneously to yield an overall assessment of the treatment effect. For uncorrelated components, this assessment is based on the sum of the squares of the z scores resulting from the differences between the mean change scores on the individual component measures in the treated and placebo groups. This approach can be viewed as one method of combining the components of a multivariate outcome measure into a composite outcome measure (the sum of the squares of the z scores), where the composite is constructed on a purely statistical basis.
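Returning to the Bonferroni example given earlier in this passage, the sketch below shows how lowering the per-comparison type I error inflates the sample size. It assumes the arcsine (Cohen's h) normal approximation, which the article does not name but which gives values close to those quoted; `n_per_group` is a hypothetical helper, not something from the source.

```python
from math import asin, sqrt
from statistics import NormalDist


def n_per_group(p_control, p_treated, alpha, power=0.80):
    """Subjects per arm for a two-tailed test of two proportions.

    Assumption: arcsine (Cohen's h) normal approximation, not confirmed
    by the source article.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treated))
    return 2 * (z_alpha + z_beta) ** 2 / h ** 2


k = 6  # number of clinical dimensions tested separately
unadjusted = n_per_group(0.40, 0.24, alpha=0.05)
# Bonferroni: alpha / k, rounded to 0.008 as in the text
adjusted = n_per_group(0.40, 0.24, alpha=round(0.05 / k, 3))
increase = adjusted / unadjusted - 1

print(f"unadjusted alpha: about {unadjusted:.1f} per group")
print(f"Bonferroni alpha: about {adjusted:.1f} per group")
print(f"relative increase: {increase:.0%}")
```

Only the critical value changes between the two calls; the assumed treatment effect and power are held fixed, so the whole increase is the price of the stricter significance level.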
Because the assessment is based on the squares of the z scores, this sum-of-squares approach can be described as an omnibus comparison of treatment and placebo, in that the analysis would indicate there is a difference between treatment and placebo, but would not indicate the direction of the difference or on which dimensions change occurs. The individual z scores would have to be inspected to determine the behavior of each component measure, and the direction of change. With this approach, an overall significant effect could be obtained in the absence of significant effects on any of the individual dimensions, and similarly no differences may be seen overall, even when there is a significant change in one dimension. The design implications of this second approach relate primarily to power and sample sizes. The effect of including additional measures using the second approach can be illustrated as follows. If a single measure were to show a treatment effect ([mean change score on treatment - mean change score on placebo]/standard deviation of change score) of 1/2, then with a two-tailed test of significance with α = 0.05, a sample size of 100 patients per group would result in a power of 94%. Adding uncorrelated components that showed no treatment effect would result in substantially reduced power, or substantially larger sample sizes to maintain the same statistical power. For example, the addition of four uncorrelated component measures that showed no treatment effect would reduce the power from 94% to 79%, or increase the sample size required to maintain the power of 94% from 100 to 153 per group. This illustrates the danger of including dimensions that do not show a treatment effect when this alternative approach for a multivariate outcome measure is used. In contrast, the addition of uncorrelated components that do show treatment effects results in improved power or reduced sample sizes required for the same statistical power.
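The power figures in this illustration can be approximated with a short calculation. The sketch below assumes uncorrelated components, equal group sizes, and a known standard deviation of the change score, so that the sum of squared z scores follows a noncentral chi-square distribution with k degrees of freedom and noncentrality equal to n/2 times the sum of squared standardized effects. All function names are hypothetical, and the distribution functions are hand-rolled series so that the block needs only the standard library; it is an illustration under these assumptions, not the computation used by the Task Force.

```python
import math
from functools import lru_cache


def chi2_cdf(x, df):
    """Central chi-square CDF via the series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = total = 1.0 / a
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))


@lru_cache(maxsize=None)
def chi2_crit(df, alpha=0.05):
    """Upper-alpha critical value of the central chi-square, by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def ncx2_sf(x, df, nc, terms=200):
    """P(X > x) for a noncentral chi-square, as a Poisson mixture."""
    if nc <= 0:
        return 1.0 - chi2_cdf(x, df)
    cdf = 0.0
    for j in range(terms):
        log_w = -nc / 2 + j * math.log(nc / 2) - math.lgamma(j + 1)
        cdf += math.exp(log_w) * chi2_cdf(x, df + 2 * j)
    return 1.0 - cdf


def omnibus_power(effect_sizes, n_per_group, alpha=0.05):
    """Power of the sum-of-squared-z test for the given standardized effects."""
    k = len(effect_sizes)
    nc = sum(d * d for d in effect_sizes) * n_per_group / 2.0
    return ncx2_sf(chi2_crit(k, alpha), k, nc)


def n_for_power(effect_sizes, target, alpha=0.05):
    """Smallest per-group n whose omnibus power reaches the target."""
    n = 2
    while omnibus_power(effect_sizes, n, alpha) < target:
        n += 1
    return n


if __name__ == "__main__":
    single = omnibus_power([0.5], 100)
    diluted = omnibus_power([0.5, 0, 0, 0, 0], 100)
    print(f"one responsive measure:        power {single:.2f}")
    print(f"plus four unresponsive ones:   power {diluted:.2f}")
    print("n per group to restore power:", n_for_power([0.5, 0, 0, 0, 0], single))
```

Passing several nonzero effects, for example `omnibus_power([0.5, 0.5], 100)`, shows the opposite case, in which additional responsive components raise the power.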
The overall conclusion from this analysis is that there is a significant disadvantage to including component measures that do not show a treatment effect on change over time in a multivariate outcome measure, but considerable benefit in including multiple independent component measures that are sensitive to change. Petkau [39] provided a more detailed discussion of these complexities, including comparisons with other approaches to combining the component measures of a multivariate outcome measure into a composite outcome measure [40, 41]. In addition, Cutter (personal communication, 1995) considered the effect of the relationship between composite and component change by computer simulations. These studies confirmed that the sample size requirement or power of the clinical trial is directly influenced by correlations between the components in a multivariate outcome measure. The implications of the complexities of a multivariate outcome measure for designing MS studies are summarized as follows (Table 2): (1) There are both risks and benefits in using a multivariate outcome measure for MS studies. In particular, caution should be used to avoid including multiple dimensions that do not change with treatment, as this makes the detection of change more difficult. Caution should also be used to avoid including multiple, highly correlated measures because they add little new information about change while increasing the total variability, again making the detection of change more difficult. (2) All components of a multidimensional outcome measure should have optimal performance characteristics (e.g., they should have high reliability, validity, and sensitivity [see Table 1]). (3) The inclusion of various measures of the principal clinical dimensions should be based on both expert opinion and careful empirical investigations in which available data on candidate outcome assessments are analyzed to determine their variability and sensitivity to change over time.

Table 2. Attributes of a Multivariate Clinical Outcome Measure for MS Trials

Measure Attribute: Explanation
Multidimensional: The outcome measure is based on components that measure different key dimensions of the disease.
Individual components change over time: The individual components of the outcome measure change in a significant proportion of the population.
Components change independently: Change in the individual components is relatively independent from other components.
Applicable to range of MS severity most often included in MS trials: Available scores should allow classification of all patients and avoid ceiling effects.

How Should the MS Clinical Outcome Measure Be Used?

It is necessary not only to identify an optimal outcome measure, but also to define how it should be used, and to address its impact on important trial design issues [42]. For example, one could simply look at the difference in change scores between treatment groups on the outcome measure or one could define significant change (e.g., a certain amount of worsening or improvement) and compare the proportions in the treatment arms who change by this amount. Alternatively, one could conduct a time analysis (e.g., time to an event, such as treatment failure). The optimal approach will be determined by characteristics of the outcome measure, the specific goal of the clinical trial, and a number of statistical factors. Questions that need to be addressed are discussed below.

1. Should study entry be restricted to a range of performance on the outcome measure? There have been attempts to restrict entry into MS clinical trials to improve subject homogeneity, in order to increase the sensitivity for observing a therapeutic effect.
Entry into trials has typically been restricted to a subpopulation based on the outcome measure to be used in the trial (e.g., ≥2 exacerbations in the past 2 years, EDSS of 1.0-3.5, etc). It will obviously be necessary to restrict entry to trial subjects who can be adequately evaluated by a particular outcome measure, and it seems unlikely that a single outcome measure will be optimal for the entire disease severity spectrum. Whether it will be appropriate to restrict entry to a limited subset based on performance on a particular outcome measure is unclear at present.

2. Is a “run-in” period necessary? MS patients frequently remain clinically stable for long periods of time, so an observation phase without active treatment, termed a run-in period, has been used to select patients with active disease for inclusion in the trial. Generally speaking, short run-in periods are likely to be noninformative in this regard due to the slow pace of clinical change, but potential advantages include allowing the trial participants time to become comfortable with the study personnel and the measurement techniques, to identify practice effects, and to minimize their effects. Disadvantages of a run-in period include the need for additional resources. Additionally, the course of MS is notoriously variable, and there is no predictable relationship between observed clinical change during a run-in period and subsequent clinical change during the treatment phase [43]. A common observation has been that subsets of patients with very active disease during observation have much less active disease during the treatment trial. One explanation for this observation is the phenomenon known as regression toward the mean. This occurs when groups are selected on the basis of their extreme performance on a selection measure (e.g., selecting MS patients with high relapse rates prior to entering a trial).
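The selection effect just defined can be illustrated with a small simulation. In the sketch below every number is hypothetical: each simulated patient has a fixed underlying relapse rate (drawn from a gamma distribution with mean 0.8 per year), relapse counts in the 2 years before and the 2 years after entry are independent Poisson draws at that same rate, and no treatment is given. Selecting only patients with at least 2 pre-entry relapses still produces an apparent drop in relapse rate at follow-up.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson count with mean lam (Knuth's multiplication method)."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate(n_patients=20000, years=2, min_relapses=2, seed=1):
    """Mean relapse counts before and after entry, with and without selection.

    All parameters are illustrative assumptions, not values from any trial.
    """
    rng = random.Random(seed)
    pre_all, post_all, pre_sel, post_sel = [], [], [], []
    for _ in range(n_patients):
        rate = rng.gammavariate(2.0, 0.4)   # fixed true annual rate, mean 0.8
        pre = poisson(rng, rate * years)    # relapses before entry
        post = poisson(rng, rate * years)   # relapses after entry, untreated
        pre_all.append(pre)
        post_all.append(post)
        if pre >= min_relapses:             # entry criterion based on activity
            pre_sel.append(pre)
            post_sel.append(post)

    def mean(values):
        return sum(values) / len(values)

    return {
        "all_pre": mean(pre_all),
        "all_post": mean(post_all),
        "selected_pre": mean(pre_sel),
        "selected_post": mean(post_sel),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:>14}: {value:.2f} relapses per 2 years")
```

In the full simulated population the two periods have essentially the same mean, whereas the selected subgroup shows fewer relapses after entry than before even though no patient's underlying rate changed.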
Because only those patients who are above the eligibility level are entered into the trial and followed for change, there is a tendency for the subsequent mean levels to be lower. This results because of the natural variability in the course of disease. Therefore, relapse rates in some of the patients entered into the trial regress back to lower levels. The averaging of those more representative values with values for those who continue to express high relapse rates lowers the overall mean at follow-up because patients who would have balanced the group, had the entire population been represented, were excluded from the study because they failed entrance screening. Thus, such selection results in an artificial change in the outcome measure from baseline to follow-up. The same phenomenon applies to all clinical outcome measures. Because of the slow pace of clinical change and the phenomenon of regression to the mean, run-in periods should be reserved for familiarizing subjects with the test procedures and allowing subjects to reach stable baseline levels on the outcome measures.

3. Should patients be stratified by clinical course? MS clinical trials have commonly restricted entry to particular types of disease based on their clinical course. However, it may not make clinical sense to distinguish between patients with relapsing-remitting MS and those with secondary progressive MS. These two types of MS are probably different stages of the same disease and so may simply represent different durations and severities of the disease. Furthermore, classification of subjects into one or the other category is in many cases ambiguous. Where an outcome measure is particularly sensitive to disease change in one subtype versus another, restricted enrollment or stratification is important.
On the other hand, there is evidence that primary progressive MS may be pathologically different from secondary progressive MS, in that there is less brain inflammation and gadolinium enhancement demonstrated by brain MRI [44]. Therefore, it would be rational to conduct separate clinical trials for patients with primary progressive MS or stratify by this in a combined trial when the new treatment is of plausible benefit to both groups.

4. How frequently are measurements needed? There may be a need for frequent measurements during short time periods (e.g., when determining a “slope” using a very sensitive outcome measure). However, making measures too frequently may introduce noise in some circumstances and may actually decrease power for showing an effect for change over time. The effect size (discussed already) can be used to evaluate outcome measures from completed clinical trials in simulated trials of variable duration and sampling intervals to help answer this question. For example, in a 2-year clinical trial with evaluation visits every 3 months, the data from the trial can be used to create simulated trial durations of 6 months and 1 year in addition to the 2-year trial to compare sensitivity to change of candidate outcome assessments over the shorter time intervals. Similarly, examinations every 6 months or 1 year can be used to evaluate the effects of longer interexamination intervals on sensitivity to change.

5. How does the outcome measure influence dropouts in clinical trials? Dropouts remain a problem in MS clinical trials. Consequently, it is necessary to include a plan to handle the impact of dropouts when initiating an MS clinical trial. It may not be adequate to simply inflate the sample size calculation to compensate for the dropout rate.
Inflating the sample size and ignoring the dropouts makes an assumption called noninformative censoring, which means that disease progression in the dropouts behaves in the same manner as in the participants who continue to be followed. In both the cyclosporine study [13] and the interferon beta-1b (Betaseron) study [45], disease progression in dropouts appeared to be greater than that in completers, however, so the assumption of noninformative censoring is not always met. The “intent to treat” analysis, which is standard practice in clinical trials, forces the inclusion of dropouts into efficacy analyses. If the disease behaves differently in the dropouts from the study completers, then this has the potential for obscuring treatment effects and reducing the power of the experiment, particularly when disease progression in the dropouts is different in the active and control arms of the study. Insensitive clinical outcome measures can contribute to this problem, because trial participants may perceive that they are worse and withdraw from the study before change is detected by an insensitive outcome measure. Innovative approaches to handling dropouts will be required. Allowing clinical trial participants to pursue alternative treatments when they become “treatment failures” and employing more sensitive outcome measures may be the best approach to this problem. New techniques aimed at longitudinal data analyses do enable the experience of the dropouts to be included, but outcome measures that identify poor responders early may be more helpful than the usual assumptions necessary for dealing with dropouts.

6. Should the same outcome assessment measures be used to detect worsening and improvement? Relatively few clinical trials have addressed the hypothesis that a given therapy produces improvement in MS-related disability, as opposed to slowing or preventing deterioration.
Given the tendency for MS-related disability to wax and wane in severity with fatigue and change in body temperature, similar difficulties are inherent in detecting significant improvement as in detecting worsening. Mean change in a composite score will be influenced heavily by the “noise” of minor fluctuations in composite disability scores. Noseworthy and colleagues [46], in a protocol directly assessing the ability of intravenous immunoglobulin (IVIg) to produce improvement in recently acquired but apparently fixed MS-related weakness, used quantitative isometric strength measurements of “targeted” muscle groups. As the mechanisms underlying neurological improvement (remyelination, spreading of excitatory sodium channels) are likely different from those underlying neurological deterioration (conduction block, demyelination, axonal degeneration), improvement and worsening may not be measurable as a continuum. Strict methodology would require that the frequency of worsening should be analyzed without regard to whether apparent improvement is seen on the same composite measure, when the hypothesis under study is whether a given agent can prevent progression of MS. There is insufficient experience at this time to determine whether the same measurements or composites will be adequate to assess improvement; furthermore, there is neither natural history experience nor experience from control groups in relevant clinical trials to address this question.

Conclusions and Plans

1. Current assessments in MS patients are not optimally sensitive and precise to detect changes in disease progression for trial durations less than 3 years. This results in large sample sizes or long study durations, or both. This problem will dramatically escalate with partially effective therapy, assuming that future studies incorporate active treatment comparison groups.
Subjects in the comparative group can be expected to worsen more slowly than a placebo-treated group in future trials, requiring more sensitive outcome measures, larger sample sizes, or longer durations.

2. Newer outcome measures and new approaches will be necessary in the future to accelerate progress in developing effective therapies. Change over short time intervals on new outcome measures may not be of obvious clinical significance. It will be important to demonstrate the predictive validity of new or modified outcome measures (i.e., change on the outcome measure must be linked to subsequent clinically significant change).

3. The clinical dimensions that should be considered for new MS clinical assessment measures include tests of leg function, ambulation, and mobility; arm function; and cognitive function. The arguments for a role of visual, sensory, and bowel and bladder testing are less compelling.

4. Optimal clinical assessment measures for clinical trials may not be optimal for evaluating patients during clinical practice, for clinical decision making, or for classifying disease type.

5. Quantitative functional assessment of neurological function may be a useful alternative to clinical ratings derived from the neurological examination. Quantitative functional assessments may offer advantages in terms of reliability, continuous rather than ordinal scales, and increased sensitivity to change over time, but potential disadvantages include unknown predictive validity, neurologist and patient acceptance, cost-effectiveness, and practicality, which remain to be demonstrated.

6. A methodology to compare different outcome assessment measures for their utility in clinical trials is needed. Candidate measures include SNRs and effect size. Methods for comparing different outcome assessment measures may not be limited to these two methods; alternatives need to be explored and the optimal method(s) utilized.

7. Multivariate clinical outcome measures present complexities and challenges. The use of a multivariate outcome measure may increase or decrease the power of an intervention trial, depending on whether the intervention affects multiple measures (increasing power), and how many measures not affected by the treatment (decreasing power) are included. The number of dimensions should be limited, and analyses should be conducted to confirm that the measures have favorable performance characteristics and change over time in untreated or placebo-treated MS patients.

8. It will be necessary to develop flexible outcome measures in order to meaningfully measure the spectrum of disease severity. As individual patients “bottom out” on measures appropriate for low-disability patients, more appropriate measures of the same clinical dimension would be substituted.

9. Analysis of existing data that have been collected from MS patients participating in controlled clinical trials and natural history studies will guide the Task Force in recommendations for optimal clinical assessment measures. In this regard, the Task Force developed and initiated a plan to evaluate data from completed clinical trials and natural history studies. The goals for this project are to analyze the behavior of placebo recipients and untreated MS patients using various clinical measures, to determine performance characteristics of these measures. The outcome from this analysis will be reconciled with the basic principles formulated in this position statement and used to formulate guidelines for optimal clinical outcome measures. The principles described here will be reconciled with the actual performance of standard clinical rating scales (e.g., EDSS, neurologic rating scale [47]), and quantitative tests of neurological function (e.g., validated, timed tests of physical and cognitive performance).

10.
It is anticipated that the Task Force will recommend criteria and specific measures that can be considered for inclusion in subsequent controlled clinical trials by investigators and sponsors. It will initially be necessary to utilize the recommended assessment measures concurrently with standard measures, so that the relative utility can be determined prospectively.

The work of the Task Force is supported by the US National Multiple Sclerosis Society with an unrestricted educational grant from Berlex Laboratories. The Task Force is grateful to Drs Theodore Munsat and John Whitaker for their helpful suggestions about the manuscript.

References

1. Weinshenker BG. Clinical outcome measures for multiple sclerosis. In: Goodkin DE, Rudick RA, eds. Multiple sclerosis: advances in clinical trial design, treatment, and future perspectives. London: Springer, 1996 (in press)
2. Whitaker JN, McFarland HF, Rudge P, Reingold SC. Outcomes assessment in multiple sclerosis clinical trials: a critical analysis. Multiple Sclerosis 1995;1:37-47
3. Munsat TL, Subcommittee on motor neuron diseases of the World Federation of Neurology research group on neuromuscular diseases, Airlie House “Therapeutic trials in ALS” workshop contributors. Airlie House guidelines. Therapeutic trials in amyotrophic lateral sclerosis. J Neurol Sci 1995;129:1-10
4. Guyatt GH, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987;40:171-178
5. Fitzpatrick R, Ziebland S, Jenkinson C, Mowat A. Importance of sensitivity to change as a criterion for selecting health status measures. Qual Health Care 1992;1:89-93
6. The IFNB Multiple Sclerosis Study Group. Interferon beta-1b is effective in relapsing-remitting multiple sclerosis. I. Clinical results of a multicenter, randomized, double-blind, placebo-controlled trial. Neurology 1993;43:655-661
7. Jacobs LD, Cookfair DL, Rudick RA, et al. Intramuscular interferon beta-1a for disease progression in relapsing multiple sclerosis. Ann Neurol 1996;39:285-294
8. Johnson KP, Brooks BR, Ford CC, et al. Copolymer 1 reduces relapse rate and improves disability in relapsing-remitting multiple sclerosis: results of a phase III multicenter, double-blind, placebo-controlled trial. Neurology 1995;45:1268-1276
9. Kurtzke JF. On the evaluation of disability evaluation in multiple sclerosis. Neurology 1961;11:686-694
10. Henderson WG, Fisher SG, Cohen N, et al, VA Cooperative Study Group on Cochlear Implantation. Use of principal components analysis to develop a composite score as a primary outcome variable in a clinical trial. Control Clin Trials 1990;11:199-214
11. Kurtzke JF. Neurological impairment in multiple sclerosis and the disability status scale. Acta Neurol Scand 1970;46:493-512
12. Dixon WJ, Kuzma JW. Data reduction in large clinical trials. Commun Stat 1974;3:301-324
13. Syndulko K, Tourtellotte WW, Baumhefner RW, et al. Neuroperformance evaluation of multiple sclerosis disease progression in a clinical trial: implications for neurological outcomes. J Neurol Rehabil 1993;7:153-176
14. Rao SM, Leo GJ, Haughton VM, et al. Correlation of magnetic resonance imaging with neuropsychological testing in multiple sclerosis. Neurology 1989;39:161-166
15. Fischer JS. Use of neuropsychological outcome measures in multiple sclerosis clinical trials: current status and strategies for improving MS trial design. In: Goodkin DE, Rudick RA, eds. Treatment of multiple sclerosis: advances in trial design, results, and future perspectives. London: Springer, 1996 (in press)
16. Thompson A. Evaluating neurological outcome measures: the bare essentials. J Neurol Neurosurg Psychiatry 1996 (in press)
17. Nunnally JC. Psychometric theory. New York: McGraw-Hill, 1967
18. Tourtellotte WW, Haerer AF, Simpson JF, et al. Quantitative clinical neurological testing. I. A study of a battery of tests designed to evaluate in part the neurologic function of patients with multiple sclerosis and its use in a therapeutic trial. Ann NY Acad Sci 1965;122:480-505
19. Stewart AL, Hays RD, Ware JE Jr. Methods of validating MOS health measures. In: Stewart AL, Ware JE Jr, eds. Measuring functioning and well-being. The medical outcomes study approach. Durham, NC: Duke University Press, 1993:309-325
20. Hays RD, Stewart AL. Construct validity of MOS health measures. In: Stewart AL, Ware JE Jr, eds. Measuring functioning and well-being. The medical outcomes study approach. Durham, NC: Duke University Press, 1993:325-345
21. Albert MS. Criteria for the choice of neuropsychological tests in clinical trials. In: Mohr E, Brouwers P, eds. Handbook of clinical trials. The neurobehavioral approach. Berwyn, PA: Swets & Zeitlinger, 1991:131-139
22. World Health Organization. International classification of impairments, disabilities, and handicaps. Geneva: World Health Organization, 1980
23. Noseworthy JH, Vandervoort MK, Wong CJ, et al. Interrater variability with the expanded disability status scale (EDSS) and functional systems (FS) in a multiple sclerosis clinical trial. Neurology 1990;40:971-975
24. Belendiuk G, Klatzman D, Mietlowski W, the Multiple Sclerosis Study group. Rating scales in assessment of multiple sclerosis. In: Davis R, Kondraski GV, Tourtellotte WW, Syndulko K, eds. Quantifying neurologic performance. Philadelphia: Hanley and Belfus, 1983:177-184
25. Anderson JJ, Chernoff MC. Sensitivity to change of rheumatoid arthritis clinical trial outcome measures. J Rheumatol 1993;20:535-537
26. Deyo RA, Centor RM. Assessing the responsiveness of functional scales to clinical change: an analogy to diagnostic test performance. J Chronic Dis 1986;39:897-906
27. Farrell AD. Structural equation modeling with longitudinal data: strategies for examining group differences and reciprocal relationships. J Consult Clin Psychol 1994;62:477-487
28. Hageman WJ. A further refinement of the Reliable Change (RC) Index by improving the pre-post difference score: introducing RC-sub(ID). Behav Res Ther 1993;31:693-700
29. Gottman JM, Rushe RH. The analysis of change: issues, fallacies, and new ideas. J Consult Clin Psychol 1993;61:907-910
30. Kazis LE, Anderson JJ, Meenan RF. Effect sizes for interpreting changes in health status. Med Care 1989;27(suppl):S178-S189
31. Meenan RF, Anderson JJ, Kazis LE, et al. Outcome assessment in clinical trials. Evidence for the sensitivity of a health status measure. Arthritis Rheum 1984;27:1344-1352
32. Susskind EC, Howland EW. Measuring effect magnitude in repeated measures ANOVA designs: implications for gerontological research. J Gerontol 1980;35:867-876
33. Syndulko K, Ke D, Ellison GW, et al. Comparative evaluation of neuroperformance and clinical outcome assessments in multiple sclerosis. III. Effect size for disease progression and treatment efficacy and its relationship to clinical trial duration and inter-examination intervals. Multiple Sclerosis 1996 (submitted)
34. Ottenbacher KJ, Barrett KA. Measures of effect size in the reporting of rehabilitation research. Am J Phys Med Rehabil 1991;70(suppl):131-137
35. Cohen J. Statistical power analysis for the behavioral sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988
36. Kurtzke JF. Rating neurologic impairment in multiple sclerosis: an expanded disability status scale (EDSS). Neurology 1983;33:1444-1452
37. Coste J, Fermanian J, Venot A. Methodological and statistical problems in the construction of composite measurement scales: a survey of six medical and epidemiological journals. Stat Med 1995;14:331-345
38. Roberts RS. Pooled outcome measures in arthritis: the pros and cons. J Rheumatol 1993;20:566-567
39. Petkau AJ. Statistical and design considerations for multiple sclerosis clinical trials. In: Goodkin DE, Rudick RA, eds. Multiple sclerosis: advances in clinical trial design, treatment, and future perspectives. London: Springer, 1996 (in press)
40. O’Brien PC. Procedures for comparing samples with multiple endpoints. Biometrics 1984;40:1079-1087
41. Goldsmith CH, Smythe HA, Helewa A. Interpretation and power of a pooled index. J Rheumatol 1993;20:575-578
42. Ellison GW, Myers LW, Leake BD, et al. Design strategies in multiple sclerosis clinical trials. Ann Neurol 1994;36:S108-S112
43. Goodkin DE, Hertsgaard D, Rudick RA. Exacerbation rates and adherence to disease type in a prospectively followed-up population with multiple sclerosis. Implications for clinical trials. Arch Neurol 1989;46:1107-1112
44. Thompson AJ, Kermode AG, Wicks D, et al. Major differences in the dynamics of primary and secondary progressive multiple sclerosis. Ann Neurol 1991;29:53-62
45. Rudick RA, Sibley W, Durelli L. Treatment of multiple sclerosis with type 1 interferons. In: Goodkin DE, Rudick RA, eds. Multiple sclerosis: advances in clinical trial design, treatment, and future perspectives. London: Springer, 1996 (in press)
46. Noseworthy JH, Rodrigues M, An K-N, et al. IVIg treatment in multiple sclerosis: pilot study results and design of a placebo-controlled, double-blind clinical trial. Ann Neurol 1994;36:325 (Abstract)
47. Sipe JC, Knobler RL, Braheny SL, et al. A neurologic rating scale (NRS) for use in multiple sclerosis. Neurology 1984;34:1368-1372
