Design Strategies in Multiple Sclerosis Clinical Trials

George W. Ellison, MD,* Lawrence W. Myers, MD,* Barbara D. Leake, PhD,† M. Ray Mickey, PhD,† Dershin Ke, MD,‡ Karl Syndulko, PhD,*‡ Wallace W. Tourtellotte, MD, PhD,*‡ The Cyclosporine Multiple Sclerosis Study Group

After analyzing our natural history data on the course of multiple sclerosis (MS) in more than 500 patients followed for 20 years and our experience in several therapeutic trials, we concluded that a phase III (full) trial for efficacy should have certain properties. For a power of 0.8, an α of 0.05, and an attrition rate of 10% per year, we think the trial should have a minimum sample size of 130 (65 in each arm, placebo versus active) if the design is based upon the proportion of subjects worsening by clinical measures. No stratification by entry Extended Disability Status Scale score is needed if worsening is defined as a change of 1.0 unit (two 0.5 steps) maintained for 90 days for an entry score of 1 to 5.0 units, or 0.5 unit (one 0.5 step) if the entry score is 5.5 to 7 units. We need not stratify by course (relapsing-remitting versus relapsing-progressive) but are less certain about progression from the onset. No run-in period is required to define “activity.” Minimum time for treatment is 3 years. We review the justification for our conclusions; modifications in sample size that are necessary if survival analysis is used; impact of the interferon-β trial (future trials will have an “active” control); and alternative strategies possible if magnetic resonance imaging serves as the primary outcome.

Ellison GW, Myers LW, Leake BD, Mickey MR, Ke D, Syndulko K, Tourtellotte WW, The Cyclosporine Multiple Sclerosis Study Group. Design strategies in multiple sclerosis clinical trials. Ann Neurol 1994;36:S108-S112

My colleagues and I have made estimates of the number of subjects needed for a phase III clinical trial for efficacy.
Because Kurtzke’s Disability Status Scale (DSS) and Extended Disability Status Scale (EDSS) scores are widely used in clinical trials, we have emphasized these scales. We have reached an operational definition of progression with the scales. We have also gained insight into the influence of clinical classification by type of multiple sclerosis (MS), course, and phase; the need for run-in periods; and the effects of different scores at entry into the trial upon the eventual outcome. Our recommendations derive from our study of the natural history of MS in 569 patients with 6,913 visits since 1971 [1]. Dr Lawrence Myers and I (G. W. E.) performed the vast majority of examinations (we enjoyed Dr Pierre Duquette’s help for 2 years). We also used information from a therapeutic trial on methylprednisolone and azathioprine [2] and from a cyclosporin-A trial [3].

To trace the evolution of our approach, we will take the standpoint of a biostatistician who is advising MS researchers about clinical trials.

We think we may soon run out of appropriate candidates for therapeutic trials just as the number of interesting agents is rapidly increasing. In our center alone, we are considering 86 different molecular sites for intervention as a treatment for MS. Since the regimen and safety of each agent must be tested in people with MS, 40 to 50 volunteers could be involved before we reach an efficacy trial. Recent full trials have recruited over 300 patients. Clearly, we must design our trials so that we enroll the minimum number of patients while still performing an exemplary efficacy trial (minimize our sample sizes). There are several ways to do this:
1. mathematically, using continuous variables;
2. using variables that are precise (they have a small standard deviation); and
3. using variables tightly grouped around their means over the duration of the treatment.
For example, we might take a score at entry and subtract or add or somehow manipulate the score at the end so that we have a paired measurement or a slope. Also, we think about unequal group sizes now that we have a treatment that is thought to be effective, and we might try to find a more common outcome, perhaps something in the laboratory [4] (magnetic resonance imaging?).

From the Departments of *Neurology and †Biomathematics, School of Medicine, University of California, Los Angeles, CA, and the ‡Neurology Service, Wadsworth Veterans Administration Medical Center, Los Angeles, CA. Address correspondence to Dr Ellison, 10833 Le Conte Ave, Los Angeles, CA 90024-1916. Copyright © 1994 by the American Neurological Association

As a statistician, how might I design MS clinical trials for a minimum number of patients? First, I would change the approach. Instead of taking the variables (clinical and laboratory measures) given to me by the neurologists and then deciding which statistical test seems appropriate, I would decide which statistical test(s) would be “optimal” and then work to fulfill all the assumptions upon which the test is predicated. I recognize that the statistical tests we might use to determine whether our results should be attributed to chance alone depend upon the type (discrete, categorical, nominal, ordinal, continuous, interval, ratio) and distribution (binomial, normal, Gaussian; parametric, nonparametric) of the data [5].

As investigators of MS, all of us generate discrete data when classifying patients as to type of MS (clinically definite, probable, possible, laboratory-supported definite). We also classify the course (relapsing-remitting, relapsing-progressive, progressive from onset) and deal with the phases (relapse, plateau, progression) [6]. We might analyze such data by calculating, comparing, and contrasting proportions; for example, the number of patients with relapses per group, or the number of patients with progression per group.
We might also count the total number, or the frequency, of events, such as the number of relapses per group. To decrease the sample size for a trial, I, as a statistician, prefer continuous interval data. Examples might be: time to relapse, time to progression, slope (change in the EDSS score per unit of time), or time to walk 25 or 50 m. John Kurtzke’s scales give ordinal data. That is, the scores are discrete data, but each is ordered by lesser or greater amounts of “disability” (neurological impairment) than the others. A change of score from 1 to 2 on the DSS or EDSS indicates the patient is worse; a change from 7.0 to 6.5 indicates the patient is better. I am aware that ordinal data analyzed with nonparametric statistical tests may be as powerful as or more powerful than parametric tests, but I want to aim for minimal sample sizes by using continuous interval data.

There is disagreement on when an ordinal variable becomes continuous. Actually, every datum is discrete. As our measurement techniques improve, values usually can be determined more accurately and precisely. But there is always uncertainty about the exact value because of measurement variation. How many data points along the continuum do we require to make ordinal data continuous? Approximately 6. Since the DSS has 9 usable steps and the EDSS has 19, we may be tempted to use the scores as continuous data. For statistical purposes, however, in addition to the scores reflecting an underlying continuum of impairment, we would like changes between the scores to be equidistant (interval). That is, we want the magnitude of the change between 1.0 and 2.0 to be the same as it is between 7.0 and 8.0. With the DSS or EDSS, we do not know that it is.

One might ask what difference the type of data makes. Assume that we are going to do a trial of a new agent named “HOORAY” from which we expect 50% improvement after 3 years’ treatment (Table 1). The probability that we will falsely attribute to HOORAY a result that is due to chance alone is going to be 0.05. We translate that to mean we would be 95% certain that HOORAY really works. Also, we want to make sure we do not miss an effective treatment and prematurely discard HOORAY. We want to say this result is correct 80% of the time, so we set the power to detect a real effect, if it is there, at 0.8.

Table 1. Discrete Versus Continuous Data in Determining Sample Sizes

Discrete data
  Proportion of patients worsening ≥ 1.0 EDSS steps
    P1 (placebo) = 0.5
    P2 (HOORAY) = 0.25
  Recommended sample size = 58
  Total sample size = 116
Continuous variable
  Each group’s average (standard deviation) in a timed walk:
    Placebo = 20 sec (± 30)
    HOORAY = 10 sec (± 15)
  Recommended sample size = 32
  Total sample size = 64

For the trial of a fictitious new agent named HOORAY, a 50% improvement is expected after 3 years of treatment; α = 0.05, power is set at 0.80.
EDSS = Extended Disability Status Scale score.

As shown in Table 1, if we use discrete data and expect that the proportion of patients worsening equal to or more than 1.0 EDSS steps in the placebo-treated group will be 0.5 (or 50%) and in the HOORAY-treated group only 0.25 (25%), we will have achieved the 50% improvement. The recommended sample size for each group would be 58 patients [4]. If we go against a placebo, we would need a total of 116 patients.

Now let us use a continuous variable to detect the 50% improvement: each group’s average for the time it takes to walk 50 yards. Placebo-treated patients would take 20 seconds. If HOORAY works, the patients go the distance in only 10 seconds, on average. The recommended number of subjects per treatment group will be 32, or a total sample size of 64.
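As a rough check on the discrete-data half of Table 1, the per-group sample size for comparing two proportions can be computed with the usual normal-approximation formula. The sketch below is not taken from the paper: the function names are hypothetical, a two-sided α is assumed, and the second function applies the continuity correction described in standard texts such as Fleiss [15] as an illustrative variant. The timed-walk example is not reproduced, because the text does not say which standard deviation entered that calculation.

```python
import math
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Uncorrected normal-approximation sample size per group.

    Assumes a two-sided test at level alpha and equal group sizes.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_a * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_b * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


def n_per_group_corrected(p1, p2, alpha=0.05, power=0.80):
    """Same calculation with a continuity correction (illustrative variant)."""
    n = n_per_group(p1, p2, alpha, power)
    return n / 4 * (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2


if __name__ == "__main__":
    # Placebo 0.5 versus HOORAY 0.25, as in Table 1.
    print(math.ceil(n_per_group(0.50, 0.25)))            # 58
    print(round(n_per_group_corrected(0.50, 0.25), 1))   # 65.4
```

Under these assumptions the uncorrected formula gives 58 per group, matching Table 1. The corrected value of about 65 for the same pair of proportions is close to the 3-year group size listed in Table 10, although the paper does not state which formula or adjustments produced that figure.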
Although these values are not adjusted for different effect sizes [4], placebo effect [7], or attrition, one can readily see that continuous data make quite a difference.

We would like to present data that we think make a run-in period to confirm progression unnecessary. For each patient followed at UCLA, we calculate a slope, the DSS score change in units per year. We define “worse” as an increase of more than 0.5 units in 1 year. Over 2 years, we would expect a worsening patient’s score to increase more than 1 DSS unit. In Table 2, note that of 288 patients followed for at least 2 years, 35% worsened [1]. In Table 3, of 172 patients followed for at least 4 years, 28% worsened [1]. There is not much difference. So we do not necessarily select for worsening by spending 6 months or a year following candidates to see if they really do deteriorate. Worsening in the past (activity) is no guarantee of change once a person enters a trial [2].

Table 2. Natural History Outcome of 288 Patients Followed for 2 Years(a)

DSS score   Units/year   %
Better      < -0.5       5
Same        ± 0.5        60
Worse       > 0.5        35

(a) These patients were from a study of 569 patients with 6,913 visits since 1971.
DSS = Disability Status Scale.

Table 3. Natural History Outcome of 172 Patients Followed for 4 Years(a)

DSS score   Units/year   %
Better      < -0.5       2
Same        ± 0.5        70
Worse       > 0.5        28

(a) These patients were from a study of 569 patients with 6,913 visits since 1971.
DSS = Disability Status Scale.

Table 4. Course of 91 Patients with Relapsing-Remitting Course Followed for 2 Years

DSS score   Units/year   %
Better      < -0.5       3
Same        ± 0.5        72
Worse       > 0.5        25

DSS = Disability Status Scale.

Table 5. Course of 87 Patients in Progression Phase Followed for 2 Years

DSS score   Units/year   %
Better      < -0.5       1
Same        ± 0.5        82
Worse       > 0.5        17

DSS = Disability Status Scale.
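The slope-based classification described here (better, same, worse by DSS units per year) can be sketched in a few lines. The paper does not say how the slope was fitted, so this example assumes the simplest choice, the change between the first and last visit divided by the elapsed years; the function names and the toy visit record are invented for illustration.

```python
def dss_slope(visits):
    """Change in DSS score per year between the first and last visit.

    visits: list of (years_since_entry, dss_score) in time order.
    Endpoint slope is an assumption; a regression slope is another option.
    """
    (t0, s0), (t1, s1) = visits[0], visits[-1]
    if t1 <= t0:
        raise ValueError("need at least two visits at different times")
    return (s1 - s0) / (t1 - t0)


def classify(slope, cutoff=0.5):
    """Apply the cutoffs used in the natural history tables."""
    if slope > cutoff:
        return "worse"
    if slope < -cutoff:
        return "better"
    return "same"


if __name__ == "__main__":
    toy_patient = [(0.0, 3), (1.0, 4), (2.0, 5)]  # invented record
    print(classify(dss_slope(toy_patient)))  # worse
```

A slope of exactly 0.5 units per year is classified as “same” here, following the wording “more than 0.5 units in 1 year.”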
When we are looking at progression, the type of clinical course or phase also does not predict subsequent worsening very well [8]. In Table 4, of 91 patients with a relapsing-remitting type of course followed for 2 years, 25% will worsen, 72% will remain the same, and 3% will get better. If we select patients because they are in “progression phase,” as in Table 5, we find 17% worse; not much difference. Type of course is not predictive of outcome 2 years hence. Therefore, we think the only entry criterion for a trial focused upon progression as defined by increases of the DSS or EDSS should be a diagnosis of multiple sclerosis.

There is no question that a patient’s DSS score at entry into the trial can have a drastic effect upon the outcome [9, 10]. Table 6 shows the mean, median (50th percentile), and 75th percentile number of years for patients with DSS scores from 1 to 7 at entry into the clinic to advance one or more steps and maintain the advance 90 days (sustained worsening) [11, 12]. For example, 50% of patients (15 of 30) entering the clinic with a DSS of 1 advanced to 2 or more in 1.7 years. If a patient enters with a DSS of 2 to 5, the median time to change is close to 2 years. For patients entering with DSS scores of 6, there was a wait of nearly 5 years before half worsened. If patients entered with a 7, it took 3.6 years. These results should be transferable to EDSS scores, since a 1-unit change in DSS is equivalent to a 1.0-unit (two 0.5 steps) change in EDSS.

Table 6. Time to Sustained Worsening

DSS score   n     Mean (SEM)   50% of sample (yr)   75% of sample (yr)
1           30    3.0 (1.5)    1.7                  4.4
2           55    4.0 (0.6)    2.2                  9.1
3           61    4.4 (0.7)    2.2                  11.5
4           27    2.7 (0.6)    2.1                  3.1
5           45    3.3 (0.5)    2.2                  7.7
6           129   5.1 (0.4)    4.6                  10.0
7           48    5.8 (1.0)    3.6                  8.6

DSS = Disability Status Scale; SEM = standard error of mean.

Consider the impact of these values on a therapeutic trial. If the trial lasts 2 years and we enter many patients at EDSS 6.0 or greater, we would not expect much change in either the control or experimental treatment group.
If we conduct an open trial (without placebo-treated controls) with patients with EDSS scores equal to or greater than 6.0, and our goal is stabilization of their course for 2 years, we are quite likely to think the intervention succeeded. Unless more patients than expected worsen, suggesting that the intervention is harmful, we could mistakenly attribute the stabilization to the treatment. All we have seen is the natural history of relatively slow worsening in patients entering the trial with DSS scores of 6 or greater.

Let us pause for a moment and consider how to define progression. With the increasing use of survival analysis, we need a circumscribed event (like death) that indicates treatment failure. For example, we looked at changes equal to or greater than 1 in the DSS score over 3 years in patients entering with DSS scores of 1 to 6. In Table 7 are the percentages of patients changing (the original sample sizes were more than 100 patients) [12]. Simple worsening means the patients worsened 1 or more DSS units. Sixty-three percent of the patients changed, but they returned to their baseline prechange score within 3 months one-third of the time. If we demanded they maintain the change for 3 months, 50% worsened and 18% returned to their baseline score (sustained worsening). If we required the change be maintained for 6 months, 44% changed and 11% returned to baseline.

Table 7. Types of Worsening(a)

Duration of change   Changed (%)   Returned to prechange score (%)
Simple(b)            63            33
3 months             50            18
6 months             44            11

(a) Data are for patients with entry Disability Status Scale scores of 1 to 6, followed for 3 years.
(b) From one clinic evaluation to the next, the Disability Status Scale score increased by 1 or more.

Seven to thirteen percent of the time, we misclassify a stationary patient as worse by an increase in the DSS score [13].
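The distinction between simple and sustained worsening can be made concrete with a small routine that scans a patient's visit record. This is only one possible operationalization and is an assumption, not the authors' algorithm: a worsening counts as sustained when every later visit up to and including the first one at least 90 days afterward still shows the score at or above baseline plus the required change. The function name and the visit data are hypothetical.

```python
def first_sustained_worsening(visits, baseline, delta=1.0, hold_days=90):
    """Return the day on which a sustained worsening began, or None.

    visits: list of (day, score) in time order.
    baseline: entry score; delta: required increase (1.0 DSS unit here).
    """
    target = baseline + delta
    for i, (day, score) in enumerate(visits):
        if score < target:
            continue
        # Look forward: the score must stay at or above target until a
        # visit at least hold_days later confirms it.
        for later_day, later_score in visits[i + 1:]:
            if later_score < target:
                break  # returned toward baseline: not sustained
            if later_day - day >= hold_days:
                return day
    return None


if __name__ == "__main__":
    # Invented record: a transient rise at day 90, then a maintained rise.
    record = [(0, 3.0), (90, 4.0), (180, 3.0), (270, 4.0), (360, 4.0), (450, 4.5)]
    print(first_sustained_worsening(record, baseline=3.0))  # 270
```

In this toy record the rise at day 90 would count as simple worsening but not as sustained worsening, because the score is back at baseline by day 180.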
Misclassification because of improvement or worsening of the DSS score in a patient thought to be clinically stable occurs 22 to 28% of the time [13]. We chose 3 months (90 days) or more as the time the change in DSS score would have to be maintained to qualify the change as sustained worsening for declaring treatment failure with survival analysis.

In survival analysis, the Kaplan-Meier technique takes into account random attrition from a study population and may give a more accurate estimate of the probability that a patient will worsen. In Table 8, we present the percentages of patients with worsening of 1 or more DSS steps sustained for more than 90 days within 1-, 2-, or 3-year follow-up if the entry DSS was 3 to 6 [10]. At 1 year, 24% will worsen; at 2 years, 36%; and at 3 years, 50%. These results are also influenced by the patients’ entry DSS scores (Table 9) [11, 12].

Table 8. Kaplan-Meier Estimates for Worsening of 262 Patients(a)

Follow-up   Patients worsening (%)
1 year      24
2 years     36
3 years     50

(a) Data are for patients with entry Disability Status Scale scores of 3 to 6 who sustained worsening of 1 or more steps for more than 90 days.

Table 9. Kaplan-Meier Estimates for Worsening of Patients with Different Extended Disability Status Scale Scores at Entry

EDSS score   Units changed   Patients worsening (%)
                             1 year   2 years
3-5.0        1.0             24       35
5.5-7.0      0.5             30       44

EDSS = Extended Disability Status Scale.

Table 10. Sample-Size Estimates Using Proportions Worsening(a)

Trial duration   Size of group
2 years          91
3 years          65

(a) Power, 80% chance to detect reduction; α = 0.05. A 50% reduction in rate of worsening for parallel groups.

With the recent emphasis on early treatment of patients in a relapsing-remitting course who have EDSS scores less than 5.5, many of our patients with EDSS scores of 6.0 or greater feel left out and are anxious to join a therapeutic trial. If we demand a 1.0-unit worsening in the EDSS for “treatment failure,” the latter group of patients would probably be excluded from any trial lasting less than 5 years.
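For readers unfamiliar with how the Kaplan-Meier technique handles attrition, the product-limit estimate can be sketched directly. The code below is a minimal, generic illustration, not the authors' analysis: the function name is hypothetical and the five follow-up records are invented. An event is a sustained worsening, and a censored record is a patient who left follow-up without one.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the probability of NOT yet having worsened.

    times: follow-up in years; events: 1 = worsening observed, 0 = censored.
    Returns [(t, S(t))] at each time where at least one event occurred.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            surv *= 1 - n_events / at_risk
            curve.append((t, surv))
        at_risk -= n_leaving  # censored patients leave the risk set too
    return curve


if __name__ == "__main__":
    toy_times = [1, 2, 2, 3, 4]    # invented data
    toy_events = [1, 1, 0, 1, 0]
    for t, s in kaplan_meier(toy_times, toy_events):
        print(t, round(100 * (1 - s)))  # cumulative % worsened: 20, 40, 70
```

The cumulative percentage worsened at each time is 100 × (1 − S(t)), which is the quantity the worsening tables report at 1, 2, and 3 years.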
We thought a 0.5-unit increase in the EDSS score if the entry score was 5.5 to 7.0 might indicate the same worsening as a 1.0 increase if the entry score was equal to or less than 5.0. In patients randomized into the cyclosporin-A placebo-treated group who entered with an EDSS score between 3.0 and 5.0, 24% worsened by 1.0 unit (two 0.5 steps) in 1 year, 35% in 2 years. If we required the patients who entered with scores of 5.5 to 7 to increase by only 0.5 unit maintained for 3 months, 30% worsened within 1 year, and 44% within 2 years [14].

With the above information, we can design an efficacy trial using estimates based on proportions (Table 10). If we set power at 0.8 and do not make a false-positive error more than 5 times out of 100 (α = 0.05), we could detect a 50% reduction in treatment failures in a placebo-controlled parallel group trial by entering 91 patients per group for 2 years and 65 patients per group for 3 years [15]. Double those numbers for your total sample size if you have 2 groups.

Table 11. Characteristics of Active-Control Equivalence Studies

Interferon-β decreases relapse frequency compared to placebo
Treatment “X” will be compared to interferon-β
Treatment “X” will be compared to the placebo indirectly (“historical control assumption”)
May limit design options
Solution is to report confidence intervals
Increase power to 0.95
If interferon-β is “standard of care,” must include it in clinical trials of new agents

With the successes of interferon-β for reducing relapse frequency and severity, and of high-dose adrenal steroids for decreasing relapse severity, and with the hope that copolymer-1 and oral myelin will be efficacious, combination treatments are frequently mentioned. Our understanding of United States law is that in a trial of combination therapies, one must show efficacy for each agent alone as well as for the combination. Efficacy trials will increase in size and complexity.
Be sure to check with the Food and Drug Administration on current requirements early in your trial design.

Now that interferon-β (Betaseron, Berlex Laboratories) has been licensed for the exacerbating-remitting (relapsing-remitting) type of MS, we will have to show that our new treatment “X” is equivalent to or better than interferon-β, which has become the “active control.” Must we, or dare we, also include a placebo-treated group in future trials? Probably not. As expressed in Table 11, we will make an “historical control assumption” that the real comparison is with the placebo-treated group in the original interferon-β trial [16]. This assumption may limit our trial design options. One way out of this dilemma is to carefully compare confidence intervals of the results [17]. We could increase the power [17]. We started with a power of 0.8 or 80%. If we increase the power to 0.9 or 0.95, then we can be more certain that the two drugs are equivalent. There are consequences: If we make that leap to 0.95 power, our sample size per arm increases to 372 patients, or 744 patients per trial! Dr David Camenga pointed out that if interferon-β is the “standard of care” for MS, it must be included in all trials of new agents for exacerbating-remitting MS.

In conclusion, we wanted to justify our recommended values for sample sizes that were given in the abstract and to consider several experimental designs for efficacy trials in MS. Whichever design we use, we must minimize the sample size in future trials.

This work was supported by United States Public Health Service grants NS-16776 and NS-08711, the Conrad N. Hilton Foundation, the Sandoz Pharmaceutical Corporation, and various donors.

References
1. Ellison GW, Myers LW, Mickey MR, et al. The variable course of multiple sclerosis. Neurology 1989;39:357 (Abstract)
2. Ellison GW, Myers LW, Mickey MR, et al. A placebo-controlled, randomized, double-masked, variable dosage, clinical trial of azathioprine with and without methylprednisolone in multiple sclerosis. Neurology 1989;39:1018-1026
3. The Multiple Sclerosis Study Group. Efficacy and toxicity of cyclosporine in chronic progressive multiple sclerosis: a randomized, double-blind, placebo-controlled clinical trial. Ann Neurol 1990;27:591-605
4. Browner WS, Black D, Newman TB, et al. Estimating sample size and power. In: Hulley SB, Cummings SR, eds. Designing clinical research. Baltimore: Williams & Wilkins, 1988:146-149
5. Afifi AA, Clark V. Computer-aided multivariate analysis. New York: Van Nostrand Reinhold Company, 1984:12-78
6. Ellison GW, Myers LW. Taxonomy and multiple sclerosis. In: Bauer HJ, Poser S, Ritter G, eds. Progress in multiple sclerosis research. Berlin: Springer-Verlag, 1980:629-631
7. Myers LW, Ellison GW, Leake BD, et al. Placebo effect in multiple sclerosis (MS). Can J Neurol Sci 1993;20:S158 (Abstract)
8. Myers LW, Ellison GW, Leake BD. Progression phase of multiple sclerosis not a useful entry criterion for therapeutic trials. Ann Neurol 1993;34:312 (Abstract)
9. Ellison GW, Myers LW, Leake BD. Disability Status Scale influence on rate of worsening of multiple sclerosis patients. Ann Neurol 1992;32:259 (Abstract)
10. Myers LW, Ellison GW, Leake BD. Sample size estimates for therapeutic trials for multiple sclerosis. Ann Neurol 1992;32:258 (Abstract)
11. Myers LW, Leake BD, Ellison GW. Use of survival analysis to describe the course of multiple sclerosis. Ann Neurol 1992;32:259 (Abstract)
12. Ellison GW, Myers LW, Leake BD. Defining progression for multiple sclerosis (MS) therapeutic trials. Can J Neurol Sci 1993;20:S130 (Abstract)
13. Myers LW, Ellison GW, Leake BD. Reliability of the Disability Scale (DSS). Neurology 1993;43:A204 (Abstract)
14. Ellison GW, Myers LW, Leake BD, et al. Revised recommendations for therapeutic trials for multiple sclerosis (MS). Ann Neurol 1993;34:312 (Abstract)
15. Fleiss JL. Statistical methods for rates and proportions, ed 2. New York: Wiley & Sons, 1981:264-268
16. Makuch RW, Pledger G, Hall DB, et al. Active control equivalence studies. In: Peace KE, ed. Statistical issues in drug research and development. New York: Marcel Dekker, 1990:225-262
17. Makuch R, Simon R. Sample size requirements for evaluating a conservative therapy. Cancer Treat Rep 1978;62:1037-1040

Annals of Neurology Supplement to Volume 36, 1994
