ARTHRITIS & RHEUMATISM
Vol. 50, No. 3, March 2004, pp 699–706
DOI 10.1002/art.20204
© 2004, American College of Rheumatology

SPECIAL ARTICLE

Radiographic Progression Depicted by Probability Plots
Presenting Data With Optimal Use of Individual Values

Robert Landewé and Désirée van der Heijde

Robert Landewé, MD, PhD, Désirée van der Heijde, MD, PhD: University Hospital Maastricht and Research Institute Caphri, University of Maastricht, Maastricht, The Netherlands.
Address correspondence and reprint requests to Robert Landewé, MD, PhD, Department of Internal Medicine/Rheumatology, University Hospital Maastricht, PO Box 5800, 6202 AZ Maastricht, The Netherlands.
Submitted for publication April 18, 2003; accepted in revised form November 20, 2003.

Introduction

Radiographic damage is inherent to inflammatory rheumatic diseases, such as rheumatoid arthritis (RA) and ankylosing spondylitis (AS). Structural damage evolves slowly over a long period of time, but with marked interindividual variation (1,2). Radiographic progression has become increasingly important in evaluating the efficacy of disease-modifying antirheumatic drugs (DMARDs) and, more recently, biologic agents, in the treatment of RA. Since biologic agents have been shown to be effective in AS as well, it is to be expected that radiographic progression will become an important outcome for evaluating the potential of these drugs to prevent structural damage in AS.

Various scoring systems have been developed for assessment of both RA and AS. Examples are the Sharp score (with modifications) and the Larsen score (with modifications) for evaluating progression in RA (3,4), and the Bath Ankylosing Spondylitis Radiology Index and the Stoke Ankylosing Spondylitis Spine Score (SASSS) (with modifications) for evaluating progression in AS (5–7). Sets of radiographs (hands and feet for RA, and pelvis and lumbar and cervical spine for AS) obtained at regular time intervals are scored, and the sum score per patient reflects total damage at a time point. The within-patient difference occurring between 2 or more observations is considered to be the individual change (progression) score.

A number of difficulties limit the interpretability of radiographic scores in clinical studies. The first relates to the way in which the data are descriptively presented to the medical readership. Both in RA and in AS, radiographic progression scores are not normally distributed. Only a small fraction of all patients show substantial progression of damage, and the majority show no progression at all. One of the assumptions underlying the use of means and standard deviations (descriptive parametric statistics with which most clinicians are familiar) is a normal distribution of the data. Therefore, progression scores should not be presented (only) as mean scores with standard deviations. Means and standard deviations calculated for a set of radiographic scores are extremely sensitive to subtle changes at the upper extreme, as demonstrated by us previously (8). A better way of presenting radiographic data is by medians (the value cutting off the 50th percentile) and 25th and 75th percentiles, or by box-and-whisker plots that in addition present the 5th and 95th percentiles as well as the extreme values. Some investigators present logarithmically transformed data, which may result in a data set with a normal distribution, but these are even more difficult to interpret.

The most important disadvantage of presenting data as percentiles in comparison with means and standard deviations is that percentiles only relate to 1 observation in the distribution (e.g., the median observation) and neglect the majority of the variable's values. Means and standard deviations are inferential statistics that include all of the variable's values and describe the internal coherence of the data. Since the presentation of percentiles does not allow a proper judgment of the coherence of the data, it may easily conceal irregularities in the frequency distribution of radiographic scores. This may become important if cutoff levels for clinically important progression scores are chosen: a small change in the selected cutoff level may have a major effect on the results. Standard presentation of data (with percentiles only) that are not normally distributed thus gives rise to a significant loss of information as compared with presentation of data derived from a normal distribution by means and standard deviations. Therefore, there is a consensus that at minimum, presentation of radiographic data should include both the mean and standard deviation and the median and interquartile range (9).

Another problem is measurement error, the phenomenon whereby different observers score the same radiographs differently, or 1 observer who scores the same radiographs twice arrives at different scores. Measurement error is inherent to scoring radiographic progression, because typical features of damage, such as erosions and joint space narrowing in RA, and erosions, squaring, and sclerosis in AS, are often subtle and prone to subjective interpretation (interobserver error). Moreover, positioning and quality of consecutive radiographs are almost never identical. In the ideal situation of a randomized controlled trial (RCT), the issue of measurement error is of minor importance because the subject of interest is treatment effect, and when treatment groups are compared, measurement error is equally divided across these groups as a consequence of randomization and blinding of readings. In uncontrolled observational studies or in analyses within 1 treatment group in a comparative trial, however, measurement error can become crucial.
In order to gauge part of measurement error, there is some consensus that radiographs in RA clinical trials should be scored by at least 2 readers, and that the average score obtained by all readers should be used in analyses (9). There is no definite consensus regarding whether the readers should be aware of the time order of the radiographs. Scoring of radiographs with known time order increases sensitivity to change because it encourages the readers to increase scores in individual patients over time, but it neglects part of measurement error and may therefore overestimate "true progression" (10). Scoring with concealed time order better reflects measurement error, but the true signal (progression) may easily become lost in the noise of error, and intuitively, false-negative progression scores can occur (10). Interobserver measurement error (which is only part of the error) can be demonstrated by Bland and Altman plots, but these plots are difficult for an untrained audience to understand and are therefore often not published in medical journals (11).

Herein we introduce cumulative probability plots as a means of presenting radiographic progression scores that addresses the interpretation problems outlined above. A cumulative probability plot is a visual presentation of all observed data, obtained by plotting the observed cumulative proportion (scores ranked from the lowest through the highest values, and presented as a cumulative proportion of all scores) against the variable's actual value. Unlike descriptive summary statistics, cumulative probability plots include all individual data and enable visualization of the internal coherence of the data. Probability plots can be used to help the reader of a given report make a better-informed judgment about radiographic progression in the patients studied.

Probability plots

Cumulative probability plots of radiographic change scores.
The COBRA (Combinatietherapie Bij Reumatoïde Arthritis) trial was a 1-year randomized clinical trial that compared the effects of a treatment strategy with combination therapy (prednisolone, methotrexate, and sulfasalazine) versus monotherapy (sulfasalazine only) in 135 patients with RA (12). Table 1 summarizes all observed radiographic progression scores from the COBRA trial, which were defined as the difference between the total damage score (van der Heijde–modified Sharp score) (10) at the end of the trial and that at the start of the trial. Data were summarized by change score in 3 different ways: 1) number of patients with a particular score; 2) cumulative number of patients with a score less than or equal to that particular score (cumulative frequency); and 3) cumulative percentage of patients with a score less than or equal to that particular score (cumulative probability). Cumulative probability is the cumulative frequency expressed as a percentage of the total number of patients.

Table 1. Frequency distribution of radiographic progression scores in 135 patients who participated in the COBRA trial*

Progression   No. of     Cumulative   Cumulative
score         patients   frequency    probability, %
0             22         22           16.3
1             14         36           26.7
2             7          43           31.9
3             14         57           42.2
4             12         69           51.1
5             6          75           55.6
6             6          81           60.0
7             4          85           63.0
8             2          87           64.4
9             1          88           65.2
10            3          91           67.4
11            1          92           68.1
12            2          94           69.6
13            3          97           71.9
14            3          100          74.1
15            4          104          77.0
16            3          107          79.3
17            6          113          83.7
18            1          114          84.4
19            1          115          85.2
20            1          116          85.9
21            1          117          86.7
22            5          122          90.4
23            1          123          91.1
24            0          123          91.1
25            0          123          91.1
26            1          124          91.9
27            1          125          92.6
28            1          126          93.3
29            0          126          93.3
30            2          128          94.8
>30           7          135          100

* Cumulative frequency and cumulative probability refer to patients with this score or below.

Figure 1. Individual progression scores of 135 rheumatoid arthritis patients who participated in the COBRA trial. Data are presented by histogram (A), cumulative probability plot (B), and dot plot (C). See text for additional explanation.
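The three summaries in Table 1 follow mechanically from the first column of counts. The short Python sketch below recomputes the cumulative frequency and cumulative probability from the published counts; the variable names are ours and the code is only an illustration of the arithmetic, not part of the original analysis.

```python
from itertools import accumulate

# Number of patients per progression score, copied from Table 1.
# Positions 0..30 are the progression scores; the last entry is the ">30" bin.
counts = [22, 14, 7, 14, 12, 6, 6, 4, 2, 1, 3, 1, 2, 3, 3, 4,
          3, 6, 1, 1, 1, 1, 5, 1, 0, 0, 1, 1, 1, 0, 2, 7]
labels = [str(score) for score in range(31)] + [">30"]

n_patients = sum(counts)

# Cumulative frequency: patients with this score or below.
cum_freq = list(accumulate(counts))

# Cumulative probability: cumulative frequency as a percentage of all patients.
cum_prob = [round(100 * c / n_patients, 1) for c in cum_freq]

for label, k, cf, cp in zip(labels, counts, cum_freq, cum_prob):
    print(f"{label:>4} {k:>3} {cf:>4} {cp:>6.1f}")
```

Each patient adds the same increment to the last column, which is why a score shared by many patients (for example, 0) produces a large step and a score observed once produces a small one.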
Note that every patient contributes an equal part (1/135 = 0.0074, or 0.74%) to the cumulative probability.

If data such as those shown in Table 1 are plotted in a graph, a bar chart such as that shown in Figure 1A can be created. Bar charts are useful to provide an impression about the type of distribution of the data, e.g., to determine whether the data are normally distributed (bell-shaped curve). The pattern in Figure 1A is a typical example of a set of radiographic progression scores: the change scores with the highest frequencies lie to the left as compared with the scores of the normal distribution, and a long tail extends toward high scores. Such a distribution is positively skewed (skewed to the right in conventional statistical terminology, even though the bulk of the scores lies to the left).

If all separate cumulative probability values (x-axis) are plotted against all separate scores (n = 135) (y-axis), a probability plot is created (Figure 1B). It should be noted that it does not matter which of the 2 variables is plotted on the x-axis and which on the y-axis; both types of probability plots can be found in the literature. Every single change score (1 score per patient) is now plotted on the graph and represents a similar proportion of the cumulative probability (0.74% in the case of the COBRA trial's 135 patients), so the density of dots is similar along the entire range of the x-axis. The curve is a typical example of a radiographic score distribution in which the radiographs were scored with knowledge of the time order. The lowest possible scores are 0 (truncated to 0), and it is obvious from the figure that the scores in the lower range (<10 Sharp units) occur far more frequently than those in the higher range. It is easy to see what proportion of patients has a change score of 0. High and very high scores occur sporadically and contribute only very minimally to the cumulative frequency, but they importantly determine the curvature of the graph, as well as the mean and the standard deviation.
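The construction of the plot, and the contrast between percentiles and the mean, can be sketched in a few lines of Python. The counts are those of Table 1, but the table does not give the individual values of the 7 patients in the ">30" bin, so the sketch assigns them stand-in values (31 or 60); these values, and all function names, are assumptions made purely for illustration.

```python
from statistics import mean

# Counts per progression score from Table 1 (0..30, then the ">30" bin).
counts = [22, 14, 7, 14, 12, 6, 6, 4, 2, 1, 3, 1, 2, 3, 3, 4,
          3, 6, 1, 1, 1, 1, 5, 1, 0, 0, 1, 1, 1, 0, 2, 7]


def expand(counts, top_bin_value):
    """Return one score per patient; '>30' patients get a stand-in value."""
    scores = []
    for score, k in enumerate(counts[:-1]):
        scores.extend([score] * k)
    scores.extend([top_bin_value] * counts[-1])
    return sorted(scores)


def plot_points(scores):
    """(cumulative probability in %, score) per patient, lowest score first."""
    n = len(scores)
    return [(100 * rank / n, s) for rank, s in enumerate(sorted(scores), start=1)]


def percentile_from_plot(points, p):
    """Score at which the curve first reaches cumulative probability p (%)."""
    for cum_prob, score in points:
        if cum_prob >= p:
            return score
    raise ValueError("p must not exceed 100")


# Two hypothetical fillings of the ">30" bin (not trial data).
low = expand(counts, top_bin_value=31)
high = expand(counts, top_bin_value=60)

for name, scores in (("top bin = 31", low), ("top bin = 60", high)):
    points = plot_points(scores)
    quartiles = [percentile_from_plot(points, p) for p in (25, 50, 75)]
    print(name, "P25/P50/P75 =", quartiles, "mean =", round(mean(scores), 1))
```

Under either assumption the 25th, 50th, and 75th percentiles read from the curve are 1, 4, and 15 Sharp units, whereas the mean shifts with the values assigned to the few highest scores, which is the behavior described in the text.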
The median and the 25th and 75th percentiles can easily be derived from the probability plot by drawing a straight vertical line from the corresponding percentile on the x-axis to the curve (Figure 1B). The matching progression scores can be read from the y-axis. Figure 1C shows a dot plot of the same radiographic data. Dot plots also include all separate scores. Probability plots as well as dot plots allow interpretation of the coherence of the data (irregularities, "jumps"), but it is impossible to directly interpret percentiles from a dot plot.

Cumulative probability plots and measurement error. The distributions of radiographic scores obtained from studies in which radiographs are read with known time order and from those in which they are read with concealed time order differ substantially. Medians and percentiles do not easily reflect these differences. Assuming that radiographs are scored with concealed time order and there is no "true progression" in the patients, the "true change score" would be 0 and every deviation from a change score of 0 is by definition considered to be random. Thus, this reflects random measurement error, which can be either negative or positive. In patients with "true progression," measurement error may also be operative, but the signal will exceed the noise in some patients and not in others.

Figure 2. Cumulative probability plots of individual 2-year radiographic progression scores (in modified Stoke Ankylosing Spondylitis Spine Score [mod. SASSS] units) in 109 ankylosing spondylitis patients from the OASIS cohort. Each patient was scored twice by the same reader: once with concealed time order (circles) and once with open time order (triangles). (A circle and a triangle with similar cumulative probability do not necessarily represent the same patient.)

Figure 3. Cumulative probability plot (A) and Bland and Altman plot (B) of individual 2-year radiographic progression scores (in modified Stoke Ankylosing Spondylitis Spine Score [mod. SASSS] units) in 109 ankylosing spondylitis patients from the OASIS cohort. Each patient was scored twice by 2 different readers, both of whom read the radiographs with concealed time order. (Circles and triangles in A represent reader 1 and reader 2, respectively; a circle and a triangle with similar cumulative probability do not necessarily represent the same patient. Each circle in B refers to the same patient scored by 2 readers, but 1 symbol may comprise more than 1 patient.) Arrow indicates an example of how actual progression scores cannot be easily and directly depicted in a Bland and Altman plot: the mean score of +5 and the difference score of -4 were derived from actual scores of +3 by reader 1 and +7 by reader 2.

Cumulative probability plots reveal at a glance the differences between readings with open and those with concealed time order. Figure 2 shows the probability plots from 2 different readings in 109 patients with AS who were part of the OASIS (Outcome Assessments in Ankylosing Spondylitis International Study) cohort (13): 1 reader scored with open time order and the other scored with concealed time order. The scorings reflect the change between the assessment at time 0 and that at 2 years. The most striking feature of the concealed time order data (Figure 2) is the occurrence of negative scores, which do not occur in the open time order data. Because these scorings incorporate a high number of scores of 0, this feature is not reflected appropriately by presenting the median values and 25th and 75th percentiles, which are similar with open time order and concealed time order readings.
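A small numerical sketch may make the last point concrete. The two sets of change scores below are invented for illustration (they are not OASIS data) and are constructed so that both contain many scores of 0; the helper functions are likewise our own. The sketch shows how two readings can share the same median and quartiles while only one of them contains negative scores.

```python
def percentile(scores, p):
    """Smallest score whose cumulative probability (%) is at least p."""
    ordered = sorted(scores)
    n = len(ordered)
    for rank, score in enumerate(ordered, start=1):
        if 100 * rank / n >= p:
            return score
    raise ValueError("p must not exceed 100")


def summarize(scores):
    """Quartiles plus the fraction of negative change scores."""
    return {
        "P25": percentile(scores, 25),
        "median": percentile(scores, 50),
        "P75": percentile(scores, 75),
        "negative_fraction": sum(s < 0 for s in scores) / len(scores),
    }


# Hypothetical change scores for 20 patients, invented for illustration only.
open_order = [0] * 12 + [1, 1, 2, 2, 3, 4, 6, 9]
concealed_order = [-4, -2, -1, -1] + [0] * 8 + [1, 1, 2, 2, 3, 4, 5, 7]

print("open:     ", summarize(open_order))
print("concealed:", summarize(concealed_order))
```

In this constructed example the quartiles are 0, 0, and 2 for both readings, yet one fifth of the concealed-order scores are negative; only a presentation of all individual values makes that difference visible.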
Another typical feature that is seen repeatedly in plots that compare scorings with open and concealed time order is that the curve from the open time order readings lies to the left of and above the curve from the concealed time order readings and shows somewhat higher scores. The best explanation for the phenomenon that reading with open time order shifts scores toward higher values is that readers anticipate progression and score accordingly, whereas they are likely to be more conservative if they do not know the true time order, especially in radiographs with minor changes.

Because of measurement error, radiographs are usually read by 2 or more readers, as noted above. Cumulative probability plots can be used to visually depict interreader variability and to explore trends. Figure 3A shows the probability plot of change scores obtained by 2 independent readers who scored the same sets of radiographs of AS patients from the OASIS cohort (2-year progression scores) according to the modified SASSS. It is obvious at a glance that reader 1 assigned scores that were somewhat higher than those assigned by reader 2. Reader 1 saw some progression in a greater proportion of patients than did reader 2 (was more sensitive to change), but assigned negative scores in a smaller proportion than did reader 2 (sensitivity to change was not at the cost of specificity here). As compared with reader 2, the entire curve of scores from reader 1 is to the left.

As mentioned above, Bland and Altman plots can be used to assess agreement between readers. These plots present the difference in progression scores between 2 readers (on the y-axis) against the average of the progression scores assigned by the readers (on the x-axis). Figure 3B displays the same data as Figure 3A, but in the format of a Bland and Altman plot.
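The coordinates of a Bland and Altman plot are simple to compute, as the following Python sketch shows. It assumes the difference is taken as reader 1 minus reader 2, the convention that reproduces the arrowed point in Figure 3B (+3 and +7 giving a mean of +5 and a difference of -4). The function names and the second data set are hypothetical.

```python
from statistics import mean


def bland_altman_points(reader1, reader2):
    """Per patient: (mean of the two readers, reader 1 minus reader 2)."""
    if len(reader1) != len(reader2):
        raise ValueError("both readers must score the same patients")
    return [((a + b) / 2, a - b) for a, b in zip(reader1, reader2)]


def recover_scores(mean_score, difference):
    """Invert one plotted point back to the two readers' actual scores."""
    return mean_score + difference / 2, mean_score - difference / 2


# The arrowed example from Figure 3B: +3 by reader 1 and +7 by reader 2.
print(bland_altman_points([3], [7]))   # [(5.0, -4)]
print(recover_scores(5, -4))           # (3.0, 7.0)

# Hypothetical progression scores, invented for illustration only.
reader1 = [3, 0, 2, 5, -1, 0]
reader2 = [7, 0, 1, 5, 0, 2]
points = bland_altman_points(reader1, reader2)

# The mean difference corresponds to the horizontal reference line in the plot.
bias = mean(diff for _, diff in points)
print(points, "mean difference =", bias)
```

The `recover_scores` step is the extra inference a reader has to perform to get from a plotted point back to the actual progression scores, which a probability plot displays directly.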
Again, it is obvious that scores assigned by reader 1 were a little higher (represented by a mean negative difference between the readers; dotted line), but it is difficult to deduce additional information from this plot.

What are the differences between probability plots and Bland and Altman plots? First, the actual progression scores can be easily and directly depicted in the probability plot. Additional inference is needed in order to obtain this information from a Bland and Altman plot. An example is the dot designated with an arrow in Figure 3B: the mean score of +5 and the difference score of -4 derive from an actual score of +3 by reader 1 and of +7 by reader 2. Second, in probability plots, unlike Bland and Altman plots, the scores by 2 readers for a particular value on the x-axis do not necessarily represent the same patient. Third, probability plots can simultaneously plot the scores by more than 2 readers, which is not possible with Bland and Altman plots. An advantage of probability plots is that they are appropriate for investigating the coherence of the data in the group, with presentation of the actual progression scores. It should be noted, however, that probability plots are not appropriate to quantify measurement error, which can be done by using the data from Bland and Altman plots. Therefore, the 2 types of plots give complementary information, and which of them to use, or whether to use both, depends on the data and the study question.

Cumulative probability plots and clinical trials. A third application area for cumulative probability plots is the RCT. Radiographic progression is a pivotal outcome measure of many RCTs in RA and may become a key outcome measure in RCTs in AS. Probability plots can be used to visually compare the distributions of results in 2 (or more) treatment arms. Figure 4 shows the probability plots for the 2 treatment arms of the COBRA trial.
Figure 4. Cumulative probability plots of individual 1-year radiographic progression scores in 135 rheumatoid arthritis patients who participated in the COBRA trial (67 patients in the monotherapy group [circles] and 68 patients in the combination therapy group [triangles]). Cumulative probability was calculated per group.

COBRA combination therapy was shown to be significantly better than sulfasalazine monotherapy in slowing 1-year progression, as well as 5-year progression, of radiographic damage (14). The plots immediately show that the treatment groups differed with respect to radiographic progression. In the COBRA trial the curve representing the combination therapy group lies closer to the x-axis than that representing the monotherapy group, along the entire range of change scores except for those close to or equal to 0. The latter represents the "bottom" effect inherent to distributions that are truncated to 0. It is also obvious that the distribution for the monotherapy group includes higher absolute change scores as compared with the distribution for the combination therapy group. Finally, the cumulative probability curves are not entirely "smooth," and the space between the 2 curves, which is an indication of the treatment contrast, varies along the axis of cumulative probability. This irregularity is important if one realizes that binomial cutoff levels for radiographic progression are often (understandably) used to describe the magnitude of the treatment effect. The probability curves demonstrate that the choice of the cutoff level is relevant with regard to the magnitude of the treatment contrast.
For example, if a cutoff level of 0 Sharp units is selected (every patient with a score >0 is considered to have progression), there is progression in 80% of the patients in the combination group compared with 87% in the monotherapy group, resulting in a between-group contrast of only 7%. The choice of a cutoff level of 5 Sharp units, in contrast, would adjudicate progression to 31% and 58% in the combination group and monotherapy group, respectively, with a treatment contrast of 27%. As a consequence, an optimal cutoff level (i.e., one that provides the highest contrast) can easily be constructed by the investigator, as we have shown previously (8), but this can also easily be detected by viewing probability plots.

Discussion

The typical way of presenting radiographic change scores, by descriptive statistics such as medians and percentiles combined with means and standard deviations, gives rise to a loss of potentially relevant information. Probability plots can be used to visualize the phenomenon of measurement error or to explore differences in treatment outcome in clinical trials, and may provide much more information about the course of radiographic progression.

Arguably the most important advantage of probability plots over conventional means of data presentation is that probability plots, unlike percentiles or box-and-whisker plots, clarify whether there is coherence of the data. Such coherence may add to the credibility of a group result compared with presenting it only as a median. Technical details, such as concealment of reading order and the subsidiary occurrence of negative scores, which may decisively influence the interpretation of results, can be easily visualized with probability plots, whereas this information can be easily missed if results are presented as medians and 25th/75th percentiles.
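The dependence of the treatment contrast on the chosen cutoff, described above for the COBRA trial, can be explored with a few lines of Python. The per-arm individual scores of the trial are not listed in this article, so the two groups below are hypothetical and deliberately small; the function names are our own. The sketch only illustrates the calculation and is not a reanalysis of the trial.

```python
def progression_rate(scores, cutoff):
    """Percentage of patients whose change score exceeds the cutoff."""
    return 100 * sum(s > cutoff for s in scores) / len(scores)


def contrast_by_cutoff(treated, control, cutoffs):
    """Between-group difference in progression rate (control minus treated)."""
    return {
        c: progression_rate(control, c) - progression_rate(treated, c)
        for c in cutoffs
    }


# Hypothetical change scores for 10 patients per arm (not COBRA data).
combination = [0, 0, 1, 1, 2, 2, 3, 4, 6, 9]
monotherapy = [0, 1, 2, 3, 5, 6, 8, 11, 15, 24]

contrasts = contrast_by_cutoff(combination, monotherapy, cutoffs=range(0, 11))
for cutoff, contrast in contrasts.items():
    print(f"cutoff {cutoff:>2}: contrast {contrast:5.1f} percentage points")

# Cutoff with the largest contrast in this invented data set.
largest = max(contrasts, key=contrasts.get)
print("largest contrast at cutoff", largest)
```

Tabulating the contrast over a range of cutoffs in this way shows how strongly a reported treatment effect can depend on a cutoff chosen after the data are seen, which is the caution the probability plot conveys visually.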
Use of a cutoff level of 0 (or 0.5 if the average of 2 readers is used) is often inadequate for differentiating patients with and those without progression, an issue that we recently encountered in a meta-analysis on the efficacy of DMARDs in slowing radiographic progression (15). We have previously advocated the concept of the smallest detectable difference (SDD) beyond measurement error as a minimum cutoff level for distinguishing patients with and those without radiographic progression (9). The SDD level can be easily plotted in the probability curve, and the benefit of doing so is obvious. It is easy to see whether the SDD cutoff is a conservative one with respect to treatment difference, and the implications of different cutoff levels can be immediately discerned.

Cumulative probability plots are an aid in explorative analysis. They certainly do not replace statistical testing, and should be used only as an adjunct to formal hypothesis testing. However, they may provide useful information if a between-group difference in a comparative clinical trial does not appear to be statistically significant. They can help in considering Type II error as a possible reason that a trend fails to reach statistical significance.

Probability plots also do not replace Bland and Altman plots. The latter are useful in determining important sources of measurement error: interreader variability and systematic error. Probability plots of change scores aggregated from 2 or more readers do not provide insight into interreader variability; rather, they enable visualization of the entire level of measurement error.

In summary, we propose cumulative probability plots as a new means to depict radiographic progression scores in reports of observational or methodologic studies and clinical trials. Probability plots may reveal additional and important information that is not provided by simply presenting medians and percentiles or box-and-whisker plots.
We advocate this application of probability plots in reports of studies involving assessment of radiographic progression, in order to help readers better understand what has occurred in the study.

REFERENCES

1. Plant MJ, Jones PW, Saklatvala J, Ollier WE, Dawes PT. Patterns of radiological progression in early rheumatoid arthritis: results of an 8 year prospective study. J Rheumatol 1998;25:417–26.
2. Wolfe F, Sharp JT. Radiographic outcome of recent-onset rheumatoid arthritis: a 19-year study of radiographic progression. Arthritis Rheum 1998;41:1571–82.
3. Sharp JT. Radiographic evaluation of the course of articular disease. Clin Rheum Dis 1983;9:541–57.
4. Larsen A, Dale K, Eek M. Radiographic evaluation of rheumatoid arthritis and related conditions by standard reference films. Acta Radiol 1977;18:481–91.
5. MacKay K, Mack C, Brophy S, Calin A. The Bath Ankylosing Spondylitis Radiology Index (BASRI): a new, validated approach to disease assessment. Arthritis Rheum 1998;41:2263–70.
6. MacKay K, Brophy S, Mack C, Calin A. Patterns of radiological axial involvement in 470 ankylosing spondylitis patients [abstract]. Arthritis Rheum 1997;40 Suppl 9:S61.
7. Averns HL, Oxtoby J, Taylor HG, Jones PW, Dziedzic K, Dawes PT. Radiological outcome in ankylosing spondylitis: use of the Stoke Ankylosing Spondylitis Spine Score (SASSS). Br J Rheumatol 1996;35:73–6.
8. Landewé R, Boers M, van der Heijde D. How to interpret radiological progression in randomized clinical trials? [editorial]. Rheumatology (Oxford) 2003;42:2–5.
9. Van der Heijde D, Simon L, Smolen J, Strand V, Sharp J, Boers M, et al. How to report radiographic data in randomized clinical trials in rheumatoid arthritis: guidelines from a roundtable discussion. Arthritis Rheum 2002;47:215–8.
10. Van der Heijde D, Boonen A, Boers M, Kostense P, van der Linden S. Reading radiographs in chronological order, in pairs or as single films has important implications for the discriminative power of rheumatoid arthritis clinical trials. Rheumatology (Oxford) 1999;38:1213–20.
11. Lassere M, Boers M, van der Heijde D, Boonen A, Edmonds J, Saudan A, et al. Smallest detectable difference in radiological progression. J Rheumatol 1999;26:731–9.
12. Boers M, Verhoeven AC, Markusse HM, van de Laar MA, Westhovens R, van Denderen JC, et al. Randomised comparison of combined step-down prednisolone, methotrexate and sulphasalazine with sulphasalazine alone in early rheumatoid arthritis. Lancet 1997;350:309–18. Erratum in: Lancet 1998;351:220.
13. Spoorenberg A, de Vlam K, van der Heijde D, de Klerk E, Dougados M, Mielants H, et al. Radiological scoring methods in ankylosing spondylitis: reliability and sensitivity to change over one year. J Rheumatol 1999;26:997–1002.
14. Landewé RBM, Boers M, Verhoeven AC, Westhovens R, van de Laar MAFJ, Markusse HM, et al. COBRA combination therapy in patients with early rheumatoid arthritis: long-term structural benefits of a brief intervention. Arthritis Rheum 2002;46:347–56.
15. Jones G, Halbert J, Crotty M, Shanahan EM, Batterham M, Ahern M. The effect of treatment on radiological progression in rheumatoid arthritis: a systematic review of randomized placebo-controlled trials. Rheumatology (Oxford) 2003;42:6–13.