Using and Understanding Medical Statistics_Matthews, Farewell_2007

Table 18.3. The factorial design for Stage II patients in the third National Wilms’ Tumor Study

                                             No radiation   2,000 rads
  Vincristine and actinomycin-D              Regimen W      Regimen X
  Vincristine, actinomycin-D and adriamycin  Regimen Y      Regimen Z

18.6. Factorial Designs

The majority of clinical trials are designed, primarily, to answer a single question. This is often an unnecessary restriction on the design of a trial, especially for diseases which require multi-modal therapy.

For example, when the third National Wilms’ Tumor Study was being designed, there were two questions of interest concerning Stage II, favorable histology, patients. One question concerned the chemotherapy comparison of two- and three-drug regimens which was mentioned in §18.4; the other was whether post-operative radiation was necessary for these patients. Since the number of cases of Wilms’ tumor is small and the relapse-free survival rate is very high, two separate trials to address these questions were not feasible.

Both questions can be answered, however, in a factorial design. The schematic layout for the design is shown in table 18.3. Patients are randomized among four regimens, and the sample size of the study need not be much larger than that required to answer either question separately. The radiation question is addressed by comparing regimen W to regimen X and regimen Y to regimen Z. The chemotherapy comparison is based on W versus Y and X versus Z. With this design, it is also possible to detect a synergistic (or antagonistic) interaction between the two modalities, although if such an effect is suspected, it might be necessary to increase the sample size somewhat.
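The pooled comparisons can be sketched in a short simulation. The relapse-free probabilities and the sample size below are invented for illustration (they are not NWTS-3 results); the point is only how each factor is assessed by pooling across the other:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical relapse-free probabilities for the four regimens
# (illustrative numbers only, not the NWTS-3 results).
p = {"W": 0.88, "X": 0.90, "Y": 0.91, "Z": 0.93}
n_per_arm = 150  # patients randomized to each regimen

outcome = {r: rng.binomial(1, p[r], n_per_arm) for r in p}

# Radiation question: pool the no-radiation arms (W, Y) against
# the 2,000-rad arms (X, Z).
no_rad = np.concatenate([outcome["W"], outcome["Y"]])
rad = np.concatenate([outcome["X"], outcome["Z"]])

# Chemotherapy question: pool the two-drug arms (W, X) against
# the three-drug arms (Y, Z).
two_drug = np.concatenate([outcome["W"], outcome["X"]])
three_drug = np.concatenate([outcome["Y"], outcome["Z"]])

print("radiation effect:", rad.mean() - no_rad.mean())
print("chemo effect:   ", three_drug.mean() - two_drug.mean())

# Interaction: does the radiation effect differ between chemo groups?
interaction = ((outcome["X"].mean() - outcome["W"].mean())
               - (outcome["Z"].mean() - outcome["Y"].mean()))
print("interaction:    ", interaction)
```

Each pooled comparison uses all the randomized patients, which is why the factorial design answers both questions without doubling the sample size.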

We will not describe the details of analyzing such studies; nevertheless, factorial designs pose no major analytical problems. Therefore, if the flexibility of such a trial design is attractive, researchers should not be reluctant to consider using designs of this type.

18.7. Repeated Significance Testing

In chapter 16, we discussed the problem of multiple comparisons. A related problem in the analysis of clinical trials is known as repeated significance testing.

18 The Design of Clinical Trials


Table 18.4. The overall probability of a significant test with repeated hypothesis testing [50]

  Nominal        Number of repeat tests
  significance   -------------------------------------------------------------
  level          1      2      3      4      5      10     25     50     200

  0.01           0.01   0.018  0.024  0.029  0.033  0.047  0.070  0.088  0.126
  0.05           0.05   0.083  0.107  0.126  0.142  0.193  0.266  0.320  0.424
  0.10           0.10   0.160  0.202  0.234  0.260  0.342  0.449  0.524  0.652

Reprinted by permission of the publisher.

When a clinical trial is ongoing, it is common, and ethically necessary, to prepare interim analyses of the accrued data. If one treatment can be shown to be superior, then it is necessary to stop the trial so that all patients may receive the optimal treatment. Unfortunately, the more frequently the study data are examined, the more likely it is that a ‘statistically significant’ result will be observed.

Table 18.4, taken from McPherson [50], illustrates this effect by showing the overall probability of observing a significant result at three nominal significance levels when a test is repeated differing numbers of times. Although this table is based on ‘some fairly rigid technical assumptions’ and may not be directly relevant to all clinical trials, it illustrates clearly that multiple tests at the same nominal significance level can be very misleading. For example, if we conduct ten analyses which test for a treatment difference at the nominal significance level of 0.05, the chance of falsely detecting a treatment difference is nearly 0.20, not 0.05.
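The inflation shown in table 18.4 is easy to reproduce by simulation. The sketch below assumes a simple setting — one-sample normal data with no true effect, ten equally spaced looks, and a two-sided test at a nominal 0.05 level at every look; these are illustrative choices, not necessarily McPherson’s exact assumptions — and should give an overall error rate close to the 0.193 entry in the table:

```python
import numpy as np

rng = np.random.default_rng(1)

n_trials = 20000      # simulated trials, all with no true effect
looks = 10            # number of interim analyses
n_per_look = 20       # observations accrued between looks
z_crit = 1.959964     # two-sided 5% critical value

# Data under H0: independent N(0, 1) observations.
x = rng.standard_normal((n_trials, looks * n_per_look))
cum = x.cumsum(axis=1)
n_at_look = np.arange(1, looks + 1) * n_per_look

# Standardized test statistic at each look: mean / (1 / sqrt(n)).
z = cum[:, n_at_look - 1] / np.sqrt(n_at_look)

ever_significant = (np.abs(z) > z_crit).any(axis=1)
print("P(significant at some look):", ever_significant.mean())
```

The successive test statistics are positively correlated, which is why the overall error rate grows more slowly than `1 - 0.95**10` but still far exceeds 0.05.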

There is a fair amount of statistical literature concerning formal trial designs which adjust for the effect of repeated significance testing; this area of research is known as sequential analysis. To a large extent, this predominantly theoretical work has had little effect on the actual design of medical trials. We believe that this is the case because much of the formalism does not accurately reflect the conditions under which many medical trials are conducted. A formal significance test is often one of many components in the decision to continue or stop a trial. Nevertheless, some of the more recent research in sequential analysis has greater potential for application and is affecting the design of clinical trials.

The main purpose of this section has been to make the reader aware of a frequently occurring problem in medical studies. A clinical trial should not be stopped as soon as a significant result at the 5% level has been detected. When data are constantly re-examined, and updated, the advice of a statistician should be sought before any major decisions are made on the basis of an analysis which ignores the effect of repeated statistical testing.

18.8. Sequential Analysis

The conventional view of a clinical trial can be regarded as a ‘fixed sample design’. This means that a sample size is determined at the planning stage, and that the trial results are analyzed once the specified sample size has been achieved. However, as the previous section has indicated, the usual monitoring of a clinical trial often makes it ‘de facto’ a sequential experiment with repeated analyses over time. In this section, we give a brief introduction to some actual sequential designs. Because of technical details which we choose not to discuss, we recommend that a statistician be consulted before initiating a sequential trial. Nonetheless, we hope this section will provide useful background material for interested readers.

Most sequential designs start with the supposition that the primary comparison of the clinical trial can be represented by a test statistic. We shall represent this statistic by Z to suggest that, under the null hypothesis of no treatment difference, it is usually normally distributed with mean 0 and variance 1. For example, Z might be the usual ratio of the estimated regression coefficient associated with treatment to its estimated standard error. At any point in time during the trial, Z can be calculated.

The approach to sequential design advocated by Whitehead [51] is to consider what we might expect to see, if the null hypothesis is true, and if Z was observed or calculated continuously over time. While this is clearly impractical, it is an approach which leads to reasonable procedures that can be slightly modified to reflect the usual monitoring strategy. The essential characteristic of the design ensures that if there is no treatment difference, the overall probability, for the complete trial, of concluding that the data are not consistent with the null hypothesis is a specified significance level α. The value represented by α would often be the customary 5% level of significance. A decision that the data are inconsistent with the null hypothesis is frequently referred to as ‘rejecting the null hypothesis’. Thus, in a sequential design of the type described above, the probability of rejecting the null hypothesis, on the basis of Z, sometime during the trial, is equal to α. By way of comparison, in a trial of fixed sample design a single significance test at level α is performed at the end of the trial. Since the technical details of Whitehead’s approach are beyond the scope of this book, we will not discuss it further.


A second approach, known as group sequential designs, acknowledges that analyses will usually take place at specified times and presents a design based on a plan to perform a fixed number of analyses, say K, at distinct times. Group sequential designs which parallel the continuous procedures of Whitehead [51] choose a testing significance level αj for the jth test which is the same for all tests and such that the overall probability of rejecting the null hypothesis, if it is true, is equal to α. Thus, for example, a design which involved four planned analyses and an overall significance level of 5% would perform a significance test at each analysis at a testing significance level of 0.018.
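The constant nominal level of 0.018 quoted for four analyses can be checked by simulation. The sketch assumes the usual model of equal, independent increments of information between looks (an illustrative assumption); under the null hypothesis the overall rejection probability should come out close to 0.05:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)

n_trials = 40000
K = 4
nominal = 0.018                                  # constant level quoted for K = 4
z_crit = NormalDist().inv_cdf(1 - nominal / 2)   # two-sided critical value

# Under H0, each inter-analysis increment of information contributes an
# independent N(0, 1); Z_k is the standardized cumulative sum.
inc = rng.standard_normal((n_trials, K))
z = inc.cumsum(axis=1) / np.sqrt(np.arange(1, K + 1))

overall = (np.abs(z) > z_crit).any(axis=1).mean()
print("overall significance level:", overall)
```

Because the successive Z statistics are correlated, four tests at 0.018 accumulate to roughly 0.05 overall rather than 4 × 0.018.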

We are sympathetic to the arguments advanced by Fleming et al. [52] that treatment differences observed in the early stages of a trial may occur for a variety of reasons, and that the primary purpose of a sequential design is to protect against unexpectedly large treatment differences. Therefore, Fleming et al. advocate using group sequential designs which preserve the sensitivity to lateoccurring survival differences that a fixed sample design based on a single analysis would have. In addition, they argue that if the final analysis of a group sequential design is reached, then one would like to proceed, as much as possible, as if the preliminary analyses had not been done and a fixed sample design had been used.

To achieve these ends, Fleming et al. present designs in which the level of significance at which an intermediate analysis is performed increases as the trial progresses, and such that the testing level of significance for the final analysis is close to the overall level of α. Their proposal fulfills the ethical requirement of protecting patients while not creating substantial additional difficulties in the data analysis. The designs are characterized by K, the number of planned analyses, α, the overall significance level, and by εα, the probability of terminating the trial early if the null hypothesis is true. The fraction ε is, in some sense, the proportion of the overall probability of rejecting the null hypothesis which is used up prior to the final analysis. If we denote the testing levels of significance for the K analyses by α1, α2, ..., αK, then specifying ε is equivalent to specifying the ratio of αK and α, i.e., R = αK/α. This ratio indicates how close to the overall level the final analysis is to be performed, and reflects the effect which the sequential nature of the design is allowed to have.

Table 18.5 presents a subset of the designs described in Fleming et al. [52]. The table covers the cases specified by α = 0.05, K = 2, 3, 4, and 5 and ε = 0.1, 0.3 and 0.5. For example, if three analyses were planned and it was important to keep the ratio R high, i.e., ε = 0.1 so that R = 0.04831/0.05 = 0.97, then the testing significance levels would be α1 = 0.00250, α2 = 0.00296 and α3 = 0.04831. On the other hand, if a more liberal stopping criterion was desirable, the design with ε = 0.5 would result in testing significance levels of α1 = 0.01250, α2 = 0.01606 and α3 = 0.03558 with R = 0.03558/0.05 = 0.71.
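A simulation of the same kind can check one of these designs. The sketch below uses the K = 3, ε = 0.5 levels quoted above (0.01250, 0.01606, 0.03558) and assumes two-sided tests with equal information increments between analyses (illustrative assumptions). The probability of early termination under the null hypothesis should come out near ε × 0.05 = 0.025, and the overall level near 0.05:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)

levels = np.array([0.01250, 0.01606, 0.03558])   # K = 3, epsilon = 0.5 design
crit = np.array([NormalDist().inv_cdf(1 - a / 2) for a in levels])

n_trials = 40000
inc = rng.standard_normal((n_trials, 3))
z = np.abs(inc.cumsum(axis=1) / np.sqrt([1.0, 2.0, 3.0]))

stopped_early = (z[:, :2] > crit[:2]).any(axis=1)   # rejected at look 1 or 2
rejected_final = ~stopped_early & (z[:, 2] > crit[2])

print("P(early termination):", stopped_early.mean())
print("overall level:", (stopped_early | rejected_final).mean())
```

Note how little the final testing level (0.03558) gives up relative to a fixed sample test at 0.05, which is the point of Fleming et al.’s proposal.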


Table 18.5. Testing significance levels α1, ..., α5 for some group sequential designs

  K    ε      α1        α2        α3        α4        α5

  2    0.1    0.00500   0.04806
       0.3    0.01500   0.04177
       0.5    0.02500   0.03355

  3    0.1    0.00250   0.00296   0.04831
       0.3    0.00750   0.00936   0.04292
       0.5    0.01250   0.01606   0.03558

  4    0.1    0.00167   0.00194   0.00233   0.04838
       0.3    0.00500   0.00612   0.00753   0.04342
       0.5    0.00833   0.01047   0.01306   0.03660

  5    0.1    0.00128   0.00144   0.00164   0.00219   0.04806
       0.3    0.00379   0.00447   0.00527   0.00642   0.04319
       0.5    0.00634   0.00776   0.00931   0.01134   0.03691

The overall level of significance α is 0.05, and ε(0.05) is the probability of terminating the trial early, if the null hypothesis is true, at any of the K analyses.

Adapted from Fleming et al. [52]; it appears here with the kind permission of the publisher.

Table 18.6. Anticipated results from the use of a sequential design proposed by Fleming et al. [52] for a clinical trial of extensive stage small-cell lung cancer

  Date       Total number of patients   Number of   Testing significance   Log-rank
             randomized to              deaths      level for early        p-value
             regimen A    regimen B     observed    termination            observed

  9/12/77    19           17            15          0.007                  0.013
  5/05/78    30           32            30          0.008                  0.214
  11/12/78   32           33            45          0.010                  0.701
  7/15/79    32           33            60          0.040                  0.785

Adapted from Fleming et al. [52]; it appears here with the kind permission of the publisher.


Table 18.6 is abstracted from Fleming et al. and reports the results of a clinical trial of extensive stage small-cell lung cancer. Two chemotherapy regimens, denoted by A and B, were to be compared. Calculations based on a fixed sample design to compare death rates using the log-rank test suggest that the study would require about 60 deaths. The nature of these calculations is outlined in chapter 17. If we assume that K = 4 log-rank analyses are planned during the trial, one every 15 deaths, and if we also require R = α4/α = 0.8, then the four testing significance levels which result are α1 = 0.007, α2 = 0.008, α3 = 0.010 and α4 = 0.040.

From table 18.6 it can be seen that although there was a relatively large treatment difference early in the trial, this difference would not have been sufficient to stop the trial. Moreover, by the end of the trial no treatment difference was apparent.

This section is not intended to be a comprehensive treatment of the topic of sequential designs. Additional study of the subject, and consultation with a statistician, would be essential before embarking on a clinical trial which involves a sequential design. However, we do feel that the designs proposed by Fleming et al., which we have described, are consistent with the usual practice of clinical trials. Therefore, they may be of interest to some readers.


19. Further Comments Regarding Clinical Trials

19.1. Introduction

Chapter 18 provides a basic introduction to clinical trials. While we hope that the discussion there was realistic, it necessarily adopted a fairly simple view of a clinical trial. In this chapter, we raise several issues that are somewhat more complex, but which we feel it worthwhile to bring to the reader’s attention.

19.2. Surrogate Endpoints

For diseases that involve a lengthy delay between the initiation of treatment and the determination of its outcome, keeping the trial as short as possible is often an important goal. One way to achieve this is to base the analysis of the trial on a ‘surrogate endpoint’ or ‘surrogate marker’ which is thought to be an early indicator of outcome. For example, in trials designed to evaluate the ability of an experimental drug to delay the progression of HIV disease, investigators might consider using the level of CD4+ lymphocytes or a measure of viral load as a surrogate for the ‘harder’ clinical endpoints such as the onset of AIDS or death. The underlying logic is that if decreased levels of CD4+ cells are associated with increased risk of an AIDS diagnosis, then a consistently depressed CD4+ count could be regarded as a relevant clinical endpoint on which to base analyses. In drug regulation, there are obvious advantages to patients and pharmaceutical companies if biomarkers can be used as valid surrogates of the ultimate clinical benefit of treatments and thus shorten the time span of the trial. Likewise, the use of surrogate endpoints may be considered when the response of primary interest is difficult to observe, expensive to measure, or involves a dangerous invasive procedure.

Prentice [53] defines a valid surrogate endpoint as ‘a response variable for which a test of the null hypothesis of no relationship to the treatment groups under comparison is also a valid test of the corresponding null hypothesis based on the true endpoint.’ This definition is very useful. For example, it identifies that a surrogate endpoint for one clinical trial may not be useful for another clinical trial involving the same primary endpoint, but different treatments. Nevertheless, the definition is a fairly restrictive one that is rarely satisfied in practice.

The dangers involved in using even apparently sensible surrogate endpoints have been highlighted by Fleming [54], who discussed two trials. Ventricular arrhythmias are a risk factor for subsequent sudden death in individuals who have had a recent myocardial infarction. The drugs encainide and flecainide were widely used in treatment because their antiarrhythmic properties had already been established. Nevertheless, the Cardiac Arrhythmia Suppression Trial was initiated and, based on 2,000 randomized patients, established that the death rate associated with using these drugs was nearly three times the corresponding rate for placebo controls.

Fleming’s second example is that of a trial concerning the role of γ-interferon in the treatment of chronic granulomatous disease (CGD). Phagocytes from CGD patients ingest microorganisms normally but fail to kill them due to an inability to generate a respiratory burst that depends on the production of superoxide and other toxic oxygen metabolites. The disease results in a risk of recurrent serious infections. Since γ-interferon was thought to be a macrophage-activating factor that could restore superoxide anion production and bacterial killing by phagocytes in CGD patients, a trial was designed initially that would involve one month of treatment and would use as response variables endpoints that depended on the two surrogate markers, superoxide production and bacterial killing. Ultimately, the investigators decided to administer γ-interferon for a year in a trial that measured the occurrence of serious infections as the outcome. This trial established the effectiveness of γ-interferon in reducing the rate of serious infection. However, when the biological surrogate marker data that had been proposed as outcome variables in the initial design were analyzed, treatment with γ-interferon had no apparent effect. Thus, a trial based on these two surrogate markers as endpoints would have failed to detect an effective treatment.

These two examples cited by Fleming, demonstrating the potential for both false positive and false negative results in conjunction with the use of surrogate endpoints, provide a useful caution against the uncritical use of outcomes other than those of primary clinical interest. Careful evaluation of surrogate endpoints should precede their use in a clinical trial; Burzykowski et al. [55] describe methods for such an assessment. Note, in particular, that a strong correlation between the surrogate endpoint and a true endpoint does not ensure that the surrogate will be a good one.

There is some potential, perhaps, in the joint use of surrogate markers and primary outcome variables but initial investigations into appropriate methodology have not been as promising as had been hoped. We encourage readers who are considering the use of surrogate endpoints to keep the following in mind: the question asked is the one that gets answered!

19.3. Active Control or Equivalence Trials

When an effective standard therapy exists, a new experimental treatment may be investigated because of reduced toxicity, lower cost or some other characteristic that would make it, the experimental therapy, the treatment of choice if its efficacy was equivalent to or better than the standard. The design of this type of trial – an active control or equivalence trial – cannot be based on the usual significance test because, as is the case for all significance tests, failure to reject a null hypothesis of no treatment difference in a clinical trial does not establish the equivalence of the treatments thus compared. A large significance level associated with a test of this null hypothesis indicates that the data gathered during the trial are consistent with the null hypothesis. However, in order to learn what size of treatment effects remain plausible in light of the data, it is necessary to look at confidence intervals.

Fleming [56] advocates the use of confidence intervals to analyze active control trials. First, he identifies a point, denoted by θe, that represents overall therapeutic equivalence of the treatments under study. This point is defined in terms of an efficacy outcome, but its value will depend on other considerations such as toxicity and cost. Next, Fleming specifies a quantity that represents the departure from θe that would lead to the conclusion that one treatment under study was superior to the other; this quantity is represented by the Greek letter δ. Finally, Fleming defines the relative efficacy of a placebo to the standard therapy to be γ; the value of γ is assumed to be known from previous studies.

Although specifying θe, δ, and γ clearly involves considerable subjectivity, these three characteristics together provide a framework for analyzing an equivalence trial. The precise nature of this structure is illustrated in figure 19.1. The horizontal axis in the figure indicates the relative efficacy of the experimental treatment to the active control and could represent a mean difference, an odds ratio, a relative risk, or any other appropriate comparative measure. For example, if the lower limit of a 95% confidence interval for the relative efficacy of the experimental treatment to the active control exceeds θe − δ, and it also exceeds γ, then the experimental treatment will be considered equivalent or superior to the active control. In this case, the evidence indicates that the experimental treatment is not inferior to the standard and provides some benefit compared to no treatment whatsoever, i.e., placebo control. Similarly, if the upper limit of a 95% confidence interval for the relative efficacy of the experimental treatment to the active control is less than θe + δ, the hypothesis that the experimental therapy is superior will be rejected. If the lower limit exceeds θe − δ and the upper limit is less than θe + δ, the two treatments are considered equivalent. In many situations there may be no particular interest in equivalence per se, and it is simply that the lower limit of the confidence interval exceeds θe − δ that is of interest. In this case, the terminology non-inferiority trial is sometimes used instead of equivalence trial.

Fig. 19.1. Key aspects of the framework proposed by Fleming [56] for analyzing an equivalence trial.
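These decision rules can be written down directly. In the sketch below, `theta_e` denotes the point of overall therapeutic equivalence, `delta` the departure representing superiority, and `gamma` the relative efficacy of placebo to the standard; the notation and the residual "inconclusive" category are our choices for illustration:

```python
def classify(ci_low, ci_high, theta_e, delta, gamma):
    """Classify an active control comparison from a 95% CI for the
    relative efficacy of the experimental arm versus the standard.

    theta_e : point of overall therapeutic equivalence
    delta   : departure from theta_e regarded as superiority
    gamma   : relative efficacy of placebo versus the standard
    (All three are context-specific judgments.)
    """
    # Both limits inside (theta_e - delta, theta_e + delta): equivalence.
    if ci_low > theta_e - delta and ci_high < theta_e + delta:
        return "equivalent"
    # Lower limit rules out both inferiority and a placebo-like effect.
    if ci_low > theta_e - delta and ci_low > gamma:
        return "equivalent or superior (non-inferior)"
    # Upper limit rules out superiority of the experimental arm.
    if ci_high < theta_e + delta:
        return "experimental not superior"
    return "inconclusive"


# Example: CI (0.1, 0.4) with theta_e = 0, delta = 0.5, gamma = -0.2
print(classify(0.1, 0.4, 0.0, 0.5, -0.2))  # → equivalent
```

The function only encodes the confidence-interval logic of the text; choosing theta_e, delta, and gamma remains the substantive clinical judgment.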

Note that the logic underlying sample size calculations for equivalence trials is somewhat different from that outlined in chapter 17, and a statistician or specialized references should be consulted. Essentially, the sample size has to be sufficient to ensure that the width of the confidence interval for the relative efficacy is small enough to allow values below θe − δ to be excluded with high probability if the relative efficacy is at least θe.
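As a rough illustration of this logic for a comparison of two means, a normal-approximation sample size can be computed as follows. This is a textbook approximation under assumptions we state in the code, not a substitute for the specialized references mentioned above:

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(sigma, margin, alpha=0.05, power=0.9):
    """Normal-approximation sample size per arm for a non-inferiority
    comparison of two means with common standard deviation sigma.

    Assumes the true relative efficacy sits at the equivalence point,
    so the trial succeeds when the lower confidence limit excludes
    values more than `margin` below that point with the stated power.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sigma * (z_alpha + z_beta) / margin) ** 2)


# e.g. a margin of a quarter of a standard deviation
print(n_per_arm(sigma=1.0, margin=0.25))  # → 337
```

Note how quickly the required sample size grows as the margin shrinks, which is why equivalence trials are often larger than superiority trials of the same treatments.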

Adopting a sequential design can often improve an equivalence clinical trial. As well, it may be desirable to incorporate other outcomes, for example toxicity, into the formal analysis of the trial. Methodology to facilitate these extensions has been developed. Since the intent of this section was solely to introduce the different sort of logic that must be considered when questions of equivalence are entertained, we will not attempt to discuss sequential methods for equivalence trials.

19.4. Other Designs

In our discussion of the design of clinical trials so far, we have assumed the most common situation, which is sometimes termed a parallel group design. In such trials, patients are individually randomized to two or more treat-

