arise, however, if an investigator compared a mortality rate for a recent year against the corresponding rate for each of a series of previous years.
The main role of time in multiple comparisons, however, usually occurs as a serial problem in timing of outcome measurements, discussed next, or in the sequential accrual of data, discussed in Section 25.3.
25.1.2.6 Timing of Outcome Measurement — If the outcome is a binary event, such as death or recurrences of myocardial infarction, the analysis is usually managed with the “life-table” methods, discussed in Chapter 22, that focus on time to occurrence of the “failure” event. In many other studies, however, the outcome is expressed with an ordinal or dimensional variable, such as level of pain or blood pressure, for which average magnitude (or change) in two treated groups can be checked at different serial times after onset of treatment. For pain, the intervals might be at 15 min, 30 min, 60 min, and 90 min; for blood pressure, they might be at 1, 2, 4, 6, and 8 weeks.
If stochastic significance is present at just one or two of these intervals, but not at the others, questions can arise about how (if at all) to adjust the α level for the multiple comparisons.
25.1.3 Analytic Sources
Beyond all the opportunities just cited in the architecture of the research, multiple comparisons can also be produced as a purely statistical activity when the data are analyzed. One opportunity comes when the same comparison receives alternative stochastic tests. Another opportunity arises during a “stepped” sequence for multivariable analyses.
25.1.3.1 Alternative Stochastic Procedures — A multi-comparison issue that seldom receives statistical attention occurs when the same hypothesis in the same set of data is tested with different stochastic procedures. For example, a set of dimensional results in two groups can be contrasted stochastically with a t test, a Wilcoxon-Mann-Whitney rank test, or a Pitman-Welch permutation test.
[One-tailed and two-tailed interpretations cannot be regarded as more than one comparison, because they use different stochastic hypotheses. The null hypothesis would be Δ = 0 for two tails, and Δ ≤ 0 (or Δ ≥ 0) for one tail.]
If the alternative stochastic tests do not all lead to the same conclusion for “significance,” the data analysts usually choose the results they like. This problem is probably ignored in most statistical discussions because it arises not from random chance, but from the different mathematical strategies used to evaluate the role of random chance.
25.1.3.2 Sequential Stepping in Multivariable Analysis — A more “legitimate” random-chance problem occurs in multivariable regression where the outcome event is a single dependent variable that is associated with different combinations of individual “independent” variables in sequentially stepped analyses. The stepping can go in an upward (or “forward”) direction, starting with one independent variable and progressively adding others, or in a downward (or “backward”) direction, starting with all the available independent variables and progressively removing them one at a time. [The process can also go back and forth in a “stepwise” (or “zig-zag”) arrangement.]
Despite occasional discussion,13 no agreement has been reached on whether the α level should be adjusted for the multiple inferences. Consequently, the problem is usually ignored, and the analysts then apply the same preset level of α without alteration during each of the successive analyses.
25.2 Management of Multiple Comparisons
The management of multiple comparisons is a complex and controversial issue. The proposals range from a purely stochastic “correction,” to a “benign neglect” that imposes no adjustment at all, to decisions based on architectural and other scientific principles.
The many mathematical proposals offered for prophylactic and remedial management of the “multiple inference” problem have enriched the world of statistical eponyms with such names as Bonferroni,14 Duncan,15 Dunn,16 Dunnett,17 Newman-Keuls,18,19 Scheffé,20 and (of course) Tukey.21,22 In almost all of the proposed methods, the overall level of α is “penalized” by reduction to a smaller α′ for the individual comparisons (or tests). When a smaller α′ is used for individual tests, the overall level of α can be kept at the desired boundary for the total set of comparisons. The diverse proposals differ in the arrangements used for planning the multiple comparisons and for setting the penalties.
25.2.1.1 Bonferroni Correction — Of the diverse penalty proposals, the most obvious and easy to understand is called the Bonferroni correction.14 For k comparisons, the level of α for each test is reduced to α′ = α/k. Thus, with 6 comparisons, each would be called stochastically significant only if its P value were below α′ = .05/6 = .00833. This tactic would make 1 − α′ become 1 − .00833 = .99167 for each comparison, and (1 − α′)^k would be (.99167)^6 = .951 for the group. The overall α = 1 − .951 = .049 would remain close to the desired level of .05.
For interpretation, each of the multiple P_i values is multiplied by k (i.e., the number of comparisons) and the “Bonferronied” P value is compared against the original α. For example, if 8 comparisons were done, a result that obtains P = .023 would be evaluated as though P = .184. If decisions are made with confidence intervals rather than P values, the confidence interval is constructed with α/k rather than α. Thus, for 8 comparisons at α = .05, the value of Z_α selected for the confidence interval would be Z_.00625 rather than Z_.05. The larger adjusted value of Z would produce a larger confidence interval and a reduced chance of excluding the null value of the stochastic hypothesis.
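To make the arithmetic concrete, here is a minimal sketch in plain Python; the function names are illustrative rather than taken from any statistical library, and the numbers reproduce the worked examples above.

def bonferroni_alpha(alpha, k):
    """Reduced per-test level: alpha' = alpha/k."""
    return alpha / k

def bonferronied_p(p, k):
    """'Bonferronied' P value: P_i multiplied by k (capped at 1)."""
    return min(1.0, p * k)

alpha, k = 0.05, 6
a_prime = bonferroni_alpha(alpha, k)        # .05/6 = .00833
overall = 1 - (1 - a_prime) ** k            # 1 - (.99167)^6 = .049
print(f"alpha' = {a_prime:.5f}; overall alpha = {overall:.3f}")

# With 8 comparisons, an observed P = .023 is evaluated as though P = .184.
print(f"Bonferronied P = {bonferronied_p(0.023, 8):.3f}")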
In recent years, the Bonferroni procedure has been criticized for having low power, i.e., truly “significant” results may be declared “nonsignificant.” Of alternative methods proposed by Holm,23,24 Hochberg,25 and Simes,26 the Holm procedure, which requires arranging the P_i values in increasing order, has been hailed27 as “simple to calculate, … universally valid, … more powerful than the Bonferroni procedure,” and a preferable “first-line choice for distribution-free multiple comparisons.” Nevertheless, after examining the operating characteristics of 17 methods for doing the correction, Brown and Russell28 concluded that “the only guaranteed methods to correct for multiplicity are the Bonferroni method and (the Holm) step-down analogue.”
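The Holm step-down procedure can be sketched just as briefly. The sketch below assumes the standard formulation, in which the i-th smallest P value is tested against α/(k − i + 1); it is an illustration, not code from the cited references.

def holm_reject(p_values, alpha=0.05):
    # Sort the P values in increasing order; test the smallest against
    # alpha/k, the next against alpha/(k-1), and so on; stop at the
    # first failure and reject only the hypotheses that preceded it.
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    reject = [False] * k
    for step, i in enumerate(order):
        if p_values[i] <= alpha / (k - step):
            reject[i] = True
        else:
            break
    return reject

# With k = 4, the successive thresholds are .0125, .0167, .025, and .05.
print(holm_reject([0.011, 0.020, 0.030, 0.400]))  # [True, False, False, False]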
Although each of the many multiple-comparison proposals has its own special mathematical virtues, the Bonferroni correction has the scientific advantage of remarkable simplicity. It is still probably the best correction to know, if you plan to know only one.
25.2.1.2 Other Proposals — Unlike the Bonferroni and other tactics just cited, which can be applied in almost any situation, most other proposals29,30 for managing the multiple-comparison problem are “tailored” to the particular structure of the research. For example, when treatments A, B, C, and D are tested simultaneously in the same trial, the conventional statistical analysis might begin with an analysis of variance (discussed later in Chapter 29), which first checks for “overall significance” among the four treatments. The stochastic adjustments thereafter are aimed at the pairwise comparisons of individual treatments.
If accruing data for two treatments are checked at periodic intervals in an ongoing randomized trial, the stochastic adjustments are set at lower levels of α′ for each of the sequential decisions, so that α will have an appropriate value when the trial has finished. The choices proposed for the sequentially lowered α′ levels have brought an additional set of eponyms, which are discussed later in Section 25.3.3.3.
Among other purely stochastic approaches, Westfall and Young31 have developed a compendium of bootstrap correction methods. Another strategy32 is to split the data into two parts, of which the first is used to explore multiple hypotheses. Those that are “significant” are then tested in the second part; and attention thereafter is given to only the hypotheses that “passed” both tests.
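As a minimal sketch of that split-sample strategy, the outline below uses a hypothetical run_test function (returning a P value for one hypothesis in one half of the data); the shapes of data and hypotheses are likewise placeholders for whatever analysis is being done.

import random

def split_sample_screen(data, hypotheses, run_test, alpha=0.05, seed=0):
    # Randomly split the data into an exploratory half and a confirmatory
    # half; keep only the hypotheses "significant" in both halves.
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    explore, confirm = shuffled[:mid], shuffled[mid:]
    passed_first = [h for h in hypotheses if run_test(explore, h) < alpha]
    return [h for h in passed_first if run_test(confirm, h) < alpha]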
25.2.1.3 Problems of Intercorrelation — When multiple factors are individually checked for their impact on an outcome event, some of them may be stochastically significant by chance alone, but their actual impact may be altered by correlations among the diverse factors. An example of this problem was noted3 for intercorrelations among such baseline “risk factors” as blood pressure, weight, serum cholesterol, age, and pulse rate among men who did or did not develop coronary disease in the famous Framingham study.
The suitable adjustment of intervariable correlations is more of a descriptive than a stochastic challenge, however, and is managed with multivariable analytic methods. The main intervariable problems arise, as noted in Section 25.1.3.2, when multiple testing is done as individual variables are added or removed incrementally during a “stepping” process.
25.2.1.4 Bayesian Methods — All of the challenges of interpreting and adjusting P values can be avoided by using the methods of Bayesian inference. For each relationship, a prior probability is specified beforehand. It then becomes transformed, with the likelihood ratio determined from the observed data, into a posterior probability that reflects belief in the observed relationship, given the data. Adjustments are not needed for multiple-null-hypothesis P values, because each relationship has its own prior probability.
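In its simplest odds form (posterior odds = prior odds × likelihood ratio), the transformation can be sketched as follows; the numbers are invented for illustration.

def posterior_probability(prior, likelihood_ratio):
    # posterior odds = prior odds * likelihood ratio
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# A relationship given prior probability .10, with data whose likelihood
# ratio is 9, ends with a posterior probability of .50.
print(posterior_probability(0.10, 9))  # 0.5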
Appealing as the Bayesian approach may be, it has one supreme disadvantage. The prior probabilities cannot be established beforehand for relationships that emerge as “significant” only in conventional P values obtained after the data have been explored. A second problem, which may or may not be a disadvantage, is that the selection of prior probabilities, even when possible, will depend more on scientific anticipations (discussed later in Section 25.2.4) than on purely mathematical estimations.
25.2.2 “Permissive Neglect”
At the opposite extreme of management, the stochastic adjustments are replaced by a “permissive-neglect” argument that “no adjustments are needed for multiple comparisons.”33 Perhaps the simplest way of stating this argument is that a stochastic test is intended to demonstrate numerical stability of the results. If not stable, they need receive no further attention; if stable, their further evaluation depends on the scientific context and connotations. The stochastic criterion for stability, however, should not be altered according to whatever additional comparisons may have occurred. If the number of additional comparisons is really pertinent, perhaps “an investigator should control his ‘career-wise’ alpha level, or … all investigators should agree to control the ‘discipline-wise’ level.”32
Proponents of the permissive-neglect argument may dismiss not only the stochastic but also the scientific approach to multiple comparisons. According to one proposal,34 the investigator’s “perspective” in deciding what data to collect and analyze is “irrelevant to assessing the validity of the product.” The “motivations for including the specific items in the study and conducting the analyses has [sic] no independent relation to the quality of the data generated.”
At the extreme of the anti-multiple-comparison-correction argument is the claim that hypotheses generated from the data need not be distinguished from those tested with the data.34,35 Cole has proposed a theoretical hypothesis-generating machine,35 with which “all possible cause-effect hypotheses have been generated,” so that all subsequent activities can be regarded as hypothesis testing rather than generating. This approach would allow epidemiologists to use a counterpart of the “Texas sharpshooter strategy” in which someone fires a shot at a barn and then draws a target around the site where the bullet hit.
25.2.3 Scientific Validity
In contrast to the mathematical rigidity of the purely stochastic approach and the nihilistic permissiveness of the no-correction approach, multiple comparisons can be evaluated with principles, discussed elsewhere36 and earlier in Section 11.10.3, that focus on scientific rather than statistical inference. The principles refer to the internal validity of the comparisons or associations examined in the data, and to their external validity when applied beyond the particular persons who were investigated.
For internal validity, the architectural components of the research should be free from four types of biased comparison: susceptibility bias in the baseline states, performance bias in the maneuvers, detection bias in the outcome events, and transfer bias in the collection of groups and data. For external validity, the groups and data should be suitable for the derived conclusion, and for the observed results to be extrapolated to the outside world beyond the special conditions of the study. For scientific decisions, these internal and external evaluations should always take place after (and preferably before) any statistical conclusions have been formed. If scientific validity is dubious, the statistical results can often be dismissed without further attention.
For example, when multiple-comparison analyses lead to highly publicized reports that pizza protects against prostate cancer whereas hot dogs raise the risk of brain cancer, a first step might be to determine whether the surveillance and monitoring that detected those cancers were carried out similarly in the groups who are “exposed” or “non-exposed” to the ingestion of pizza or hot dogs. In most such reports, the investigators have given no attention to the problem of “detection bias.”
Without resorting to architectural appraisals, scientists who are familiar with real-world medical phenomena might even use a simple analog of Bayesian inference called “clinical common sense.” For example, after the erroneous statistical association that led to coffee being widely publicized as a cause of pancreatic cancer, a distinguished Canadian scientist37 told me, “I knew it couldn’t be true. If it were, pancreatic cancer would be a major problem in Brazil.” Analogous common sense might have been used to dismiss the anti- or pro-carcinogenic assertions about pizza and hot dogs, and to avoid some of the other epidemiologic embarrassments produced by erroneous but highly publicized accusations about the “menace of daily life.”38
25.2.4 Specification of Scientific Hypotheses
An additional scientific approach to the multiple-comparison problem involves appraising the timing and specificity with which the statistical hypotheses were formulated. The choice and severity of a stochastic (or other) penalty might depend on the way those hypotheses were scientifically articulated before the analyses began. This approach has been advocated even in statistical publications, with the recommendation that no corrections are needed for multiplicity when “a select number of important well-defined clinical questions are specified at the design.”39
If the scientific hypotheses are stipulated in advance when the research is planned, the main mathematical challenges are to choose one- or two-tailed test directions, magnitudes for the boundaries of δ and ζ, and pertinent stochastic levels for α and β. In multiple-comparison problems, however, the scientific hypotheses may have been stipulated, vaguely anticipated, uncertain, or unknown before the data were statistically analyzed. If suitable attention is given to these previous specifications for the scientific hypothesis, the multiple-comparison problem might be effectively managed with substantive, rather than statistical, decisions.
25.2.4.1 Stipulated — Suppose the investigator wants to compare the virtues of a new analgesic agent, Excellitol, against a standard active agent, such as aspirin. To get approval from a regulatory agency, Excellitol must also be compared against placebo. When the randomized trial is designed with three “arms” — Excellitol, aspirin, and placebo — the main goal is to show that Excellitol is better than aspirin or at least as good. In this trial, the placebo “arm” does not really have a therapeutic role; its main job is scientific, to avoid the problem of interpretation if two allegedly active analgesic agents seem to have similar effects. The similar results may have occurred if the active agents were not adequately challenged because the patients under study had pain that was either too severe or too mild. A superiority to placebo will indicate that both active agents were indeed efficacious.
Many data analysts would readily agree that previously stipulated hypotheses should be tested without a mathematical penalty. Instead of the “three-arm” randomized trial just discussed, the investigators might have done three separate trials, testing Excellitol vs. aspirin in one study, aspirin vs. placebo in another, and Excellitol vs. placebo in a third. Because the results of each of these three studies would be appraised with the customary levels of α, a penalty imposed for research done with an efficient design would seem to be an excessive act of pedantry.
Similarly, when investigators check multiple post-therapeutic outcomes to explain a significant main result, no penalties seem necessary or desirable. For example, if Treatment A achieved significant improvement in the main global rating, various other individual manifestations — such as pain, mobility, or dyspnea — may then be checked to help explain other significant distinctions in the compared treatments. In these explanatory contrasts, the main role of the stochastic hypothesis is to confirm numerical stability, not to test new causal concepts.
In another situation, if cogent subgroups are defined by well-established and well-documented confounders, the examination of their results would represent the equivalent of previously stipulated hypotheses. For example, if a new chemotherapy significantly prolongs survival in a large group of patients with advanced cancer, the results can quite reasonably be checked in “well-established” subgroups that are demarcated by such well-known prognostic factors as age and severity of clinical stage. The subgroups would not be well established or well documented if they were “dredged” from data for such factors as occupation or serum sodium level. Analysis of results within clinically cogent subgroups40 would therefore be an appropriate scientific activity, requiring no mathematical penalties. Because the explanatory tests would be done to confirm and explain the original hypothesis, not to generate new ones, the level of α need not be penalized for the tests.
25.2.4.2 Vaguely Anticipated — Suppose 8 active analgesic agents — A, B, C, D, E, F, G, and H — are compared in the same trial, in addition to aspirin and placebo. The investigators who do such a 10-“arm” trial might check all of the 45 (= 10 × 9/2) pairwise comparisons afterward, but the main focus might be only on the 8 comparisons of active agents vs. aspirin, and the 9 comparisons of active agents vs. placebo. The original scientific hypothesis was that aspirin would be inferior to at least one of the non-aspirin agents, but its identity was not stipulated. Hence, a superior individual result, if found, can be regarded as vaguely anticipated. A stochastic penalty might be imposed for the 8 tests of active agents vs. aspirin, but not for all the other comparisons.
25.2.4.3 Uncertain — Examples of multiple explorations for previously uncertain hypotheses can be found in most retrospective case-control studies. In cohort research, a counterpart example is provided by the results now published for more than 20 “cause–effect” relationships examined in a group of about 100,000 nurses who were assembled and “followed-up” by responses to mailed questionnaires.4,41
Because most of the “risk” information would not be collected if it were regarded as wholly unrelated to the outcome event, the investigators may argue that any positive results in the multiple comparisons were “anticipated.” Furthermore, stochastic penalties are obviously not needed in certain analyses where special additional data were collected for tests of specific scientific hypotheses.41 On the other hand, if the relationship between an individual risk factor and an individual outcome were not deliberately stipulated in advance as a focus of the research, the scientific hypothesis for each test can be regarded as uncertain. A stochastic penalty might be imposed, but it need not be quite as harsh as the draconian “punishment” warranted for totally unknown hypotheses discussed in the next subsection.
25.2.4.4 Unknown — The data-dredging activities mentioned earlier are the most glaring example of the multiple-comparison problem, which is abetted by the modern availability of digital computers that can store and easily process huge amounts of data. With this capability, computer programs can be written to check any appropriate “risk” variable against any appropriate “outcome” variable in any suitably large collection of data. The computer can then print stars or ring bells whenever a “significant” distinction is found.
In one famous example of this type of “hypothesis-generating” adventure, previous exposures to medication were associated with hospital admission diagnoses for an estimated 200 drugs and 200 diseases in a “data bank” for a large group of patients.42 Among the “significant” results of the explorations, which were not stochastically corrected for multiple comparisons, was a relationship suspected as causal between reserpine and breast cancer.43 After diverse sources of bias were later revealed in the original and in two concomitantly published “confirmatory” studies,44,45 the relationship has now become a classic teaching example of sources of error in epidemiologic research.
If the analytic procedures are pure acts of exploration, data-dredging, or other exercises in “looking for a pony” whose identity was not anticipated beforehand, some type of intervention is obviously needed to protect the investigators from their own delusions. For example, if the primary hypotheses were not supported by the main results of a clinical trial, and if the investigators then explore other outcome events (such as individual manifestations or laboratory data) hoping to find something — anything — that can be deemed “significant,” the process is a counterpart of data dredging. Similarly, in various epidemiologic studies, when diverse variables are checked as “confounders” without having been previously identified as such for the relationship under investigation, the results are obviously not a product of stipulated scientific hypotheses.
In the absence of both an established strategy for “common sense” and data that allow appraisal of architectural validity, the most consistent approach is to apply a mathematical adjustment. The mathematical penalty could be harsh for previously unknown exploratory hypotheses, slightly reduced for uncertain hypotheses, and “softened” for hypotheses that were vaguely anticipated or tested repeatedly in sequential evaluations. A more powerful alternative strategy, of course, is to apply rigorous scientific standards to raise quality in designing and evaluating the architecture of the research, rather than using arbitrary mathematical methods to lower the α levels.
25.2.5 Commonsense Guidelines
Regardless of whether and how the foregoing scientific guidelines are applied, several non-mathematical principles of scientific common sense can always be applied.
25.2.5.1 Acceptability of “Proof” — No hypothesis derived from statistical data can ever be regarded as proved by the same data used to generate it. At least one other study, with some other set of data, must be done to confirm the results. For example, in the previously cited erroneous accusation that reserpine was causing breast cancer, a causal claim would probably not have been accepted for publication if supported only by results of the original data-dredged study.43 The claim was published because it had been “confirmed” (at least temporarily) by two other investigations44,45 elsewhere.
25.2.5.2 Check for Scientific Error/Bias — Many unexpectedly “positive” relationships arise not by stochastic chance alone, but from bias and/or inaccuracy in the compared groups and data. The appropriate exploration and suitable management of these scientific errors will do much more to eliminate “false positive” results than anything provided by mathematical guidelines.
25.2.5.3 Demand for “Truthful” Reporting — Many clinical trials and other research projects today are done only after a sample size has been calculated, using selected levels of δ, α, and perhaps β. The previously set levels should preferably be maintained in the analysis and always be cited when the research is reported.
Another aspect of “truthful” reporting would be a citation of all the comparisons that were made before the analysis hit “bingo.” The citation would allow readers to determine whether and how much of a stochastic or other “penalty” should be imposed. For example, in the erroneous coffee-pancreatic cancer study mentioned earlier,5 a reader of the published report could discern that at least six relationships had been explored, but the investigators had probably checked many others before achieving “success” with coffee.
25.2.5.4 Focus on Quantitative Significance — In many reports, the boundaries set for quantitative distinctions are not cited, particularly when the “delta wobble” lowers the initial value of δ so that stochastic significance is claimed for a value of d_o that might not have been previously regarded as impressive. If readers and reviewers (as well as investigators) begin to focus on the quantitative importance rather than stochastic significance of the numbers, many of the existing problems can be avoided or eliminated. The dominant role of sample size should always be carefully considered when analyses of large groups of persons (and data) produce impressive P values (or confidence intervals) for distinctions that may have little or no importance. With emphasis on the quantitative magnitude of the distinction rather than its stochastic accompaniment, many issues can be resolved on the basis of triviality in descriptive magnitude, without the need to adjust levels of α.
(The concept of quantitative magnitude will also require careful consideration if Bayesian methods are used to estimate prior probability, because both the prior and posterior probabilities will involve the “effect size” of anticipated and observed distinctions.)
If quantitative magnitude continues to receive its current neglect, however, investigators may preserve and increase the contemporary emphasis on stochastic confidence intervals and P values. An excellent example of the emphasis was presented in a guideline suggested for dealing with the problem of multiple comparisons: “Any unexpected relationships should be viewed with caution unless the P value is very extreme (say less than 0.001). However, this difficulty does not affect the relationships assessment of which would have been expected a priori. Paradoxically, in a study such as ours, a correlation significant at the 1 in 20 level demonstrating an expected relationship is more reliable than an unexpected correlation significant at the 1 in 200 level.”46
The progress of science and statistics in the 21st century will depend on investigators’ ability to overcome this type of stochastic infatuation.
25.3 Sequential Evaluation of Accruing Data
A different type of multiple-comparison problem, which is almost unique to long-term randomized clinical trials, occurs when the same hypothesis is tested repetitively as data accrue for the results of the trial.
The repetitive testing is usually motivated by both pragmatic and ethical reasons. Pragmatically, the investigators would like to keep the trial as small and short as possible, getting convincing results with a minimum of patients, duration, and costs. Ethically, the investigators worry about continuing a trial beyond the point at which the conclusions are clear.
The ethics of randomized trials are a thorny problem. Some writers47 argue that most randomized trials are unethical because the investigators’ main goal is to get convincing evidence of a superiority that they already believe. In this view, a trial is ethical only when initiated with genuine uncertainty, under the “equipoise” principle48 that none of the compared agents has been demonstrably superior (or inferior). Most investigators, using their own concept of the term demonstrably, can usually justify the equipoise belief; but if the belief is altered by accruing results that show one of the agents to be unequivocally better (or worse), the investigators may want to stop the trial promptly before it reaches the scheduled ending.
25.3.1 Sequential Designs
The challenge can be approached with a sequential design, which makes advance plans to stop the trial as soon as a definitive conclusion emerges.
The general statistical method called sequential analysis was first developed during World War II by Wald49 in the U.S. and by Barnard50 in the U.K. The method was initially used as a quality control test in the manufacture of artillery and was later applied in other industrial circumstances. The method relied on repeated sampling, however, and the theory required suitable modification by Armitage51 to become pertinent for the single group of people who enter a randomized clinical trial.
In the plans of a sequential design, the patients are arbitrarily arranged in pairs as they successively enter the trial. After the compared treatments, A and B, have been randomly assigned and carried out for each pair of patients, the examined results may show either a tie or that A or B is the “winner.” As secular time progresses, an accruing tally of scores is kept in a graph on which the X axis is the number of entered pairs. The tally on the Y axis goes up or down for each A or B winning score. For a tie, the X axis advances by one, with no change in the level of Y.
Before the trial begins, special “outer” boundaries are statistically calculated for stopping the trial with a “significant” result when the accruing tally in favor of A or B exceeds the corresponding upper or lower boundary. The trial can also be stopped for having “insignificant” or “inconclusive” results if the tally crosses a separate “inner” boundary before reaching any of the outer boundaries of “significance.” In certain drawings, the sloping outer and inner margins of the graphical pattern, shown in Figure 25.1, resemble a “pac-man” symbol.
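The bookkeeping of the accruing tally can be sketched as follows. The flat numerical boundaries here are hypothetical simplifications; a real design, such as the one in Figure 25.1, uses sloping boundaries computed from α, β, and θ.

def sequential_tally(pair_results, upper=10, lower=-10, max_pairs=40):
    # pair_results: 'A', 'B', or 'tie' for each successive pair of patients.
    # upper/lower/max_pairs are hypothetical stand-ins for the sloping
    # outer and inner boundaries of a real sequential design.
    y = 0  # accruing excess of A-wins over B-wins
    for n, result in enumerate(pair_results, start=1):
        if result == "A":
            y += 1
        elif result == "B":
            y -= 1
        # a tie advances n (the X axis) without changing y
        if y >= upper:
            return n, "stop: A significantly better"
        if y <= lower:
            return n, "stop: B significantly better"
        if n >= max_pairs:
            return n, "stop: inconclusive (inner boundary)"
    return len(pair_results), "trial still open"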
Despite the appeal of letting a trial be stopped at the earliest possible moment with the smallest necessary sample size, the sequential-design method is seldom used. The main problem is feasibility. To find a prompt “winner” for each pair of patients, the outcome event must be something that occurs soon after treatment. The type of therapy that can be tested is therefore restricted to agents used in treating acute pain, acute insomnia, or other conditions where the outcome can be determined within a relatively short time, such as 24 hours after treatment. In the trial52 shown in Figure 25.1, the investigators applied an unusual cross-over design to compare an active agent vs. placebo in patients with a known high attack rate of (“catamenial”) epilepsy associated with the menstrual cycle.
An additional problem in sequential designs is that reliance on a single outcome event — such as relief of symptoms — may produce stochastic significance for that event, but not for associated manifestations that the investigators would also like to examine. Furthermore, unless the treated condition has a remarkably homogeneous clinical spectrum, the arbitrary pairing of consecutive accruals may produce comparisons for patients with strikingly different baseline severity of the clinical condition. Although the randomization should eventually equalize the total disparities of severity in each pair, the trial may be stopped before enough patients have been accumulated to allow either a balanced distribution of severity or an effective comparison of treatment in pertinent clinical subgroups.
FIGURE 25.1 Sequential analysis design for comparison of clobazam and placebo in the suppression of seizures associated with menstruation (2α = 0.01, 1 − β = 0.95, θ_1 = 0.95, N = 33; the vertical axis shows the excess of preferences, with “Clobazam better than placebo” above and “Placebo better than clobazam” below, and the horizontal axis shows the number of preferences). [Figure taken from Chapter Reference 52.]

25.3.2 N-of-1 Trials

An interesting variant of sequential designs is the N-of-1 trial, which was introduced53 to help choose optimum treatment for a single patient. If the patient has a chronic condition (such as painful osteoarthritis) or a frequently recurrent condition (such as dysmenorrhea or migraine headaches), and if several treatments are available for the condition, practicing clinicians have usually tried the treatments sequentially. Agent B was given if Agent A failed to be successful, and Agent C if Agent B failed. The N-of-1 trial provides a formal evaluative structure — with randomization, double-blinding, etc. — for appraisals that were previously done informally and sometimes inefficiently.
N-of-1 trials are limited, however, to clinical situations where the patient recurrently reaches the same baseline state before each treatment, and where the outcome does not take an inordinately long time to occur. Aside from issues in feasibility, the main statistical challenges in the trials are to choose the phenomenon that will be evaluated quantitatively as the main “outcome,” and to decide whether the evaluation will depend on a simple preference for Agent A vs. Agent B, on purely descriptive evidence, or on tests of stochastic significance for the accrued data.54
25.3.3 Interim Analyses
Because sequential designs are seldom feasible and because N-of-1 trials are limited to individual patients, the long-term efficacy of many therapeutic agents is tested in conventionally designed randomized trials, with an advance calculation of sample size. As the trial progresses, however, the investigators then do interim analyses (often called “peeks at the data”) to see what is happening.
25.3.3.1 Quantitative and Stochastic Phenomena — The “premature” demonstration of superiority for one of the compared treatments may involve quantitative or stochastic distinctions. The prematurity has a quantitative source if the compared difference in rates of good (or bad) outcomes is unexpectedly much greater than anticipated. With a particularly large incremental difference, the quantitative distinction between the treatments will be stochastically significant much sooner than was expected in the original plans for sample size and secular duration.
The prematurity has a stochastic source, however, when “significance” is obtained by the progressive enlargement of group sizes. With this enlargement, the same quantitative difference in the same outcome event at the same serial time (such as 6-month survival) that was “nonsignificant” in an early secular comparison may become stochastically significant in a later comparison that included many more people. This phenomenon can arise because the participating patients in long-term randomized trials are seldom admitted all at once. Instead, they are accrued over a secular period of calendar time. Thus, if the plan calls for 500 patients to be admitted, 200 may be recruited in the first calendar year of the trial, 200 more in the second year, and the last 100 in the third year. A comparison of 1-year survival rates for the treatments will therefore include enlarging groups at successive secular dates; and results for 1-year survival of all 500 patients cannot be obtained until at least the fourth calendar year of the trial.
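The inflation produced by repeated unadjusted testing on accruing data is easy to demonstrate by simulation. In the sketch below (standard-library Python only, with invented accrual points), the same null hypothesis is tested at five “peeks”; the chance of at least one nominally “significant” result substantially exceeds the per-test α = .05.

import math, random

def two_sided_p(z):
    # Two-sided P value for a standard-normal test statistic.
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(trials=2000, looks=(100, 200, 300, 400, 500),
                        alpha=0.05, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total, n_done = 0.0, 0
        for n in looks:
            # accrue new observations under the null (true difference = 0)
            total += sum(rng.gauss(0, 1) for _ in range(n - n_done))
            n_done = n
            z = total / math.sqrt(n)   # standard normal under the null
            if two_sided_p(z) < alpha:
                hits += 1
                break                  # trial "stopped" for significance
    return hits / trials

print(false_positive_rate())  # roughly .14 rather than .05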
Another source of premature “significance” is both quantitative and stochastic. As noted in Chapter 23, the Neyman-Pearson calculation for “double significance” will produce a sample size much larger than what is needed if the observed d_o exceeds the anticipated “big” δ. With the “inflated” sample size, however, a value of d_o much smaller than δ can become stochastically significant. As the group sizes accrue in the trial, stochastic significance can be reached with d_o < δ, while the group size is also still smaller than the originally scheduled N.
25.3.3.2 Problems in Interim Examinations — To determine whether a stochastically significant difference has occurred in the accruing data, the investigators may regularly examine the interim results at sequential intervals of calendar time.55 If the observed quantitative distinction, d_o, is unexpectedly higher than the anticipated δ, stochastic testing is easily justified. On the other hand, if the original value of δ is preserved, and if sample size was originally calculated for “single significance,” and if d_o is required to reach the level of δ, stochastic significance cannot be obtained until the full sample size is accrued.
The main problems arise when sample size has been calculated with the Neyman-Pearson strategy for “double” significance. Because the interim analyses are almost always done to test the original null hypothesis (which does not specify a value for δ), the “inflated” value of N can readily allow stochastic significance to occur with d_o < δ, long before the planned numbers of patients are fully accrued. The frequency of the “delta wobble” problem and the interim stochastic tests could both be substantially reduced if a preserved level of δ were demanded for any declaration of “significance.” As noted in Chapter 23, however, investigators usually lower the original value of δ to correspond to whatever smaller value of d_o produced stochastic significance. Consequently, with the prime quantitative distinction neglected, the main issue is no longer “scientific.” It becomes mathematically relegated to a stochastic mechanism for adjusting the boundary of α.
25.3.3.3 “Stopping Rules” — A series of current proposals for adjusting α are called “stopping rules” for clinical trials.56,57 The rules contain limitations on the number of interim analyses, and offer boundaries for the reduced α′ to be used at each analysis. With the Haybittle–Peto technique,58,59 α′ is kept at a strict constant level (such as .001) for all the interim analyses, but returns to the original α (such as .05) at the conclusion if the trial is not ended prematurely. In Pocock’s proposal,60 which resembles a Bonferroni adjustment, α′ is set according to the number of interim analyses. With five analyses, α′ would be 0.016 to achieve an overall α of .05 at the end of the trial. With the O’Brien–Fleming method,61 the α′ levels are stringent for the early tests and more relaxed later. Thus, the successive α′ levels might be .0006 for the first, .015 for the second, and .047 for the third interim analysis.
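A sketch of how such boundaries are applied is shown below. The α′ values are the ones quoted above, not recomputed here; the final Haybittle–Peto entry represents the return to the original α at the scheduled conclusion, and the number of interim looks in that list is an arbitrary illustration.

POCOCK_FIVE = [0.016] * 5                      # Pocock, five analyses
OBRIEN_FLEMING_THREE = [0.0006, 0.015, 0.047]  # O'Brien-Fleming, three analyses
HAYBITTLE_PETO = [0.001, 0.001, 0.001, 0.05]   # strict interim levels, .05 at end

def first_stop(interim_p, boundaries):
    # Return the first analysis (1-based) at which the interim P value
    # crosses its boundary, or None if the trial runs to completion.
    for i, (p, b) in enumerate(zip(interim_p, boundaries), start=1):
        if p < b:
            return i
    return None

# Interim P values of .04, .02, and .01 cross the O'Brien-Fleming
# boundary only at the third analysis.
print(first_stop([0.04, 0.02, 0.01], OBRIEN_FLEMING_THREE))  # 3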
O’Brien and Shampo62 have offered an excellent discussion and illustrations of these different rules, while also citing some ethical (as well as scientific) reasons for continuing a trial until its originally scheduled completion, despite the “stopping rules.”
25.3.3.4 “Stochastic Curtailment” — In another interim-examination technique, called “stochastic curtailment,”63,64 the analysts consider both the observed and unobserved data to decide whether the current conclusions would change if the trial continues to its scheduled completion. The focus is not merely on clear evidence of benefit or harm (as in the conventional stopping rule), but also on lack of “power” to show an effect. Common non-harm–non-benefit events that can lead to stochastic curtailment are inadequate recruitment of patients, and many fewer outcome events than originally expected.
25.3.3.5 “Financial Curtailment” — A different form of curtailment occurs when a study is stopped because the sponsoring agency decides that the continuing costs will not be justified by the anticipated benefits. The early termination of the study may be quietly accepted if the sponsor is a governmental agency (as in termination of the DES cohort study65 by the National Institutes of Health in the U.S.), but may evoke protests about ethical behavior66 if the sponsor is a pharmaceutical company (as in the decision by Hoechst Marion Roussel to stop a European trial of Pimagedine).
25.3.3.6 Additional Topics — An entire issue of Statistics in Medicine has been devoted to 25 papers from a workshop67 on “early stopping rules.” Beyond the phenomena already mentioned here, the papers contained the following additional topics: Bayesian approaches, a trial in which the later results took a direction opposite to that of early results, an “alpha spending” approach for interim analysis, “the case against independent monitoring committees,” and a “lay person’s perspective” on “stopping rules.”
25.4 Meta-Analysis
In meta-analysis, the results of a series of studies on “similar” topics are pooled to enlarge the size of the group and volume of data. The enlarged aggregate is then statistically analyzed to check the same hypotheses that were tested previously, although the larger amount of data may sometimes allow new hypotheses to be tested for subsets of the original “causal maneuvers” (such as different doses of treatment) or for subgroups of patients. A major advantage of the meta-analytic procedure is the opportunity to get “significant” conclusions from research that was previously inconclusive because of small groups or nonconcordant results in individual studies.