
to enlarge rather than reduce sample sizes, because the latter number (413.5) more than triples the 137.8 that emerged from the previous “double-significance” calculation.

The investigator could now be assured, however, that stochastic confirmation for do ≥ .10 would be obtained with the sample size of 154 persons in each group, although it is larger than the 138 persons required when double-significance was determined for an unrealistically enlarged δ = .15. Furthermore, if seeking a big difference, the investigator might decide that 414 persons are not needed in each group because stochastic confirmation would not be wanted for the discouragingly small value of do ≤ .04.

Accordingly, instead of obtaining “double-significance” with 138 persons for an excessively high δ = .15, the trial could be done for single significance with 154 persons and a reasonable δ = .10. Nevertheless, if the investigator really wants to show and confirm a small difference for a realistic boundary of ζ = .04, the trial would require 414 persons in each group. This number would still be smaller, of course, than what would emerge with a “double-significance” calculation using δ = .10. The latter calculation, with .10 rather than .15 in the denominator of Formula [23.10], would produce n ≥ 511.4.
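The arithmetic behind these comparisons can be checked with the conventional formulas for contrasting two proportions. The sketch below is illustrative only: it uses the standard single-significance expression n = Zα²(2p̄q̄)/δ² and the standard Neyman-Pearson “double-significance” expression, which may differ in detail from Formula [23.10]; the baseline proportion p_a = .30 is a hypothetical choice, so the printed results will not reproduce the chapter’s 137.8, 154, or 511.4 exactly.

```python
import math

def n_single(p_a, p_b, z_alpha):
    # Single significance: n = z_alpha^2 * 2*pbar*qbar / delta^2
    delta = abs(p_b - p_a)
    pbar = (p_a + p_b) / 2
    return z_alpha**2 * 2 * pbar * (1 - pbar) / delta**2

def n_double(p_a, p_b, z_alpha, z_beta):
    # "Double significance" (Neyman-Pearson): the z_beta term also
    # guards against a false-negative conclusion at level beta.
    delta = abs(p_b - p_a)
    pbar = (p_a + p_b) / 2
    num = (z_alpha * math.sqrt(2 * pbar * (1 - pbar)) +
           z_beta * math.sqrt(p_a * (1 - p_a) + p_b * (1 - p_b)))
    return num**2 / delta**2

Z_ALPHA = 1.96    # two-tailed alpha = .05
Z_BETA = 1.282    # one-tailed beta = .10
for delta in (.15, .10):
    p_a = .30     # hypothetical baseline rate, for illustration only
    print(f"delta = {delta}: single = {n_single(p_a, p_a + delta, Z_ALPHA):.1f}, "
          f"double = {n_double(p_a, p_a + delta, Z_ALPHA, Z_BETA):.1f}")
```

Whatever p values are assumed, the double-significance n always exceeds the single-significance n for the same δ, and both grow roughly as 1/δ², which is why shrinking δ from .15 to .10 inflates the requirement so sharply.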

24.7.3 Choices of α and β

If realistic values for δ and ζ were set before the trial and maintained afterward, the main method for reducing sample size would be to change the values of α and β. In the foregoing double-significance calculations, Zβ was already quite relaxed at a one-tailed β = .1, but the value of Zα was chosen for a two-tailed boundary of .05. If a one-tailed direction is accepted for α, the corresponding sample size would be smaller.

For the two separate calculations of single significance, a separate primary hypothesis is set for each calculation, and a β level need not be chosen. The foregoing calculations, however, were done with Zα set at 1.96 for a two-tailed test at α = .05. With each hypothesis having a clear direction, a one-tailed test could be used for each calculation, and the sample sizes would be reduced when Zα is set at 1.645 rather than 1.96.
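Because Zα enters the sample-size formulas as a squared factor, the proportionate saving from the one-tailed choice can be computed directly. A minimal sketch, assuming n is proportional to Zα² (exact for the single-significance formula; the saving is somewhat smaller for the double-significance formula because Zβ is unchanged):

```python
z_two = 1.96     # Z for two-tailed alpha = .05
z_one = 1.645    # Z for one-tailed alpha = .05

# With n proportional to z_alpha^2, the one-tailed test shrinks the
# required group size by this factor:
factor = (z_one / z_two) ** 2
print(round(factor, 3))        # 0.704, i.e., roughly a 30% reduction
print(round(154 * factor))     # e.g., 154 persons per group -> about 108
```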

24.7.4 Resistance to Change

Having become thoroughly entrenched, the “double-significance” approach will probably become another instance of an outmoded paradigm that is resistant to change. The resistance will probably be aided by the reluctance of clinical investigators to accept responsibility for setting quantitative boundaries for “big” and “small.” Unlike α and β , which can be designated arbitrarily regardless of what is happening in the research, the appropriate values of δ and ζ will often require ad hoc choices based on substantive content. If both the clinical investigators and statistical consultants, however, are willing to acknowledge that the prime scientific decisions depend on quantitative magnitudes rather than stochastic probabilities, the clinico-statistical collaborators can work together to set appropriate boundaries.

The consequences will be an enlightened improvement in statistical features of both the planning and reporting of medical research. Sample sizes can be determined according to what the investigator wants to show; the sizes will usually be smaller than under the “double-significance” paradigm; and readers of the published reports will be protected from possibly deceptive claims of “significance” that have been adapted to fit the observed results rather than to corroborate well-made plans. The new process would even be consistent with the recommendations (cited in Section 23.9.5) that were offered by both Ronald Fisher and Egon Pearson after they thoughtfully reconsidered and changed their original proposals, respectively, for testing “significance” and for determining “double significance.”

24.8 Evaluating All Possible Outcomes

The three-zone approach would also allow development of a new statistical strategy that emphasizes what the investigator wanted to achieve when the research project was planned. With this strategy, any project can have at least eight possible outcomes, formed by two states for each of the following phenomena: the initial desire (or hope) is to find that do is either big (≥ δ) or small (≤ ζ); the observed result can be either desired or not desired; and the result can be either stochastically confirmed or not confirmed. The desired, observed, and stochastic results can then be evaluated as noted in the next two sections.

24.8.1 Large Value Desired for do

When a large value is desired for do , the possible findings and conclusions are shown in Table 24.2. In these four situations, the investigator wants or expects the observed result to be big. If the observed result comes out as expected, the stochastic tests will either confirm the scientific hypothesis or show that the study had defective capacity in group sizes. If the observed result is contrary to expectations, the disappointed investigator might be either comforted by the possibility that the result was a stochastic variation or distressed by having to reject the original scientific hypothesis as being probably wrong. (The wrong result, however, might have been produced by bias, rather than by an erroneous scientific hypothesis.)

TABLE 24.2
Stochastic Results and Conclusions When Large Value Is Expected for do

Was Result   Observed             Stochastic Result for
Desired?     Descriptive Result   Confidence Interval     Conclusion
YES          do ≥ δ               0 excluded              Scientific hypothesis confirmed
YES          do ≥ δ               0 included              Defective capacity in group size
NO           do < δ               δ excluded              Scientific hypothesis is probably wrong
NO           do < δ               δ included              Result may be a stochastic variation

24.8.2 Small Value Desired for do

In a study of “equivalence,” when a small value is desired for do, a similar set of conclusions can emerge under the different numerical circumstances shown in Table 24.3.

TABLE 24.3
Stochastic Results and Conclusions When Small Value Is Desired for do

Was Result   Observed             Stochastic Result for
Desired?     Descriptive Result   Confidence Interval     Conclusion
YES          do ≤ ζ               δ excluded              Scientific hypothesis confirmed
YES          do ≤ ζ               δ included              Defective capacity in group size
NO           do > ζ               0 excluded              Scientific hypothesis is probably wrong
NO           do > ζ               0 included              Result may be a stochastic variation

24.8.3 Subsequent Actions

If the scientific hypothesis is confirmed, the investigator has nothing further to do, except perhaps to arrange publication for the research. If the scientific hypothesis is probably wrong, the investigator can try to find an alternative explanation by identifying cogent sources of bias. If the possibility of bias does not offer a suitable explanation, the original scientific hypothesis might have to be abandoned.

The other two conclusions are caused by statistical problems that can be solved with additional research. Defective capacity can be augmented in an ongoing project, or the research can be repeated with an adequately large sample. If the result might be a stochastic variation, a new attempt can be made to get stochastic confirmation or rejection by repeating the study with an adequate sample size.

24.8.4 Problems with Intermediate Results

The foregoing set of evaluations and conclusions will take care of all eight cited circumstances where do is either ≥ δ or ≤ ζ. A different problem arises, however, if the observed value of do is in the intermediate zone between ζ and δ. Such results are almost always undesired and disappointing for the investigator, who usually wants to show that a distinction is either big or small, but not intermediate.

The intermediate result can be evaluated with both sets of procedures for managing unexpected findings, because data showing that ζ < do < δ will usually be unwelcome, regardless of whether the investigator had a “big” or “small” goal. The four possibilities for the intermediate situation are shown in Table 24.4. The confidence intervals will lead to conclusions that either the result is a stochastic variation or one (or both) scientific hypotheses are probably wrong. If both scientific hypotheses are discarded, the conclusion is that do indeed has an intermediate location, which has been stochastically confirmed. With the latter conclusion, the investigated phenomenon has produced a distinction that is too small to be “big” and too big to be “small.” If the investigator still wants something impressively big or small, some other phenomenon (or explanation) should be explored.

TABLE 24.4
Stochastic Results and Conclusions When do Is in Intermediate Zone

Observed             Stochastic Result for
Descriptive Result   Confidence Interval     Conclusion
ζ < do < δ           δ excluded              Stochastically not “big”
ζ < do < δ           δ included              Possible stochastic variation from “big”
ζ < do < δ           0 included              Possible stochastic variation from “small”
ζ < do < δ           0 excluded              Stochastically not “small”
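The logic of Tables 24.2 through 24.4 can be condensed into a single decision procedure. The sketch below is an illustrative paraphrase, not a formula from the text: the function name, the argument names, and the convention that the confidence interval is supplied as (ci_lo, ci_hi) are all assumptions made here.

```python
def evaluate(goal, d_o, ci_lo, ci_hi, delta, zeta):
    """Paraphrase of the three-zone evaluations in Tables 24.2-24.4.
    goal: "big" or "small" (what the investigator hoped to show);
    d_o: observed difference; (ci_lo, ci_hi): its confidence interval;
    delta, zeta: quantitative boundaries for "big" and "small"."""
    def excluded(v):
        # is the boundary value v outside the confidence interval?
        return not (ci_lo <= v <= ci_hi)

    if zeta < d_o < delta:               # intermediate zone (Table 24.4)
        from_big = ("stochastically not 'big'" if excluded(delta)
                    else "possible stochastic variation from 'big'")
        from_small = ("stochastically not 'small'" if excluded(0)
                      else "possible stochastic variation from 'small'")
        return from_big + "; " + from_small

    desired = (goal == "big" and d_o >= delta) or \
              (goal == "small" and d_o <= zeta)
    if desired:                          # upper halves of Tables 24.2/24.3
        v = 0 if goal == "big" else delta
        return ("scientific hypothesis confirmed" if excluded(v)
                else "defective capacity in group size")
    # result contrary to the goal: lower halves of Tables 24.2/24.3
    v = delta if goal == "big" else 0
    return ("scientific hypothesis is probably wrong" if excluded(v)
            else "result may be a stochastic variation")

# Example: d_o = .08 with boundaries delta = .10, zeta = .04 falls in the
# intermediate zone; the interval (.01, .15) includes delta but excludes 0.
print(evaluate("big", d_o=.08, ci_lo=.01, ci_hi=.15, delta=.10, zeta=.04))
```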

24.9 Conflicts and Controversies

Because testing and confirming equivalence is a relatively new activity, many conflicts and controversies have arisen about almost every component of the procedures.

24.9.1 Clinical Conditions and Measurements

A major controversy erupted in 1997 when the manufacturer of a “brand name” thyroxine product provided sponsorship and then tried to suppress publication of a study in which the corresponding generic products were found to be bioequivalent and therefore “interchangeable.” The controversy24 included issues of academic freedom, financial conflicts of interest, choice of additional data released by the manufacturer, and the advertisement policy of a prominent medical journal. At the level of scientific discourse, however, the main contentions about the claim of equivalence referred to the appropriateness of patients used for the research, the reliability of area-under-the-curve (AUC) calculations, and the adequacy of serum thyrotropin measurements.

24.9.2 Quantitative Boundaries for Efficacy and Equivalence

As noted by Greene et al.,18 many investigators have published claims of equivalence without previously establishing a quantitative boundary for ζ . Instead, equivalence has been claimed after a failed stochastic test for “efficacy.” The results of such studies are defective in both quantitative and stochastic decisions about equivalence.

A different problem arises when different boundaries are established in investigations of the same phenomenon. For example, in comparisons of thrombolytic treatment for acute myocardial infarction, the investigators in the GUSTO III trial25 claimed equivalence when an observed difference of do = .53% was less than the preselected ζ = 1%. In the COBALT trial,26 however, the incremental value of do = .44% was even smaller than the .53% found in GUSTO, but was regarded as “inequivalent” because it exceeded the preset limit of ζ = .40%. The GUSTO investigators25 commented on “the question of an appropriate boundary for the definition of equivalence” and expressed concern that “acceptance of broad statistical definitions of equivalence may compromise previously established benchmarks of therapy.”
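The disagreement is purely a matter of the preset boundary, as a two-line check makes plain (the percentage-point values come from the trial results quoted above; the function is our own illustrative shorthand):

```python
def equivalent(d_o, zeta):
    # equivalence is claimed only when the observed difference
    # stays within the preset boundary zeta
    return d_o <= zeta

print(equivalent(0.53, 1.00))   # GUSTO III: .53% vs zeta = 1%   -> True
print(equivalent(0.44, 0.40))   # COBALT:    .44% vs zeta = .40% -> False
```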

24.9.3 Stochastic Problems and Solutions

In the absence of well-accepted boundaries for both quantitative distinctions and stochastic locations, and without a suitably symmetric logic for the stochastic evaluations, the evaluation of “equivalence” has many uncertainties and produces many difficulties. The three main problems are outlined in Table 24.5.

TABLE 24.5
Stochastic Problems Arising from Absence of Quantitative Boundaries for “Big” and “Small”

Observed do to Be          Location of Stochastic
Stochastically Confirmed   Hypothesis to Be Rejected   Problem
Big                        0 (Null)                    Large sample may confirm small do as “significant”
Small (equivalent)         “Big”                       Anything smaller than “big” may be confirmed as “small”
Small (equivalent)         “Small”                     Huge sample size needed for confirmation

The first two problems produce statistical dissidence: in a test of efficacy, a big sample may confirm a small do as significant; and in a test of equivalence, a relatively large difference may be confirmed as small. The third problem is the excessively large sample size needed for confirmation of equivalence if the primary stochastic hypothesis is set at a small value, such as ζ.

Table 24.6 indicates how a suitable choice of quantitative boundaries can eliminate the problems noted in Table 24.5.

TABLE 24.6
Boundary Solutions to Problems Noted in Table 24.5

Observed do to Be          Location of             Location of
Stochastically Confirmed   Quantitative Boundary   Stochastic Hypothesis   Confirmatory Conclusion
“Big”                      ≥ δ                     0                       Can’t be “big” unless do ≥ δ
“Small”                    ≤ ζ                     δ                       Can’t be “small” unless do ≤ ζ
“Small”                    ≤ ζ                     ζ                       Improper logic: boundary and hypothesis should not be similar

24.9.4 Retroactive Calculations of “Power”

The last topic to be considered before this chapter ends is the retroactive calculation of “power” for an observed do that fails to achieve stochastic significance under the original null hypothesis. Although the proposals for this calculation do not clearly separate defects in power from defects in capacity, the usual assumption is that the observed do is smaller than δ, which is then used as the location of a secondary hypothesis for the power calculation. As Detsky and Sackett27 have pointed out, the location of a “rejectable” δ will vary with the group sizes, so that larger groups can increase power for smaller values of δ than otherwise.

In the Detsky-Sackett illustrations, the power of a contrast of two proportions was determined with a chi-square test, not with the customary Z procedure; and the chi-square test28 was done with a new primary equivalence hypothesis that the difference is ≥ δ. The Detsky-Sackett approach, however, was later attacked by Makuch and Johnson,10 who advocated a Neyman-Pearson technique. Both of these approaches were then denounced by Goodman and Berlin,29 who claimed that “power” is a prospective concept and should not be calculated retroactively. Smith and Bates30 also argued that after a study has been completed, “power calculations provide no information not available from confidence limits” and that “once an actual relative risk estimate has been obtained, it makes little sense to calculate the power of that same study to detect some other relative risk.”

Until this controversial issue receives a consensus solution, the best way of estimating how large do might have been is to determine a suitable confidence interval around do , without getting into disputes about demarcating a post-hoc δ and calculating “power.”
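For two proportions, such an interval is easy to compute. The sketch below uses the standard large-sample (Wald) formula; the function name is our own. Applied to the one-month data of Exercise 24.2.2 below, it reproduces the published interval of 9% (−11 to 28).

```python
import math

def diff_ci(a, n1, b, n2, z=1.96):
    # Standard large-sample (Wald) confidence interval for the
    # difference of two independent proportions, p1 - p2.
    p1, p2 = a / n1, b / n2
    d_o = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d_o, d_o - z * se, d_o + z * se

# One-month data from Exercise 24.2.2: 20/48 vs. 16/48
d, lo, hi = diff_ci(20, 48, 16, 48)
print(f"{d:.2f} ({lo:.2f} to {hi:.2f})")   # 0.08 (-0.11 to 0.28)
```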

References

1. Kirshner, 1991; 2. Armitage, 1971; 3. Zelen, 1969; 4. Anderson, 1993; 5. Chow, 1992; 6. Schulz, 1991; 7. Westlake, 1979; 8. O’Quigley, 1988; 9. Blackwelder, 1984; 10. Makuch, 1986; 11. Wynder, 1987; 12. Feinstein, 1990a; 13. Blackwelder, 1982; 14. Rodda, 1980; 15. Food and Drug Administration, 1977; 16. Jones, 1996; 17. Lindenbaum, 1971; 18. Greene, 2000; 19. Westlake, 1972; 20. Metzler, 1974; 21. Kramer, 1978; 22. Roebruck, 1995; 23. Makuch, 1978; 24. “Bioequivalence,” 1997; 25. GUSTO III investigators, 1997; 26. COBALT investigators, 1997; 27. Detsky, 1985; 28. Dunnett, 1977; 29. Goodman, 1994; 30. Smith, 1992; 31. Carette, 1991; 32. Steering Committee of the Physicians’ Health Study Research Group, 1989; 33. Peto, 1988; 34. Oski, 1980; 35. Pascoe, 1981.

Exercises

24.1. The investigators planning the clinical trial described in Exercise 23.1 realize that their results will be most acceptable if the main endpoint is death rather than improvement in angina. They are unhappy, however, to have to get about 3000 patients to prove the point. Having heard about the potential sample-size savings discussed in Chapter 24, the investigators now want to know how much smaller the sample size might be, with death as the end point, if they ignore the conventional Neyman-Pearson calculations and if, instead, they take precautions both to avoid α error for a “big” difference and to allow stochastic significance to be confirmed if a “tiny” difference is found. For this purpose, δ is set at .02 (a 20% proportionate reduction in death rate) and ζ = .005 (a 5% proportionate reduction). What does the sample size become for the new goal? If the results are somewhat disappointing, what is your explanation?

24.2. In a double-blinded randomized trial in which the control group received isotonic saline injections, the investigators concluded that “injecting methylprednisolone into the facet joints is of little value in the treatment of patients with chronic low back pain.”31 To be eligible for the trial, the participants were required first to have had low back pain, lasting at least 6 months, that was substantially relieved within 30 minutes of a lidocaine injection in the facet joint space. Two weeks later, the back pain should have returned to at least 50% of its pre-lidocaine level. Patients meeting these requirements were then entered in the trial.

24.2.1. After deciding that “significant benefit” would be the outcome for “only patients who reported very marked or marked improvement,” the investigators “calculated that a sample size of 50 patients per group would be adequate at 80 percent power to detect at the 5 percent level of significance [by one-sided test] an estimated improvement in 50 percent of the patients given corticosteroid [and in] … 25 percent of those given placebo.” Demonstrate the calculation that you think was used to obtain the estimate of “50 patients per group.”

24.2.2. In the published report, very marked or marked improvement was noted as follows in the two groups:

Time         Methylprednisolone   Placebo         Difference and 95% CI
One month    42% (= 20/48)        33% (= 16/48)   9% (−11 to 28)
Six months   46% (= 22/48)        15% (= 7/47)    31% (14 to 48)

In reaching the stated conclusion, the investigators believed that the six-month differences could be ignored because the methylprednisolone group received more “concurrent interventions,” and because sustained improvement from the first month to the sixth month occurred in only 11 patients in the prednisolone group and in 5 of the placebo group.

Do you agree that these results justify the conclusions that “injections of methylprednisolone…are of little value in the treatment of patients with chronic low back pain”?

24.2.3. What aspect of the clinical and statistical design and analysis of this trial suggests that the trial was not an appropriate test of the hypothesis?

24.3. Exercise 10.1 was concerned with two clinical trials devoted to the merits of prophylactically taking an aspirin tablet daily (or every other day). In Exercise 14.4.1, you concluded that the U.S. study had an excessive sample size. Can you now apply a formula and appropriate assumptions re α, β, δ, etc. that will produce the sample size used for the trial in Exercise 14.4?

24.4. In a controlled pediatric trial of gastrointestinal symptoms produced by iron-fortified formulas for infants,34 the mothers reported cramps for 41% (= 20/49) of infants receiving formula with no iron, and for 57% (= 25/44) of those receiving iron-fortified formula. The investigators concluded that “our study failed to provide any evidence for the commonly held belief that iron-fortified formulas produce gastrointestinal side effects in infants.”

In a subsequent letter to the editor, titled “Was It a Type II Error?” the writer35 claimed that the observed difference was “clinically important” but underpowered. According to the writer, “β is > 0.5” for the observed difference, but a “larger sample (about 150 per group) would have probably (β = 0.35) generated a significant 16% difference.”

24.4.1. Do you agree with the investigators’ original conclusions?

24.4.2. Do you agree with the basic dissent in the letter to the editor (i.e., that the investigators haven’t proved their claim)? If you agree with the dissent, do you agree with the way it has been expressed? If not, suggest a better expression.

24.4.3. Do you agree with the dissenter’s claim that stochastic significance would require about 150 per group, and that such a size would be associated with β = 0.35? Show the calculations that support your answer.


25

Multiple Stochastic Testing

CONTENTS

25.1 Formation of Multiple Comparisons
25.1.1 Example of Problem
25.1.2 Architectural Sources
25.1.3 Analytic Sources
25.2 Management of Multiple Comparisons
25.2.1 Stochastic Decisions
25.2.2 “Permissive Neglect”
25.2.3 Scientific Validity
25.2.4 Specification of Scientific Hypotheses
25.2.5 Commonsense Guidelines
25.3 Sequential Evaluation of Accruing Data
25.3.1 Sequential Designs
25.3.2 N-of-1 Trials
25.3.3 Interim Analyses
25.4 Meta-Analysis
25.4.1 Controversies and Challenges
25.4.2 Statistical Problems
25.4.3 Current Status
References
Exercises

All of the previously discussed stochastic analyses were concerned with appraising results for a single scientific hypothesis. The scientific goal, the observed result, and the stochastic test may or may not have been in full agreement (as discussed throughout Section 24.8), but the testing itself was aimed at a single main scientific hypothesis.

This chapter is concerned with three situations that involve multiple stochastic testing. It can occur for different hypotheses in the same set of data, for repeated checks of accruing data for the same hypothesis, or for new tests of aggregates of data that previously received hypothesis tests. In the first procedure, which is often called multiple comparisons, the results of a single study are arranged into a series of individual contrasts, each of which is tested for “significance” under a separate stochastic hypothesis. In the second procedure, often called sequential testing, only a single hypothesis is evaluated, but it is tested repeatedly in a set of accumulating data. In the third procedure, which occurs during an activity called meta-analysis, a hypothesis that has previously been checked in each of several studies is tested again after their individual results have been combined.

25.1 Formation of Multiple Comparisons

The first part of the chapter is concerned with the controversial problem of multiple comparisons, which warrants a more extensive discussion than the brief outline offered in Section 11.10.


25.1.1 Example of Problem

To illustrate one aspect of the problem, suppose four different therapeutic agents — A, B, C, and D — are tested in the same randomized clinical trial. When the trial is over, the results for a single outcome, such as “success,” are compared in each pair of groups: A vs. B, A vs. C, A vs. D, B vs. C, B vs. D, and C vs. D. Because k groups can be paired in k(k − 1)/2 ways, 6 pairs of comparisons can be done for the 4 groups.

If α is set at .05, and if the null hypothesis is correct that all four agents in the trial are essentially similar, each “positive” comparison has a .05 chance of being a false-positive result, and a .95 chance of being truly negative. For any two comparisons, the chances are .95 × .95 = .90 that both will be truly negative and 1 − .90 = .10 that at least one of the two comparisons will be falsely positive. For six comparisons under the null hypothesis, the chance is (.95)^6 = .735 that all six will be negative, and 1 − .735 = .265 that a false positive result will emerge somewhere in the group of six tests. Thus, although set at .05 for each comparison, the operational level of α for the total of six pairs of comparisons becomes elevated to .265.

If each pair in a set of k comparisons has a 1 − α chance of being truly negative, the overall chance of getting a true negative result is (1 − α)^k. The chance of getting a false positive result somewhere in the set becomes 1 − (1 − α)^k. Thus, with α designated at .05 for each of 20 comparisons, the chance that at least one of them will be positive by stochastic variation alone is 1 − (.95)^20 = 1 − .358 = .642. For 100 comparisons, (.95)^100 = .0059; and the chance of getting at least one false positive result will be .9941. Therefore, even if nothing is really “significant,” stochastic significance can almost surely occur by random chance alone somewhere in the set of data if enough comparative tests are done.
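These calculations are easy to verify. A minimal sketch (the independence of the comparisons is assumed here, just as it is in the text’s arithmetic):

```python
def n_pairs(k):
    # number of pairwise contrasts among k groups: k(k - 1)/2
    return k * (k - 1) // 2

def family_alpha(alpha, n_tests):
    # chance of at least one false-positive result among n_tests
    # independent comparisons, each run at level alpha
    return 1 - (1 - alpha) ** n_tests

print(n_pairs(4))                             # 6 contrasts for 4 groups
for n in (6, 20, 100):
    print(n, round(family_alpha(.05, n), 4))  # .2649, .6415, .9941
```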

Because of this problem, the α level for a single stochastic comparison may no longer be pertinent if the data of a particular study are tested in a series of comparisons. The difficulty has received many names. It is most often called the multiple-comparison problem, because the stochastic tests are usually applied to multiple two-group contrasts; but multiple-association has been the label when the multiple testing occurs in correlation or regression analysis. Both of these titles are covered in Miller’s generic but longer name, simultaneous statistical inference,1 for which multiple inference might be a shorter term.

25.1.2 Architectural Sources

Often approached as a purely statistical problem, multiple comparisons can usually be “sensibly” analyzed if their scientific sources are appropriately considered. These sources are the questions asked for data arising from architectural components that can be agents, outcomes, subgroups, and time.

25.1.2.1 Multiple Agents or Maneuvers — The term agent or maneuver can be used for the entity regarded as a possible cause, risk, or impacting factor for the “effect” noted as an outcome event. The maneuver can be examined in a prospective (or “longitudinal”) manner when “exposed” cohort groups are followed in observational studies or randomized trials. The maneuver can also be determined in retrospect for the “outcome” groups analyzed in case-control studies and other forms of cross-sectional research.

25.1.2.1.1 Cohort Studies. In clinical trials or in nonrandomized cohort studies, pragmatic convenience may make the investigators examine several agents simultaneously. In the “four-arm” randomized trial discussed in Section 25.1.1, agents A, B, C, and D might have been examined concomitantly because the investigators wanted to learn all they could from the expensive complexity of arranging personnel, laboratory facilities, and other “apparatus” for the trial, without incurring the extra costs and efforts of doing separate trials to compare each pair of agents. Besides, the comparisons might be impeded if the trials were done at several locations (where the patient groups might differ) or at different periods in calendar time.

Another arrangement of multiple agents occurs when a “dose-response curve” is analyzed for parallel groups of patients receiving fixed but different doses of the same substance. A placebo may or may not be used in such studies; a small, almost homeopathic dose of the active agent is sometimes substituted for a placebo.

In the two examples just cited, the multiple agents delineated the groups under study. In certain observational cohorts, however, specific agents have not been deliberately assigned. Instead, the investigators collect information about a series of baseline variables, which can then be analyzed as “risk factors” when pertinent events later occur as “outcomes.” The pertinent outcome events may have been individually identified beforehand, such as coronary heart disease in the Framingham Study,2,3 or may be chosen according to ad hoc topics of interest, as in the Harvard Nurses Study.4

25.1.2.1.2 Case-Control Studies. Retrospective case-control research is a fertile source of stochastic tests for multiple agents. After the studied groups have been assembled according to presence or absence of the outcome event, the investigative inquiries, aimed in a backward temporal direction, can seek evidence of antecedent exposure to many etiologic agents or risk factors. If the results do not incriminate the main agent(s) under suspicion, the investigators may then “screen” all the other possible agents for which information was collected.

Some 20 years ago, this type of screening led to a highly publicized but now discredited accusation that coffee drinking was causing pancreatic cancer. The accusation came from a case-control study5 in which the main etiologic suspicions were originally directed at tobacco and alcohol. After these suspects were “exonerated” by the data, the investigators explored an unidentified number of additional agents, from which coffee emerged as having a “statistically significant” association with the cases of pancreatic cancer. The “significance” was not corrected for multiple comparisons,6 however, and the association was refuted when the research was repeated by the original investigators.7

Multiple comparisons are easily and regularly arranged in case-control studies of “risk factors” for etiology of disease. The investigators can “round up the usual suspects” by collecting information about diverse features of demographic status (age, sex, education, income, social status), past medical history (childhood and other previous diseases), family medical history (diseases in relatives), environmental exposure (pets, atmospheric pollution, travel abroad, home conditions), occupational exposure (fumes, chemicals), and personal habits (smoking, alcohol, dietary components, physical activities, cosmetics, hair dyes). In the usual questionnaire for such studies, more than 100 candidate variables can readily be assembled and then checked for their statistical relationship to the occurrence (or nonoccurrence) of the selected outcome disease.

25.1.2.2 Multiple Outcomes — Multiple outcome events will occur in any cohort study, regardless of whether the “causal” maneuvers are self-selected or assigned as part of a randomized trial. To satisfy regulatory agencies, reviewers, or readers and to provide a specific focus in design of the research, one of the many possible outcomes is usually designated as the prime target. The many other outcome events, however, are still observed, recorded, and then available as data that can be examined for purposes of confirmation, explanation, or exploration.

In a confirmatory examination, the desired primary outcome was achieved; and the confirmation provides additional, consistent evidence. For example, if imaging evidence shows that a thrombus dissolved or disappeared as the primary goal of treatment, additional laboratory evidence of clot-dissolution effects would be confirmatory. In an explanatory examination, the additional evidence would help explain why the primary effect occurred. For example, a reduction in fever and white blood count could help explain the disappearance of pain or malaise in patients with a bacterial infection.

In an exploratory examination, however, the desired primary outcome was not achieved. The investigator may then check the additional outcome events, hoping to find some other evidence of a beneficial (or adverse) effect. Thus, if mortality was not reduced with an anti-leukemic treatment, the investigators may search for efficacy in evidence of remission, reduction in number of white cells, or other selected targets.

25.1.2.3 Multiple Agents and Outcomes — In a case-control study, the diverse risk-factor candidates can be explored only in relation to the specific disease chosen for the cases; but in a cohort or cross-sectional survey, a diverse collection of diseases (or other conditions) can also be available as outcome events.

An extraordinary opportunity for multiple comparisons occurs in collections of data where the investigators can check simultaneously for multiple agents and multiple outcomes. Thus, if adequate data are available for 100 candidate “risk” variables and 50 outcome diseases, 5,000 relationships can be tested. Opportunities for these “mega-comparisons” arise when information is collected in a medico-fiscal claims system or in special data banks for pharmacologic surveillance of hospitalized patients. In both types of data banks, the investigators will have information about multiple therapeutic agents that can then be explored as etiologic precursors of multiple diseases.

The exploration is sometimes called a “fishing expedition,” “mining the data,” or “dredging the data bank.” If data are available for 200 therapeutic agents and 150 diseases, 30,000 relationships can be explored. With α set at .05 for each relationship, a “positive” result could be expected by chance alone in 1500 of the explorations, even if no true relationships exist.
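Under a global null hypothesis, the expected harvest of such an expedition is simply the number of tests multiplied by α; a two-line check of the text’s figure:

```python
m, alpha = 30_000, 0.05     # relationships explored, per-test level
print(m * alpha)            # 1500.0 false positives expected by chance alone
print(1 - (1 - alpha) ** m) # effectively 1.0: some "findings" are certain
```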

25.1.2.4 Multiple Subgroups — If nothing exciting emerges from all the other possibilities, the relationships of agents and outcomes can be checked in different subgroups. The subgroups can be demarcated by demographic attributes such as age, sex, race, occupation, religion, and socioeconomic status; by medical attributes such as stage of disease and laboratory tests; and by geographic regions.

For example, in a large case-control study of the relationship between saccharin and bladder cancer,8 the investigators were apparently disappointed to find an overall odds ratio of essentially 1. After multiple subgroup analyses, however, a significantly elevated odds ratio was found in two subgroups: white men who were heavy smokers and nonsmoking white women who were not exposed to certain chemical substances.

Probably the greatest opportunity to explore multiple subgroups occurs if the analyst has access to data collected for different geographic regions, such as individual nations or parts of a nation. For the United States, the nation can be divided into whatever zones the investigator wants to check: individual states (such as West Virginia), large regions (such as “the South”), counties, cities, political districts, or census tracts. If outcome events such as death or disease are available for these zones, and if exposures can also be demonstrated or inferred from concomitant data about occupation, industry, air pollution, or sales of different substances, an almost limitless number of explorations is possible, particularly because the relationships between any outcome and any exposure in any geographic zone can also be checked in multiple demographic subgroups. Explorations in search of a particular relationship have sometimes been called “data torturing,”9 and, when zealously applied, “torturing the data until they confess.”

Aside from the problem of mathematical corrections for “chance” results, the scientific interpretation of the findings is always difficult. Do the different occurrence rates of cancer reflect true differences in the zones demarcated for “cancer maps”10 or differences in the use of diagnostic surveillance, testing, and criteria? Has a true “cluster” or mini-epidemic of disease really occurred at a particular site, or is the event attributable to chance when so many possible sites are available for the occurrence?11,12

25.1.2.5 Temporal Distinctions — Beyond all the explorations of variables for agents, outcomes, subgroups, and regions, a separate set of opportunities for multiple comparisons is presented by time, which can appear in both secular and serial manifestations.

The word secular is regularly used in an ecclesiastical sense to distinguish worldly or “profane” things from those that are religious or sacred; but in epidemiologic research, secular refers to calendar time. The word serial refers to time elapsed for a particular person since a selected “zero time,” which might be randomization, onset of therapy, or the date of admission to an ongoing study. The two concepts appear in a single sentence if we say that the five-year survival rate (serial) has been steadily rising for a particular cancer during the past thirty years (secular).

Secular trends are regularly examined in epidemiologic studies of mortality rates, cancer incidence, etc. for the same single geographic zones. Because the trends are usually reported with a single index of association, major stochastic problems do not occur. The problem of multiple comparisons would
