
incremental magnitude of the observed and expected differences will exceed δ − (−δ) = 2δ. Using the corresponding Zγ, the sample size can then be calculated so that

Zγ = 2δ / √[2π(1 − π)/n]   [23.14]

The value of n becomes

n = Zγ²[2π(1 − π)] / (4δ²)

Because the denominator contains 4δ² rather than δ², the sample size will be about one fourth of what would ordinarily be required with “one-way” significance at Zα.

The gamma strategy will strongly appeal to investigators who would like to reduce the sample sizes (and costs) of clinical trials. Unfortunately, the small group sizes may make the subsequent results fail to achieve stochastic significance for the α and β decisions that continue to be demanded by most reviewers and readers. For example, suppose Zγ is set at 1.645 for a unidirectional .05 level of γ, with δ = .15 and πA = .33 (so that the anticipated πB = .48 and the mean π = .405). The sample size will be

n = (1.645)²(2)(.405)(.595) / [4(.15)²] = 14.5
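This arithmetic can be verified with a short script. The sketch below is not from the book; the function name is illustrative, and π is taken as the mean of the two anticipated proportions, as in the worked example:

```python
def gamma_sample_size(z_gamma, delta, pi_a):
    """Per-group n from n = Zγ² · 2π(1 − π) / (4δ²),
    where π is the mean of the two anticipated proportions."""
    pi = pi_a + delta / 2          # mean of πA and πA + δ
    return z_gamma**2 * 2 * pi * (1 - pi) / (4 * delta**2)

# The chapter's numbers: Zγ = 1.645, δ = .15, πA = .33
n = gamma_sample_size(1.645, 0.15, 0.33)
print(round(n, 1))  # 14.5
```

Doubling this (as the investigator below does) gives the 30 per group used in the subsequent illustration.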

Suppose the investigator, ecstatic about the small sample size, decides to double it and enroll 30 people in each group.

If emerging even better than expected (in the desired direction), the results might show pB = 10/30 = .33 and pA = 15/30 = .50. The pooled value for the common P will be (10 + 15)/(30 + 30); and the standard error will be √[(25/60)(35/60)(60)/((30)(30))] = .127. With a one-tailed Zα set at 1.645, the confidence-interval component will be (1.645)(.127) = .209, and the interval calculated as .17 ± .209 will include both 0 and δ. Although the γ hypothesis could be rejected, the customary α and β stochastic hypotheses would have to be conceded. Despite the quantitatively impressive do ≥ δ, the stochastic claim of “significance” would probably not be accepted.
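These numbers can be reproduced in a few lines (a sketch; the variable names are mine, and the interval is built with the one-tailed Zα = 1.645 used in the text):

```python
from math import sqrt

p_b, n_b = 10 / 30, 30        # observed .33 in group B
p_a, n_a = 15 / 30, 30        # observed .50 in group A
d_o = p_a - p_b               # observed increment, about .17

# Pooled common proportion and the standard error of the increment
P = (10 + 15) / (n_a + n_b)                   # 25/60
se = sqrt(P * (1 - P) * (1/n_a + 1/n_b))      # about .127

half_width = 1.645 * se                       # about .209
lower, upper = d_o - half_width, d_o + half_width

print(round(se, 3), round(half_width, 3))     # 0.127 0.209
print(lower < 0 < upper, lower < 0.15 < upper)  # True True: interval holds both 0 and δ
```

Because the interval contains both 0 and δ = .15, neither the null nor the alternative stochastic hypothesis can be rejected, which is exactly the concession described above.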

23.9.5 Prominent Statistical Dissenters

The Neyman-Pearson “double-significance” strategy, which currently dominates the calculation of sample sizes for clinical trials, has been widely promulgated and accepted despite major reservations by prominent statisticians.

According to David Salsburg,23 “R. A. Fisher was strongly opposed to this formulation. He did not believe that one can think of scientific research in terms of type I and type II errors. This type of thinking, he said, belongs in quality control, where the type I error rate predicts the number of good items rejected and the type II error rate predicts the number of bad items accepted.” Fisher himself24 said that the Neyman-Pearson principles come from “an unrealistic formalism” and “are liable to mislead those who follow them into much wasted effort and disappointment.” Lehmann25 more recently described the conflicts in the Fisher and Neyman-Pearson disputes about testing statistical hypotheses, and has proposed a “unified approach ... that combines the best features of both.”

Kendall and Stuart26 say the “crux of the paradox” in the Neyman-Pearson strategy is that “we can only fix two of the quantities, n, α, and β even in testing a simple Ho against a simple [alternative hypothesis],” but “we cannot obtain an optimum combination of α, β, and n for any given problem.” According to Salsburg,23 Sir David Cox believes that “We do not choose in advance a particular p-value for decision making. Rather, we use p-values to compare a number of different possible alternative hypotheses.” W. E. Deming,27 the “guru” of quality control procedures, contended that “there is no such thing as the power of a test in an analytic problem, despite all the pages covered by the mathematics of testing hypotheses.”

Egon Pearson himself,28 reviewing the basic issues over 30 years later, pointed out that the “Neyman-Pearson contributions” should be regarded not “as some static system” but as part of “the historical process of development of thought on statistical theory.” Without overtly recanting the basic strategy, Pearson nevertheless confessed that “the emphasis which we gave to certain types of situations may now seem out of balance.”

© 2002 by Chapman & Hall/CRC

Despite these caveats from prominent leaders within the statistical profession, however, the ideas seem to have thoroughly triumphed in the world of medical research.

23.9.6 Scientific Goals for δ

The calculation of a doubly-significant sample size has even entered the realm of ethics. According to D. G. Altman,29 “a sample that is too small will be unable to detect clinically important effects … [and is] hence unethical in its use of subjects and other resources.” Advocating a Neyman-Pearson approach, Altman said it will “make clinical importance and statistical significance coincide, thus avoiding a common problem of interpretation.” Arguing against this idea, however, M.R. Clarke30 stated that clinical importance and statistical significance “are two fundamentally philosophic concepts which cannot be made to coincide.”

That the two concepts do not coincide is demonstrated by the frequent absence of statistical attention to an investigator’s scientific hypothesis for the research. If the investigator hoped for a big difference, i.e., do ≥ δ, and if it was found, the next step would be to confirm it stochastically. If the stochastic result was disappointing, i.e., P > α, the power of the “nonsignificant” result would not have to be checked against an alternative hypothesis, because the observed do was already larger than δ. The problem in this case is that the group size was too small, and the numerical deficit could easily be shown with a simple check of capacity.

A test of power would also be unnecessary if the investigator, wanting a big difference, found a disappointingly small one, i.e., do < δ. In this situation, not seeking rejection of the alternative hypothesis, the investigator would be happy to find that δ was included in a simple confidence interval. If the actual magnitude of do is ignored, and if a result is deemed “nonsignificant” merely because P > α, the calculated confidence interval will let both the investigator and the reader see how big the “nonsignificant” difference might have been.

To calculate a specific index of “power,” however, the investigator must choose a value for δ, thereby addressing the challenge of setting a boundary for quantitative significance. Because this challenge has not been vigorously pursued — mainly because investigators have not insisted on it — the mathematical power games are usually played with the proportional-increment values of θ. To evaluate a quantitative distinction by examining only θ, however, is as unsatisfactory as evaluating a distribution of data from its central index only, without considering spread.

Until a better set of guidelines and criteria is developed for choosing δ, calculations of “power” will remain an interesting theoretical exercise that gives a “politically correct” acknowledgment to the Neyman-Pearson mathematical dogma, without the risk of any real scientific thought. As long as the choice of δ remains an intellectually underdeveloped territory, confidence intervals have the major advantage of reasonably answering the question “How big might it have been?” without any arbitrary calculations of “power.”

23.9.7 Choosing an “Honest” δ

Although the use of confidence intervals can avoid decisions about “power” after the research is done, both δ and a mathematical strategy must be chosen to calculate sample sizes before the research begins. If the goal is to show a “small” difference, as discussed in Chapter 24, the usual strategy does not involve Neyman-Pearson reasoning, which is aimed mainly at finding a “big” difference, as in most clinical trials.

If the sample always has a large enough capacity to achieve “single significance” by rejecting the null hypothesis when do ≥ δ, the main question is then whether the investigator will want to claim “significance,” “insignificance,” or neither, if the observed difference goes the other way, so that do < δ. The NeymanPearson strategy seems fundamentally unsatisfactory, therefore, because it substantially inflates the sample sizes needed for “single significance” if do turns out to be ≥ δ; and it does not suitably cope with the scope of possible decisions if do turns out to be smaller than δ, as noted in Chapter 24.


The main scientific challenge is to choose an “honest” value of δ that is fixed in advance and maintained thereafter. It should be the smallest distinction that will still be regarded as quantitatively significant. Thus, if do is deemed impressive at a value of .08 but not at .07, the “honest” δ should be set at .08. In many doubly significant calculations, however, δ has been inflated to levels of .10 or .20 so that the sample size, although also inflated, is kept to a feasible magnitude. Nevertheless, the size may be large enough to confer stochastic significance if do turns out to be .08. With an honest δ set at .08, however, a singly significant sample size, calculated with Formula [14.19] rather than Formula [23.10], will provide ample capacity to do the desired job.
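The gap between singly and doubly significant sample sizes can be sketched numerically. The formulas below are the standard two-proportion approximations (not necessarily the book’s Formulas [14.19] and [23.10] verbatim), and the z values are assumptions for a two-tailed α = .05 and β = .10:

```python
def two_proportion_n(z_total, pi_mean, delta):
    """Per-group sample size: n = z² · 2π(1 − π) / δ²."""
    return z_total**2 * 2 * pi_mean * (1 - pi_mean) / delta**2

z_alpha, z_beta = 1.96, 1.282   # two-tailed α = .05; β = .10 (power .90)
pi_mean, delta = 0.405, 0.15    # the same quantities as in the chapter's example

n_single = two_proportion_n(z_alpha, pi_mean, delta)           # null hypothesis only
n_double = two_proportion_n(z_alpha + z_beta, pi_mean, delta)  # Neyman-Pearson
print(round(n_single), round(n_double))  # 82 225
```

The doubly significant requirement inflates the group size by the factor [(Zα + Zβ)/Zα]², here almost threefold, which is the inflation that an “honest” δ would make unnecessary.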

Nevertheless, investigators who hope to get approval for their proposed clinical trials today and who know that the proposal will be reviewed according to “mainstream” statistical principles, will probably have to continue playing the Neyman-Pearson game until a better and more suitable strategy is created. Perhaps the best guideline for the new strategy was offered by Egon Pearson31 himself: “Hitherto the user has been accustomed to accept the function of probability laid down by the mathematicians, but it would be good if he could take a larger share in formulating himself what are the practical requirements that the theory should satisfy in application.” Establishing those requirements and developing the new strategy are fascinating challenges for collaborative clinical-biostatistical research.

References

1. Jacobson, 1992; 2. Farr, 1852; 3. Elmore, 1994a; 4. Gordon, 1983; 5. Morinelli, 1984; 6. Passey, 1990; 7. Prosnitz, 1991; 8. Benedetti, 1992; 9. Feinstein, 1975; 10. Goodman, 1994; 11. Garber, 1992; 12. Baehr, 1993; 13. Garber, 1993; 14. Neyman, 1928; 15. Fleiss, 1981; 16. Freiman, 1978; 17. Fischer, 1997; 18. Kronmal, 1985; 19. Lipid Research Clinics Program, 1984; 20. Lipid Research Clinics Program, 1985; 21. Cohen, 1977; 22. Schwartz, 1967; 23. Salsburg, 1990; 24. Fisher, 1959; 25. Lehmann, 1993; 26. Kendall, 1973; 27. Deming, 1972; 28. Pearson, 1962; 29. Altman, 1980; 30. Clarke, 1981; 31. Pearson, 1976; 32. Stephen, 1966; 33. Uretsky, 1990.

Exercises

23.1. From the description offered in Section 23.1.1, what do you think produced the bias that made the Literary Digest political poll, despite a huge sample size, yield results dramatically opposite to those of the actual election?

23.2. In a clinical trial of medical vs. surgical therapy for coronary artery disease, the investigators expect the surgical group to achieve results that are 30% proportionately better than the medical group. Two endpoints are available for calculating sample size. The “hard” endpoint is rate of death at two years, which is expected to be 10% in the medical group. The “soft” endpoint is improvement in clinical severity of angina pectoris. This improvement is expected to occur in 70% of the medical group.

23.2.1. Using α = .05 and β = .1 for a Neyman-Pearson calculation, what sample size is required for the “hard” endpoint?

23.2.2. Using the same arrangement as 23.2.1, what sample size is required for the “soft” endpoint?

23.2.3. In view of the smaller samples needed with the soft endpoint, why do you think the hard endpoint is so popular?

23.2.4. What sample sizes would you propose for quantitatively significant results in the hard and soft endpoints if you do not use the Neyman-Pearson approach? How do these sizes compare with what you obtained in 23.2.1 and 23.2.2?

23.3. In a multicentre trial of treatment for acute myocardial infarction,32 the in-hospital mortality rate was 15/100 (15%) in patients receiving propranolol, and 12/95 (13%) in those receiving placebo. Aside from stating that the trial “demonstrated no difference in mortality,” the authors drew no therapeutic recommendations, such as abandoning the use of propranolol for patients with acute myocardial infarction.


FIGURE E.23.5
[Figure: Ninety percent confidence limits for the true percentage difference, (P̂C − P̂T) × 100, for the 71 trials. The vertical bar at the center of each interval indicates the observed value, P̂C − P̂T, for each trial. The horizontal axis runs from −50 (“Favoring control”) to +50 (“Favoring treatment”). Figure taken from Chapter Reference 16.]

23.3.1. Why do you think no recommendations were made?

23.3.2. What is the chance that these results arose from a “parent universe” in which mortality rate with propranolol is actually 10% incrementally below that of placebo?

23.4. Enoximone, a phosphodiesterase inhibitor that is a derivative of imidazole, had been observed to exert an inotropic cardiac effect in hemodynamic studies of patients with moderate to moderately severe (New York Heart Association Class II or III) congestive heart failure. The manufacturers of the drug therefore sponsored a randomized placebo-controlled trial33 to determine whether enoximone, when combined with digoxin and diuretics, improves symptoms and exercise tolerance of such patients.

The disappointing results showed that in 50 patients receiving enoximone and 52 receiving placebo, “there were no significant differences in exercise duration between groups at any time point” and the “symptom scores for dyspnea, fatigue, overall functional impairment and NYHA class were similar in both groups.” In addition, “The dropout rate was significantly higher (P < .05) in the enoximone group than in the placebo group (46% enoximone, 25% placebo).” At the end of the study period, “there were 10 deaths in patients assigned to enoximone (20%)” and “three deaths (6%) in the patients assigned to placebo (P < .05).”


23.4.1. In the discussion section of the report, the investigators considered various pharmacophysiologic reasons why the drug had failed (dose too high, excessive response in placebo group, incomplete exercise testing, etc.) but did not mention any statistical tests for the possibility that the results were wrong.

What would you do to check the possibility that enoximone is really a superior agent and that the differences in drop-out rates and deaths were due to the stochastic fickleness of fate? Illustrate your idea with at least one calculation.

23.4.2. In the published report, the investigators presented a graph showing the mean values of exercise duration at baseline and at 4, 8, 12, and 16 weeks for the enoximone group and for the placebo group. In the legend of that figure, the investigators make seemingly contradictory statements about exercise duration. They say that “there were no significant differences between groups at any time point,” but they also state that there was “a significant (P < 0.05) increase … compared with baseline values.” Are these two statements contradictory? If not, why not? What kind of tests do you think were done to support the two statements?

23.5. When Freiman et al. concluded that 71 “negative” randomized trials were undersized, their often-cited paper16 had a major impact in promoting the current fashion of calculating “doubly significant” sample sizes for randomized trials. Figure E.23.5 shows the plot of confidence intervals for the 71 trials. Do you believe the authors distinguished between two kinds of problems: (1) a sample size too small to reject the null hypothesis for a quantitatively significant difference, and (2) a sample size too small to reject both the null and the alternative hypotheses?


24

Testing for “Equivalence”

CONTENTS

24.1 Delineation of Equivalence
24.1.1 Research Situations
24.1.2 Type of Equivalence
24.1.3 Basic Design
24.1.4 Personal vs. Group Focus
24.1.5 Mensurational Problems
24.2 Quantitative Boundary for Tolerance
24.2.1 Problems and Ambiguities
24.2.2 Use of Single Two-Zone Boundary
24.2.3 Consequences of “Big” Two-Zone Boundary
24.2.4 Use of “Small” Two-Zone Boundary
24.2.5 Directional Decisions
24.3 Stochastic Reasoning and Tests of Equivalence
24.4 Customary Single-Boundary Approach
24.4.1 Procedure for “Big” δ
24.4.2 Example of a “Classical” Study
24.4.3 Procedure for “Small” δ
24.4.4 Alternative Hypothesis for Single Boundaries
24.4.5 Double-Significance (Neyman-Pearson) Approach
24.5 Principles of Conventional Stochastic Logic
24.5.1 Identification of Four Principles
24.5.2 Reversed Symmetry for Logic of Equivalence
24.5.3 Applicability of Previous Logic
24.6 Logical (Three-Zone Two-Boundary) Approach
24.6.1 Calculations for Advance Sample Size
24.6.2 Effect of Different Boundaries for δ and ζ
24.6.3 Exploration of Alternative Hypothesis
24.6.4 Symmetry of Logic and Boundaries
24.7 Ramifications of Two-Boundary, Three-Zone Decision Space
24.7.1 Realistic Modifications for δ and ζ
24.7.2 Effects on Sample Size
24.7.3 Choices of α and β
24.7.4 Resistance to Change
24.8 Evaluating All Possible Outcomes
24.8.1 Large Value Desired for do
24.8.2 Small Value Desired for do
24.8.3 Subsequent Actions
24.8.4 Problems with Intermediate Results
24.9 Conflicts and Controversies
24.9.1 Clinical Conditions and Measurements
24.9.2 Quantitative Boundaries for Efficacy and Equivalence
24.9.3 Stochastic Problems and Solutions
24.9.4 Retroactive Calculations of “Power”
References
Exercises

In all the stochastic discussions thus far, the investigator wanted to confirm something impressive: to show that a quantitatively “big” distinction was accompanied by a satisfactory P value or confidence interval. The ideas about “capacity,” alternative hypotheses, β error, and “power” that appeared in Chapter 23 were all related to the same basic goal. They offered hope for salvaging something favorable if a trial’s results were disappointingly “negative,” with the desired “significance” being obtained neither quantitatively nor stochastically.

Stochastic hypotheses can also be formulated and tested, however, for at least four other goals beyond the aim of confirming something “big.” For the main goal discussed in this chapter, the investigator wants to show stochastically that an observed distinction, do, is “small” or “insignificant,” rather than “big.” For example, the aim may be to confirm that the compared effects of treatments A and B are essentially similar, rather than substantially different. The other three stochastic procedures, which will be discussed in Chapter 25, involve testing multiple hypotheses.

24.1 Delineation of Equivalence

For the alternative-hypothesis procedures in Chapter 23, the investigator began the research hoping to find a large do, which would be either confirmed stochastically under the primary null hypothesis or at least conceded under the alternative hypothesis. The new stochastic procedures to be discussed now, however, have a diametrically opposite main goal. They are directly aimed at finding and stochastically confirming a distinction that is “small” enough to support the idea that the two compared entities are essentially equivalent.

24.1.1 Research Situations

Seeking similarity rather than a difference is the goal in many research situations. An epidemiologic investigator, concerned with “risk factors,” may want to claim that a particular “exposure” is “safe,” i.e., it does not elevate the risk of “non-exposure.” For pharmaceutical “equivalence” in clinical research, the claim might be that a “generic” product has the same effect as the “brand-name” original drug; or that Agent B, which is prepared more conveniently or cheaply than Agent A, is just as effective.

In tests of efficacy for new pharmaceutical agents, the comparative agent has usually been placebo, not only because of its “standard” effect, but also because it avoids having to choose a single comparative agent from among several active competitors. In recent years, however, the use of placebo has been denounced both for ethical reasons (because the patient may be unfairly deprived of an effective agent) and for clinical reasons (because the results of a placebo comparison may not be directly applicable in patient care). Consequently, pharmaceutical efficacy may be increasingly tested in the future with the requirement that a new agent be at least as good as (i.e., equivalent to) an existing active agent.

In a non-pharmaceutical clinical situation, the aim might be to show equal efficacy for a “conservative” therapy. The investigator might want to demonstrate that simple surgery is no worse than radical surgery for treating cancer, that angioplasty gets the same results as bypass grafting for coronary disease, that most patients with acute myocardial infarction can be managed as effectively at home as in the hospital, or that nurse practitioners can work just as well as physicians in giving primary care. A policy planner proposing a new system for lowering the expense of health care may want to get clinical results that are essentially the same as with the old system, while costing less.

Kirshner1 has written a thoughtful review of the many differences in design when the evaluation process is aimed at demonstrating small distinctions for “equivalence,” rather than big ones for “efficacy.” In addition to a major change in the general scheme of stochastic reasoning, the evaluation process is often beset with major sources of ambiguity or confusion in specifying the concept of equivalence itself.


24.1.2 Type of Equivalence

What kind of equivalence is being examined? Is it chemical/physical, therapeutic, biologic, or etiologic? The first type of equivalence refers to such issues as the chemical structure of two pharmaceutical agents or the physical properties of two materials used for surgical sutures. Questions about equivalence for chemical or physical attributes seldom require studies in people, and can usually be answered with laboratory research. The other questions, which are the main topic in this chapter, require testing human subjects.

Therapeutic equivalence refers to effects on an outcome that would be sought in a patient’s clinical care. This outcome might be a laboratory entity, such as a change in antibody titer or blood glucose level, but is often something that is clinically overt, such as symptoms, functional capacity, or survival. Biological equivalence, which is usually called bioequivalence, refers to the bioavailability of a pharmaceutical agent, as measured by various features of its absorption, dissemination, concentration, excretion, or other aspects of pharmacokinetics in the human body. For etiologic equivalence, which can be regarded as a subdivision of biologic equivalence, a particular disease would develop at essentially the same rate in the presence or absence of exposure to a particular “risk factor.”

24.1.3 Basic Design

For studies of therapeutic or biologic equivalence, the basic design of the research often involves complex decisions about population, “schedule,” and crossovers. Some of the many fundamental questions that require scientific rather than mathematical answers are the following:

Population: Should the equivalence of pharmaceutical or other agents be tested in people who are sick or well? If one of those two groups is chosen, can the results be extrapolated to the other group?

Schedule: Should the dosage of agent and duration of effect be evaluated with the same criteria, regardless of whether the agent is ordinarily given once a day, twice a day, in multiple daily doses, in a sustained or “long-acting” form, or in absorption via dermal patch or subcutaneous pellet?

Crossovers: Although crossover plans can probably be routinely justified more easily in healthy than in sick people, how many agents can (or should) be crossed over in a single person? Most scientific comparisons involve two agents, but statisticians enjoy making special crossover plans to compare three or more pharmaceutical agents in each person. The crossover plans, usually called squares,2 are prefixed with such names as Latin, Graeco-Latin, Youden, and lattice. These multi-crossover squares, which were conceived in the agricultural studies that fertilized much of current statistical thinking, are often taught in statistical courses devoted to “experimental design.” Nevertheless, the “square” plans are seldom used in clinical research, because their mathematical elegance almost never overcomes the pragmatic difficulty of conducting, analyzing, and understanding the results of multiple crossovers in people. (A prominent biostatistician, Marvin Zelen,3 has remarked that “the statistical design of experiments … as taught in most schools, seems so far removed from reality, that a heavy dose may be too toxic with regard to future applications.”)

24.1.4 Personal vs. Group Focus

The next basic problem is to choose a personal focus for decisions when the same persons are exposed to different agents. Are the decisions aimed at demonstrating equivalence for individual persons, for specified groups, or for a general average?

Anderson4 has illustrated these three types of bioequivalence with the “cases” shown in Figure 24.1. In the first case, which she calls “switchability,” the same person gets essentially the same result with each of two formulations. In case 2, which she calls “prescribability,” the clinician does not know (and may not care) about individual responses as long as the average response is the same, without excessive


1

2

1

2

1

2

 

Formulation

 

Formulation

 

Formulation

Case 1 Case 2 Case 3

FIGURE 24.1

Levels of outcome in individual patients receiving formulation 1 vs. formulation 2. The pattern of results for equivalence, discussed further in the text, can be called “switchable” (Case 1), “prescribable” (Case 2), or “average” (Case 3). [Figure derived from Chapter Reference 4.]

variability, in the two treated groups. In case 3 of Figure 24.1, however, the variability among both individuals and groups may be too extensive for either formulation to be regarded as “switchable” or “prescribable,” although the average results (such as the means) may still be close enough to allow the two compared formulations to be called “equivalent.”

The statistical procedures used for evaluation, as well as the basic design of the research itself, will vary with the desired type of equivalence.

24.1.5 Mensurational Problems

Yet another problem arises in choosing the prime “outcome” variable for measuring equivalence. In most clinical trials, this choice is relatively easy. The target variable is usually either a binary event, such as survival, or the change in a ranked variable, such as blood pressure or pain. In etiologic studies, the outcome is development of a particular disease.

In pharmacologic kinetics, however, many candidate variables can be measured to assess bioavailability. The variables include area under the time curve (AUC) of blood or plasma concentration, the level of maximum concentration (Cmax), or time to maximum concentration5 (Tmax), as well as such phenomena as plateau time, half-value duration, and several variants of peak-trough fluctuation.6 For antibiotic drugs, effectiveness against specific bacteria can be checked with a minimum inhibitory concentration (MIC), with the time at which MIC is first reached, or with the duration of time for which the concentration remains above MIC.7
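For a concentration-time curve sampled at discrete times, the three most common of these summaries can be computed as follows. The sketch uses made-up plasma concentrations, and AUC is approximated with the trapezoidal rule:

```python
def pk_summaries(times, conc):
    """AUC (trapezoidal rule), Cmax, and Tmax from sampled concentrations."""
    auc = sum((t2 - t1) * (c1 + c2) / 2
              for t1, t2, c1, c2 in zip(times, times[1:], conc, conc[1:]))
    cmax = max(conc)
    tmax = times[conc.index(cmax)]
    return auc, cmax, tmax

# hypothetical plasma concentrations (mg/L) sampled at hours 0-8
times = [0, 1, 2, 4, 8]
conc  = [0.0, 4.2, 6.8, 5.1, 1.9]
auc, cmax, tmax = pk_summaries(times, conc)
print(round(auc, 1), cmax, tmax)  # 33.5 6.8 2
```

A bioequivalence comparison would then contrast these summaries (most often AUC and Cmax) between the two formulations.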

Choosing an appropriate measurement to express a drug’s action or a treatment’s accomplishment is an important scientific issue that is beyond the scope of the statistically oriented discourse here. For the subsequent discussion, we shall assume that an appropriate measurement has been selected, and that the results can be evaluated for their quantitative distinctions in that measurement.

24.2 Quantitative Boundary for Tolerance

Perhaps the most crucial statistical decision is the choice of a quantitative boundary. Because we cannot expect the two compared effects to be exactly identical, a boundary of tolerance must be set to demarcate the zone within which different effects can still be regarded as equivalent. This boundary obviously depends on scientific considerations, but its magnitude—which involves answering the question, “How big is small?”—will affect all the subsequent statistical activities.


The maximum boundary of a “small” increment corresponds to the level of ζ that was briefly considered in Chapters 10 and 23. If an increment is quantitatively small enough for two means to be regarded as “similar” or “equivalent,” we would want XA – XB to be ≤ ζ. For two proportions, the zone would be pA – pB ≤ ζ.

24.2.1 Problems and Ambiguities

The choice of ζ is a major source of quantitative ambiguity. When equivalence is statistically defined as a difference having “negligible practical interest,”8 or being below the “minimum difference of practical interest,”9 the boundary for this difference might be expected to have a small magnitude for ζ. For many years, however, the boundary chosen for “small” in most statistical discussions of equivalence has been the same relatively large δ that was previously used (in Chapters 10 and 23) to demarcate “big.” With the statistical idea that small is the opposite of big, the values of an observed do ≥ δ are big; and the not-big values of do < δ are regarded as small and in the zone of equivalence.

With uncommon exception, most statistical discussions do not use an additional zone to separate “small” from “big.” Even when recognizing that the “traditional statistical framework does not seem appropriate” and that “the observed difference between treatments is relatively small for demonstrating equivalence,”10 the authors may still place the main focus on confidence intervals for which the specific magnitude of a “small” boundary is not demarcated.

In etiologic research, where the statistical comparison is usually expressed as an odds ratio or risk ratio, the magnitude of an “incremental risk” is seldom discussed or considered. Instead, the investigators focus on elevation of the ratio above the “equivalent” value of 1. In this situation, values of δ and ζ respectively refer to the boundaries of large and tiny ratios rather than increments. No overt consensus has developed, however, about the choice of those boundaries. Some epidemiologists, referring to the additional cases of disease that might occur when a huge population is exposed to a “risk factor,” may refuse to consider any elevated risk ratio as tiny or safe. Nevertheless, because of the small numbers of “exposed cases” that are found in many case-control studies of risk, and because of problems in misclassification, in detection, and in enumeration, most epidemiologists will regard ratios below 2 as unimpressive and within the limits of “noise” in the observational system.11 Ratios of 3 or higher, however, are almost always deemed impressive. With the latter criteria, the boundaries for ratios might be set at 3 for δ and at 2 (or something between 1 and 2) for ζ.

The topic of etiologic equivalence for ratios is complex, controversial, and will not be considered further in this chapter. The rest of the discussion is devoted to decisions about increments that have clinical and/or pharmaceutical equivalence.

24.2.2 Use of Single Two-Zone Boundary

Whatever its mathematical merits, the statistical custom of forming a dichotomous two-zone "decision space," using a single boundary of δ, is a drastic departure from the realities of clinical reasoning.12 Clinicians usually think about a three-zone "space," shown in Figure 24.2, in which anything ≥ δ is big, anything ≤ ζ is small or "tiny," and values between ζ and δ are intermediate or inconclusive. The clinical decisions are regularly cited in such trichotomous ordinal scales as too high, normal, or too low; positive, uncertain, or negative; hyperglycemic, euglycemic, or hypoglycemic; tall, medium, or short.

FIGURE 24.2 Two boundaries (ζ and δ) forming three categorical zones for magnitude of observed distinctions: a zone of small or "tiny" distinctions (from 0 to ζ), a zone of intermediate or inconclusive distinctions (from ζ to δ), and a zone of "big" distinctions (beyond δ). [The zones here are for positive distinctions. A similar set of zones could be shown on the left of 0 for negative distinctions.]

With the customary two-zone statistical scheme, however, "equivalence" would receive a relatively large upper boundary if "small" or "equivalent" is regarded as anything less than "big." This approach would let the same big value of δ be used for deciding either that a new active treatment is more efficacious than standard therapy or, conversely, that a "conservative" or "inexpensive" treatment has essentially the same effects.13,14

© 2002 by Chapman & Hall/CRC

The relatively large size of the boundary for "tiny" is evident in a set of FDA guidelines15 for bioavailability studies, which state (according to Westlake7) that "products whose rate and extent of absorption differ by 20% or less are generally considered bioequivalent." Although a proportionate rather than direct increment, the 20% boundary will often allow the upper level of a small difference to be larger than the δ ≥ .15 often used in Chapter 23 as the lower boundary of a big direct increment in two proportions.
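A brief calculation, using hypothetical absorption rates, shows how a 20% proportionate boundary can exceed the .15 direct-increment boundary:

```python
# Hypothetical illustration: a reference product with an absorption rate of
# 0.80 and a test product exactly 20% lower in proportionate terms.
p_reference = 0.80
p_test = p_reference * (1 - 0.20)        # 20% proportionate reduction
direct_increment = p_reference - p_test  # ~0.16, which exceeds delta = .15
```

Thus a difference deemed "bioequivalent" under the 20% proportionate rule can still be a "big" direct increment by the Chapter 23 standard.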

24.2.3 Consequences of "Big" Two-Zone Boundary

When tiny distinctions are allowed big statistical boundaries, several immediate problems occur. The most obvious is that the quantitative criterion is not compatible with ordinary common sense. If someone who is ≥ 72 inches (183 cm.) is regarded as tall, we are forced to say that anyone whose height is < 72 is short. If a fasting blood sugar of ≥ 140 mg/dL is regarded as hyperglycemic, anyone whose level is below 140 would have to be called hypoglycemic. If a diastolic blood pressure of ≥ 90 mm Hg demarcates hypertension, pressures below 90 become hypotensive.

24.2.4 Use of "Small" Two-Zone Boundary

In recent years, a smaller boundary has been proposed10,16 to distinguish zones of equivalence from zones of efficacy. The authors usually employ the same symbol (such as δ), however, whether the boundary is large or small. To avoid ambiguity, the δ symbol will be reserved here for the big value and ζ will be used for the small one.

The use of a small ζ has at least two important consequences. First (as discussed later), it can substantially raise the sample sizes needed for one-boundary–two-zone calculations of stochastic significance. Second, the introduction of a small ζ also allows construction of a new three-zone approach for the calculations.
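The first consequence can be illustrated with the common large-sample approximation for comparing two proportions (a sketch under stated assumptions, not the specific formulas derived elsewhere in the text). Because the boundary enters the denominator as a square, shrinking it from δ = .15 to ζ = .05 raises the required group size ninefold; the anticipated proportion and z values below are illustrative.

```python
def n_per_group(p_bar, boundary, z_alpha=1.645, z_beta=0.84):
    """Approximate per-group sample size for detecting a difference of size
    'boundary' between two proportions, using the common large-sample formula
    n = (z_alpha + z_beta)^2 * 2 * p * (1 - p) / boundary^2.
    All inputs here are illustrative assumptions."""
    return (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / boundary ** 2

n_big = n_per_group(0.4, 0.15)   # boundary set at the "big" delta
n_tiny = n_per_group(0.4, 0.05)  # boundary set at the "small" zeta
# n_tiny is nine times n_big: a threefold smaller boundary, squared
```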

24.2.5 Directional Decisions

Regardless of whether equivalence is given a large or small boundary, the choice of a scientific direction is particularly important, because it is the source of one-tailed vs. two-tailed decisions in testing stochastic hypotheses.

Do we really want to show that Treatments A and B are similar? If so, the evaluation definitely goes in both directions. Using ζ as the boundary of a small difference, we would want to demonstrate that |A − B| ≤ ζ, regardless of whether the results are slightly larger or smaller for A than for B.

In many other situations, however, we may want to show mainly that A, although perhaps better, is not much more effective than B. Alternatively phrased, we want to show that B is almost as good as A. In this situation, where A might be the somewhat better "brand-name" and B the almost-as-good generic product, the goal would be to find that A − B ≤ ζ. If the results unexpectedly show that B seems better than A, we might be pleased, but we would probably not intend to claim stochastically that B is more effective. In another situation, if "active" treatment A turns out to produce a lower success rate than placebo, we might be happy or unhappy (according to the preceding scientific hopes); but if pA for active treatment is lower than pB for placebo, we probably would not do stochastic tests to confirm the higher efficacy of placebo. (Stochastic testing might be done to demonstrate that A is more harmful than placebo, but the "harm" would be measured with some other variable, and tested with information different from what was used stochastically to compare rates of "success.")
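The two-directional "similarity" goal and the one-directional "almost as good" goal can be sketched as follows. The normal-approximation confidence interval and the function name are illustrative assumptions, not a procedure prescribed by the text.

```python
from math import sqrt

def equivalence_check(p_a, p_b, n_a, n_b, zeta, z=1.96, one_sided=False):
    """Sketch of an equivalence-style check on two observed proportions.

    Two-sided ("A and B are similar"): the entire confidence interval for
    p_a - p_b must lie within (-zeta, +zeta).
    One-sided ("B is almost as good as A"): only the upper bound of
    p_a - p_b must stay below zeta; a one-sided z (e.g., 1.645) would
    then be supplied.  The simple normal-approximation standard error
    here is an illustrative simplification.
    """
    diff = p_a - p_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    upper = diff + z * se
    if one_sided:
        return upper < zeta
    lower = diff - z * se
    return -zeta < lower and upper < zeta
```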

On the other hand, when two "active" pharmaceutical agents are compared, direction may be important for the incremental magnitude of tolerance within the zone of equivalence. For example, suppose X is the outcome variable chosen for showing equivalence, and suppose X̄A and X̄B are its mean values in two compared agents, A and B. Suppose further that A is the standard, "brand-name," or customary treatment, whereas B is the perhaps inferior, generic, or less costly competing agent. In the selected variable that measures bioavailability, we would expect X̄A to be greater than X̄B. To demonstrate equivalence, we would therefore want X̄A − X̄B to be less than some relatively small positive value.
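The directional goal for two means can be condensed into a one-sided check; the function name, the z value, and the idea of supplying a precomputed standard error are all illustrative assumptions.

```python
def brand_vs_generic_ok(xbar_a, xbar_b, se_diff, zeta, z_one_sided=1.645):
    """One-sided check for the bioavailability setting in the text: is the
    brand-minus-generic mean difference (xbar_a - xbar_b) reliably below a
    small positive zeta?  se_diff is the standard error of the difference
    in means; all numeric inputs are illustrative."""
    return (xbar_a - xbar_b) + z_one_sided * se_diff < zeta
```

Because only the expected direction (A above B) matters here, no lower bound is examined; an unexpectedly superior B would simply make the check easier to pass.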
