many contagious diseases were ascribed to miasmal vapors. In the mid-20th century, thousands of premature babies were blinded during an iatrogenic epidemic of retrolental fibroplasia,1 caused by excessive oxygen therapy given as prophylaxis in the belief that its potential pulmonary benefits could not be accompanied by adverse effects elsewhere in the body.
23.1.1.3 Erroneous Interpretations — When the hypotheses emerge from the data, a correct set of information may be erroneously interpreted.
A major blunder in confusing association with causation was committed by William Farr (the respected “founder” of Vital Statistics) who concluded, from statistical correlations, that cholera was caused by high atmospheric pressure.2 In the early 20th century, pellagra was erroneously regarded as infectious after it was commonly found in members of a family and in their neighbors.3
23.1.2 Statistical Problems
Common statistical sources of error have been the failure to set quantitative boundaries, to appraise stochastic variation, and to understand dissident results.
23.1.2.1 Boundaries for Quantitative Distinctions — As the use of statistical data became popular in the 20th century, a subtle source of error was incorrect beliefs about magnitude. If Treatment A was believed “better” than B, but turned out “worse,” the scientific hypothesis itself was wrong. If A was only slightly or trivially better, however, the scientific hypothesis was still correct, but the anticipated magnitude of difference was wrong.
To be examined statistically, such concepts as better, worse, or trivially better must be converted to quantitative expressions. The scientific difficulty of choosing these expressions was discussed in Chapter 10. Qualitatively, a particular phenomenon (death, “success,” relief of symptoms, etc.) must be chosen as the focus of attention; and a particular statistical index (increment, direct ratio, proportionate increment, etc.) must be chosen to cite the quantitative distinction, do, observed in the compared treatments. The next statistical step is to demarcate a magnitude that will make this distinction be regarded as “big” or “small.”
The choice of these boundaries is often difficult, but they must nevertheless be established to allow statistical procedures to be used, before the research, for calculating a suitable sample size, and afterward, to decide whether the observed distinctions are impressive or unimpressive enough to warrant stochastic “tests of significance.”
23.1.2.1.1 Demarcations of “Big” and “Small”. Suppose the symbol δ is used for the lower boundary of a big or quantitatively impressive distinction. If Treatment A is expected to be substantially better than Treatment B, the scientific hypothesis might be cited symbolically as
A − B ≥ δ
If Treatments A and B are expected to be essentially equivalent, the difference in results will seldom be exactly zero. Consequently, an upper limit, expressed with the symbol ζ, can be established as the maximum magnitude of a small or quantitatively insignificant distinction. The scientific hypothesis of equivalence, or a tiny difference, could then be quantitatively cited as
A − B ≤ ζ
23.1.2.1.2 Effects of Quantitative Boundaries. The quantitative boundaries for δ and ζ are arbitrary, but no more arbitrary than the level of .05 usually set for the boundary of α in stochastic tests. With two sets of boundaries available for quantitative and stochastic decisions, however, a gallery of statistical errors becomes possible. Most of this chapter is concerned with possible errors in stochastic decisions when the investigator wanted to find a “big” distinction, i.e., do ≥ δ. The stochastic examination of “small” distinctions, i.e., the confirmation of “equivalence,” is discussed in Chapter 24.
23.1.2.2 Stochastic Variations — Stochastic variation refers to phenomena that can occur during the action of random chance. Wrong conclusions can occur if these variations are not recognized and suitably accounted for. For example, someone who wins the main prize in a lottery might become rich, but would not immediately be regarded as a talented selector of random numbers. In fact, if the same person wins again and particularly a third time, we might believe something is wrong with the lottery process. On the other hand, if a 12 is tossed with two dice, the chance probability of the occurrence is (1/6)(1/6) = 1/36 = .028, but we would not immediately reject the idea that the dice are “fair.” The event would regularly happen among a large series of consecutive tosses at a gambling casino’s dice table.
23.1.2.2.1 “False Positive” Conclusions. The customary “tests of statistical significance” are done to avoid erroneous “false positive” conclusions that might arise merely from stochastic variation; and α is set at the level of acceptable errors. Thus, an impressively large incremental success rate could readily arise by stochastic variation alone if two treatments, A and B, are actually equivalent, but produced pA = 8/13 = .615 and pB = 4/12 = .333, with do = .282, in a study done with small groups.
For these small numbers, the stochastic test of the null hypothesis is best done with the Fisher exact procedure, which would produce 2P = .238. The mathematical principles, however, are easier to illustrate with the Z test. The standard error of the difference is first calculated with Formula [14.11] as
SED = √{[(8 + 4)(5 + 8)] ⁄ [(13 + 12)(13)(12)]} = 0.2
The observed value of Z, designated as Zo, is then calculated as .282/.2 = 1.41, for which 2Po = 0.159. Although this P value is smaller than the result of the Fisher test, neither procedure would lead to rejection of the null hypothesis with α set at .05.
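These calculations are easy to reproduce. The following minimal Python sketch (assuming SciPy is available; the variable names are illustrative, not the book's) applies both the Fisher exact procedure and the Z test with the shortcut SED of Formula [14.11] to the 8/13 vs. 4/12 data:

```python
# A minimal sketch, assuming scipy is installed, of the small-group
# comparison in Section 23.1.2.2.1: pA = 8/13 = .615 vs. pB = 4/12 = .333.
from math import sqrt
from scipy import stats

a_succ, a_fail = 8, 5     # Treatment A: 8/13
b_succ, b_fail = 4, 8     # Treatment B: 4/12

# Fisher exact test on the 2x2 table
_, p_fisher = stats.fisher_exact([[a_succ, a_fail], [b_succ, b_fail]])

# Z test with the pooled "shortcut" SED of Formula [14.11]
nA, nB = a_succ + a_fail, b_succ + b_fail
N = nA + nB
sed = sqrt((a_succ + b_succ) * (a_fail + b_fail) / (N * nA * nB))
z = (a_succ / nA - b_succ / nB) / sed
p_z = 2 * stats.norm.sf(z)

print(f"Fisher 2P = {p_fisher:.3f}")                       # text reports .238
print(f"SED = {sed:.3f}, Zo = {z:.2f}, 2Po = {p_z:.3f}")   # .200, 1.41, .159
```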
23.1.2.2.2 “False Negative” Conclusions. Stochastic variation, however, might also lead to an erroneous “false negative” conclusion. For example, suppose two treatments in a clinical trial really differ substantially (i.e., by at least δ = .15), and suppose at least 25 patients had been entered into each group. If the subsequent results then show that pA = 14/29 = .483 and pB = 12/28 = .429, the value of do = .483 − .429 = .054 is less than δ = .15. This result is not quantitatively significant and is also not stochastically significant with a test in which we first calculate
SED = √{[(14 + 12)(15 + 16)] ⁄ [(57)(29)(28)]} = .132
The value of Zo is then .054/.132 = .409, for which 2Po = .68.
To avoid a false negative conclusion, however, we can first check to see whether the observed small value of do is a stochastic variation, which differs by chance from the true value of δ ≥ .15. An obvious way to check for the latter possibility was discussed in Sections 11.7.2 and 11.9. We determine whether δ = .15 is included in the upper boundary of an appropriate confidence interval around do = .054. With
.132 as the value of SED, the 95% confidence interval is .054 ± (1.96)(.132) = .054 ± .258. It extends from −.205 to .313. The result is not stochastically significant because the value of 0 is included, but the confidence interval also includes the value of .15. The result might therefore be a stochastic variation from a true difference, between treatments A and B, that is actually as large as δ = .15.
Because the confidence interval component, Zα × SED, is .258 here, the value of δ = .15 would be included in the upper part of the confidence interval for all positive values of do. The interval would fail to reach δ only if do < δ − Zα(SED), i.e., only if do were below .150 − .258 = −.108.
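The confidence-interval check can be sketched the same way; the Python fragment below (again assuming SciPy, with illustrative names) reproduces the interval from −.205 to .313 and shows that both 0 and δ = .15 are included:

```python
# A minimal sketch, assuming scipy, of the false-negative check in
# Section 23.1.2.2.2: does the 95% CI around do = .054 still include delta?
from math import sqrt
from scipy import stats

pA, nA = 14 / 29, 29
pB, nB = 12 / 28, 28
do = pA - pB                    # .054
delta = 0.15                    # boundary of a "big" difference

# shortcut SED of Formula [14.11]
sed = sqrt((14 + 12) * (15 + 16) / ((nA + nB) * nA * nB))   # .132

z_alpha = stats.norm.isf(0.025)                 # 1.96 for two-tailed alpha = .05
lo, hi = do - z_alpha * sed, do + z_alpha * sed
print(f"95% CI: ({lo:.3f}, {hi:.3f})")          # (-.205, .313)
print("includes 0:", lo <= 0 <= hi)             # True: not stochastically significant
print("includes delta:", lo <= delta <= hi)     # True: a big difference cannot be ruled out
```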
23.1.2.2.3 Clinical Claims of “No Difference”. The situation just described constantly occurs in medical literature when the investigator gets the observed data, finds that the customary P value exceeds the α level of “significance,” and then concludes that the study had “nonsignificant” results.
If a confidence interval was not published to show how large the “nonsignificant” difference might have been (or sometimes even if the confidence interval was shown), irate readers will regularly send letters to the editor complaining about the omission. The readers usually contend that the group sizes
were too small to prove the claim. The original authors may then respond by citing the confidence intervals (which may often include a “big” result), but offering various justifications for the claim of nonsignificance.
Such arguments have occurred after publication of clinical trials claiming that early discharge from hospital was relatively safe after acute myocardial infarction,4 that glucagon injections did not improve accuracy of a double-contrast barium enema,5 that exchanging unsaturated fats did not affect plasma lipoproteins,6 that thoracic radiotherapy did not prolong survival in patients with carcinoma of the lung,7 and that ritodrine (a beta-adrenergic agonist) was not effective in treating preterm labor.8 If a big difference is included, the upper boundary of the confidence interval can readily be used to justify the contention that such a difference might exist.
23.1.2.3 Statistical Dissidence — The quantitative and stochastic decisions agree if the observed distinction seems quantitatively significant, and if the stochastic test confirms the significance. Statistical dissidence occurs when the two sets of results do not agree, so that significance is found quantitatively but not stochastically, or stochastically but not quantitatively. The conclusion will be wrong if a correct quantitative distinction is ignored in favor of the contradictory stochastic result.
The quantitative-yes–stochastic-no type of dissidence was frequently noted in scientific literature as stochastic tests became increasingly used to prevent erroneous conclusions from small groups. The dissidence occurs when the group size is too small to allow stochastic confirmation for a big quantitative distinction. Without the stochastic test, an investigator might claim significance when a quantitatively impressive increment of 15% in success rates of 25% vs. 40% came from numbers as small as 1/4 vs. 2/5. “Tests of significance” were introduced and intended to prevent this problem. The opposite type of quantitative-no–stochastic-yes dissidence, which has become increasingly common when stochastic tests are used as the main or only basis for scientific conclusions, is the stochastic proclamation of significance for a small, unimpressive quantitative distinction.
23.1.2.3.1 “Boundless Significance” and Oversized Groups. The enormous impact of size in tested groups was shown in earlier discussions of the Z, t, and chi-square tests. If the groups are too small, an impressive quantitative distinction may not be stochastically confirmed; but if the groups are too big, an unimpressive distinction may become stochastically significant.
The latter type of statistical dissidence can occur because the customary calculation of stochastic significance is “boundless.” The quantitative boundary of δ is neither used nor needed for determining a conventional P value from Zo = do/(SED) or a confidence interval constructed as do ± Zα (SED). If the calculated Zo exceeds Zα, or if the confidence interval excludes 0, the stochastic result can be proclaimed
“significant,” regardless of the actual magnitude of do. For example, in a study that contains more than 2200 persons in each group, the rates of “success” may be 750/2207 = .34 in Group A and 819/2213 =
.37 in Group B. The increment of .03 in the two groups may seem small and unimpressive, but it is stochastically significant at P < .05, because the big groups lead to a suitably large value of 2.1 for Zo.
There is no statistical method to prevent erroneous conclusions in this situation. They can be avoided (or “cured”) only if investigators (and readers) preserve their scientific judgment and examine the actual magnitude of the observed distinction. If it is not big enough to be impressive, it is not “significant” even if the P value is infinitesimal.
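The arithmetic of the large-group example can be verified with a brief sketch (assuming SciPy; the names are illustrative):

```python
# A hedged sketch, assuming scipy, of the "boundless significance" example:
# a trivial increment of .03 becomes stochastically significant in huge groups.
from math import sqrt
from scipy import stats

sA, nA = 750, 2207      # Group A: 750/2207 = .34
sB, nB = 819, 2213      # Group B: 819/2213 = .37

do = sB / nB - sA / nA                         # about .03
sed = sqrt((sA + sB) * ((nA - sA) + (nB - sB))
           / ((nA + nB) * nA * nB))            # pooled shortcut SED
z = do / sed
print(f"do = {do:.3f}, Zo = {z:.1f}, 2P = {2 * stats.norm.sf(z):.3f}")
# do = .030, Zo = 2.1, 2P < .05 -- "significant" despite being quantitatively trivial
```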
23.1.2.3.2 Problems of Undersized Groups. The stochastic dissidence caused when undersized
groups are too small to allow rejection of the null hypothesis was illustrated in Section 23.1.2.2.1. In a
well-conducted clinical trial that produced pA = 8/13 = .615 vs. pB = 4/12 = .333, the investigator could not get stochastic confirmation for the impressively large quantitative increment of do = .282.
This problem, although regularly regarded as a defect in “power” of the trial, is actually due to a simpler defect in what might be called capacity. As noted later, power refers to the ability to reject an alternative hypothesis that the quantitative distinction is large although the observed result may be small. Capacity, however, refers to the ability to reject the original null hypothesis when the observed distinction is large.
23.2 Calculation of Capacity
The statistical dissidence just described occurred because the group sizes were too small. If the investigator had really expected to find an increment as large as .282 between the two treatments, the necessary sample size for stochastic significance at a two-tailed P < .05 could have been calculated with the earlier Formula [14.20], using πB = .333, to get
n ≥ (2)(.333)(.667)(1.96)²/(.282)² = 21.5
At least 22 patients would have been required in each group. With Formula [14.19], for which π would be estimated as (.615 + .333)/2 = .474, the sample size needed for each group would have been
n ≥ (2)(.474)(.526)(1.96)²/(.282)² = 24.1
or at least 25. With either calculation, the actual group sizes of 13 and 12 would lack the capacity to achieve a stochastic 2P < .05.
If δ = .15 had been originally chosen as a boundary for quantitative significance, the sample size required by Formula [14.19] would have been
n ≥ 2(.4065)(.5935)(1.96)²/(.15)² = 82.4
With at least 83 persons in each group, the quantitatively impressive do = .282 would easily have yielded 2P < .05.
The cited stochastic defect in capacity can easily be quantified numerically. The foregoing calculations (where do = .282 and π was estimated as .474) showed that about 25 persons were required for each group, making a total required size of NR = 50. Because the actual group, No, contained 13 + 12 = 25 people, the capacity was approximately No/NR = 25/50 = 50% of what was needed.
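A minimal Python sketch of these sample-size and capacity calculations, using Formula [14.19] with the text's numbers (the variable names are illustrative):

```python
# A minimal sketch of the capacity calculation in Section 23.2, using
# Formula [14.19] with the averaged pi-hat; numbers follow the text.
from math import ceil

z_alpha = 1.96
do = 0.282                      # observed increment: .615 - .333
pi_hat = (0.615 + 0.333) / 2    # .474

n_each = 2 * pi_hat * (1 - pi_hat) * z_alpha**2 / do**2
print(f"required n per group: {n_each:.1f} -> {ceil(n_each)}")   # 24.1 -> 25

N_required = 2 * ceil(n_each)   # 50 in total
N_observed = 13 + 12            # 25 in total
print(f"capacity = {N_observed / N_required:.0%}")               # 50%
```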
The main point to be noted, however, is that the trial under discussion was defective in its basic capacity, not in its power to reject an alternative hypothesis, as discussed shortly.
23.3 Disparities in Desired and Observed Results
In a study that begins with the goal of finding a big difference, do ≥ δ, three possible outcomes can occur. In the first two, the result is “positive,” with do ≥ δ. This “positive” result is then either confirmed stochastically, with Po ≤ α, or not confirmed, with Po > α. In the third situation, the result of the trial is “negative,” with do < δ. In this situation, the investigator hopes that the small do is stochastically consistent with a big δ.
23.3.1 General Conclusions
An observed “positive” big distinction, i.e., do ≥ δ, would be stochastically confirmed if the group size had full capacity, but would not be confirmed, with P > α, if the group size was too small. If the result was “negative,” i.e., do < δ, the small do might be rendered stochastically significant if the group size was huge; but in most reasonable situations, the associated Po would exceed α, and the distinction would be nonsignificant, both quantitatively and stochastically.
In the last situation, however, an investigator who wanted to find do > δ would be delighted if δ were included in the upper end of the confidence interval for do. After savoring the delight, however, a cautious investigator might have a nagging doubt. Suppose the scientific hypothesis is wrong, so that do is really small, rather than being merely a stochastic variation from δ. Worried about this possibility, the investigator may now want some further stochastic reassurance that can prove or confirm “no difference.” Thus, the investigator would ask, “How can I be sure I have not been deluded by random fate? What would be convincing evidence that the treatments have a really small difference?”
23.3.2 Group Size for Exclusion of δ
Scientifically, the immediate answer to the latter questions is “Repeat the trial.” Statistically, however, a numerical solution can be offered. If a suitable confidence interval excludes the “big” value of δ, the demonstration that

do + Zα(SED) < δ          [23.1]

could be reasonably assuring. It would indicate that the “small” value of do is probably not merely a stochastic variation from a true large value of δ.
To determine the group size required for this assurance, we first convert Formula [23.1] to Zα(SED) < (δ − do). Assuming equal group sizes, SED can then be calculated as √[2π̂(1 − π̂) ⁄ n]. After the algebra is developed, the size of n needed in each group will be

n > [2 Zα² π̂(1 − π̂)] ⁄ (δ − do)²          [23.2]
To illustrate the calculation, suppose we assume that do will be .054, as in Section 23.1.2.2.2. We determine π̂ as the average of the estimated pA and pB, which is (.483 + .429)/2 = .456, so that 1 − π̂ = .544. We set Zα = 1.96 and then substitute in [23.2] to get
n > [(1.96)²(2)(.456)(.544)]/(.150 − .054)²

which turns out to be 1.906/(.096)² = 206.8.
Thus, if a trial with 207 patients in each group yields the expected pA = .483, pB = .429, and do = .054, the value of δ = .15 would be excluded from the confidence interval. The investigator would then be able, with 95% confidence, to conclude that Treatment A does not exceed Treatment B by a difference of δ = .15 or more.
If things turn out almost exactly as anticipated in the preceding paragraph, the results with the larger sample size will be pA = 100/207 = .483 and pB = 89/207 = .430. The SED will be
√{[(89 + 100)(107 + 118)] ⁄ [(414)(207)(207)]} = .049
The 95% confidence interval will be (.483 − .430) ± (1.96)(.049) = .053 ± .096; and it will extend from −.043 to .149. With 0 included, the result is not stochastically significant; and with .150 excluded, the result offers stochastic assurance that do is unlikely to be as large as .150. The investigator could now conclude that the original scientific hypothesis was probably wrong. Despite the desired hope, Treatment A is not substantially better than Treatment B.
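The group-size requirement of Formula [23.2], and the verification with 207 patients per group, can be sketched as follows (the numbers are the text's; the code and names are illustrative):

```python
# A minimal sketch of Formula [23.2]: the group size needed for the upper
# 95% confidence boundary around a small do to exclude the "big" delta.
from math import sqrt, ceil

z_alpha = 1.96
delta, do = 0.15, 0.054
pi_hat = (0.483 + 0.429) / 2                   # .456

n = 2 * z_alpha**2 * pi_hat * (1 - pi_hat) / (delta - do)**2
print(f"n per group > {n:.1f} -> {ceil(n)}")   # 206.8 -> 207 per group

# Verification with 207 per group, as in the text
pA, pB, n_grp = 100 / 207, 89 / 207, 207
sed = sqrt(2 * pi_hat * (1 - pi_hat) / n_grp)  # about .049
hi = (pA - pB) + z_alpha * sed
print(f"upper boundary = {hi:.3f}")            # .149, so delta = .15 is excluded
```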
23.4 Formation of Alternative Stochastic Hypothesis
The foregoing tactics brought us into a new type of stochastic reasoning. In everything done until now, we found a “big” difference, do, that was scientifically expected and welcome, so we tried to confirm it stochastically. The stochastic hypothesis, symbolized as ∆, was assumed to be the opposite of what we wanted to prove. We made ∆ as small as possible, i.e., ∆ = 0.
For the new situation, however, we want stochastic confirmation for a “small” difference, i.e., do < δ. We therefore want to reject a different hypothesis, i.e., that ∆ ≥ δ. If δ is excluded from the corresponding confidence interval, we could conclude stochastically that the observed result, do, is indeed smaller than δ.
This conclusion, however, would reverse the original goal of the trial, which was done with the hope of finding do ≥ δ. The reversal is not important for the statistical procedures that follow in the next few
sections, but becomes a crucial feature of the reasoning when we reach the Neyman-Pearson strategy considered in Section 23.6.
23.4.1 Statement of Alternative Hypothesis
The conventional null hypothesis places the value of ∆ at 0 for increments, correlation coefficients, or slopes, and at 1 for a ratio. To avoid frequently repeating the comment about “1 for a ratio,” all null hypotheses about “equivalence” or “no distinction” will hereafter be cited as 0. The same ideas and approaches will also pertain, if expressed for a ratio, but the null hypothesis will be 1.
With ∆ representing the value of the stochastic hypothesis, the conventional “null” assumption, ∆ = 0, was stated (for two proportions) as Ho : πA − πB = 0. In the new procedure, however, the hypothesis to be rejected is the alternative statement that ∆ ≥ δ. The symbols would be HH : πA − πB ≥ δ.
23.4.1.1 Imprecise Counter-Hypothesis — In the logic of stochastic testing, a primary hypothesis can be rejected or conceded, but never accepted. The primary hypothesis is therefore set to
be the opposite of what we would like to conclude; and when the hypothesis is rejected, we concede the counter-hypothesis. To be the direct opposite of the null hypothesis, ∆ = 0, the counter-hypothesis must be imprecise, without a stipulated focal point. If ∆ = 0, the counter-hypothesis can be either a two-tailed ∆ ≠ 0, or in a one-tailed direction, ∆ > 0 or ∆ < 0; but it cannot be ∆ = δ.
This logic is responsible for the “boundless significance” discussed in Section 23.1.2.3.1. Suppose δ = .15 is set as the level of quantitative significance for an increment in two proportions, and suppose
the results show that pA = 27/42 = .64 and pB = 11/39 = .28. For the quantitatively significant distinction of do = pA − pB = .36, the value of Zo turns out to be 3.25. Although the observed value of do can now be deemed stochastically significant at a two-tailed P < .05, the actual stochastic conclusion is only that ∆ ≠ 0. The observed do = .36 acquires its label of “stochastic significance” merely by being compatible with the stochastic conclusion. This same conclusion could have been obtained with adequately large
group sizes if do were smaller, at .19. The observed do could even be stochastically significant when smaller than δ, at values of .10, .07, or .03. For example, suppose pA = 238/2166 = .11 and pB = 80/2184 =
.04, so that do = .07. This result is substantially smaller than δ = .15, but it would produce Zo ≈ 9, for
which 2P < .05. Despite the relatively small do, we can still reach the same stochastic conclusion, i.e., ∆ ≠ 0, as with the previous big do.
23.4.1.2 Precise Location for Alternative Hypothesis — Unlike a counter-hypothesis, the alternative hypothesis has its own specific dignity and focal location. Like any other stochastic hypothesis, the alternative hypothesis can be rejected or conceded but not accepted. When considered for the
possibility of being false, i.e., rejected, the alternative stochastic hypothesis must have a precisely specified location, analogous to the precision of ∆ = 0.
If the goal is to get stochastic confirmation that do < δ, the precise value of the alternative hypothesis is usually set at δ. With HH as the symbol, the alternative hypothesis becomes expressed as HH : ∆ = δ; and its counter-hypothesis becomes ∆ ≠ δ. For reasons to be cited shortly, the alternative hypothesis is
almost always checked in a one-tailed direction. The appropriate statement would then be HH : ∆ ≥ δ; and the counter-hypothesis would be ∆ < δ. For simplicity of expression and calculation, however, the
usual statement is simply HH : ∆ = δ. The directional issues are implicit when the results are interpreted. In a contrast of two proportions, pA and pB, the parametric alternative hypothesis is HH : πA − πB = δ, or (if specifically two-tailed) HH : |πA − πB| = δ. In a contrast of two means, X̄A and X̄B, the same
principles are used, but the parameters in the hypothesis are µA and µB.
With this operating principle, we can explore what has been called “the other side of statistical
significance,”9 by considering what happens if the original null hypothesis is false and should have been rejected. Its falsity is explored with the alternative stochastic hypothesis, for which δ replaces the null-
hypothesis value of 0. Under the alternative hypothesis, the observed value of do is examined as the increment of δ − do.
FIGURE 23.1
Location of do in reference to distributions for original null hypothesis (upper drawing) and for alternative hypothesis (lower drawing).
23.4.2 Alternative Standard Error
Under the alternative hypothesis (as under the null hypothesis), the increment in two means or in two proportions continues to have a theoretical Gaussian (or Gossetian) sampling distribution for values of Z (or t). Figure 23.1 shows the location of the observed do and the potential Gaussian distributions of increments under each of the two stochastic hypotheses, ∆ = 0 and ∆ = δ.

As noted earlier, the standard error of a difference in two central indexes, symbolized as SED, is calculated differently when ∆ = δ rather than ∆ = 0. For contrasting two observed proportions with ∆ = 0, Formula [15.9] for SED is

SEDo = √[NPQ ⁄ (nA nB)]

but for ∆ ≥ δ, the analogous calculation in Formula [15.12] is

SEDH = √[(pA qA ⁄ nA) + (pB qB ⁄ nB)]
Both calculations can be eased with the “shortcut” formulas shown earlier in Expressions [14.11] and [14.13].
The actual difference between the two SEDs, however, is usually small and inconsequential (see Section 14.4.2). For example, although the SED was calculated as SED0 in Section 23.1.2.2.2, the procedure was aimed at rejecting the alternative hypothesis that ∆ ≥ δ. Consequently, the calculation should have used SEDH, which would have been √{[(14)(15) ⁄ 29³] + [(12)(16) ⁄ 28³]} = .132. The result (at three decimal places), however, is the same as the previously calculated SED0 = .132.
In many ensuing discussions in this text, the SED symbol will be used in a general way, regardless of which formula it comes from. For illustrative calculations, SEDH will be used when it is particularly pertinent, but SED0 will often be preferred because of its greater general applicability. A single calculation for SED0 has the advantage of letting the same confidence interval sometimes be used, as in Section 23.1.2.2.2, for checking both the null hypothesis (in the lower boundary) and the alternative hypothesis (in the upper boundary).
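A brief sketch comparing the two SEDs for the 14/29 vs. 12/28 data confirms that the distinction is inconsequential here (illustrative code, with the text's numbers):

```python
# A brief sketch comparing SED0 (null hypothesis) with SEDH (alternative
# hypothesis) for the data of Section 23.1.2.2.2.
from math import sqrt

sA, nA = 14, 29        # pA = .483
sB, nB = 12, 28        # pB = .429

# SED0: pooled shortcut form of Formula [15.9] / [14.11]
sed0 = sqrt((sA + sB) * ((nA - sA) + (nB - sB)) / ((nA + nB) * nA * nB))

# SEDH: unpooled form of Formula [15.12] / [14.13]
sedH = sqrt(sA * (nA - sA) / nA**3 + sB * (nB - sB) / nB**3)

print(f"SED0 = {sed0:.3f}, SEDH = {sedH:.3f}")   # both round to .132
```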
23.4.3 Determining ZH and PH Values
Using the alternative hypothesis, the symbols ZH and PH will correspond to the Zo and Po obtained with the ordinary null hypothesis. For comparing two groups, the alternative Z values will come from the formula
ZH = (δ − do) ⁄ SED          [23.3]

With the alternative SEDH calculated for two proportions, the formula will be

ZH = (δ − do) ⁄ √[(pA qA ⁄ nA) + (pB qB ⁄ nB)]          [23.4]
For example, in a clinical trial where pA = 9/18 and pB = 8/17, so that do = .500 − .471 = .029, the value of SEDH under the alternative hypothesis will be

√{[(9)(9) ⁄ 18³] + [(8)(9) ⁄ 17³]} = .169

If δ is designated as .15,

ZH = (.15 − .029) ⁄ .169 = .716
The values of ZH are interpreted as P values in exactly the same way as under the conventional null hypothesis. At the Gaussian value of ZH = .716, the two-tailed PH is .47. Thus, there is a two-tailed chance of .47, and a one-tailed chance of .235, that the observed result of do = .029 came from a population in which the true difference was as large as .15.
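A minimal sketch of this calculation (assuming SciPy for the Gaussian tail probability; the names are illustrative):

```python
# A minimal sketch of Formula [23.4] for the trial in Section 23.4.3.
from math import sqrt
from scipy import stats

sA, nA = 9, 18         # pA = .500
sB, nB = 8, 17         # pB = .471
delta = 0.15

do = sA / nA - sB / nB                                         # .029
sedH = sqrt(sA * (nA - sA) / nA**3 + sB * (nB - sB) / nB**3)   # .169
zH = (delta - do) / sedH                                       # .716
pH2 = 2 * stats.norm.sf(zH)                                    # two-tailed .47
print(f"ZH = {zH:.3f}, 2PH = {pH2:.2f}, one-tailed PH = {pH2 / 2:.3f}")
```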
23.4.4 Role of β
For the original null hypothesis, the α level establishes the boundary of α-error or Type I error for false positive conclusions if a correct hypothesis is rejected. For the alternative hypothesis, a corresponding level, called β, establishes the boundary of β-error or Type II error for the relative frequency of wrong decisions if a correct alternative hypothesis is rejected. If HH is true, its rejection would lead to the false negative conclusion that the two groups are not substantially different, when in fact they are.
Table 23.1 shows the use of α and β levels in reasoning for the original null-hypothesis decision. If the null hypothesis that ∆ = 0 is correct, there is an α chance that rejection is wrong, and a 1 − α chance that concession is right. If the true state of affairs is ∆ = δ, however, concession of the original null hypothesis has a β chance of being wrong, and rejection has a 1 − β chance of being correct.

TABLE 23.1
α, β, and Accuracy of Stochastic Decisions for Null Hypothesis

Conclusion RE Stochastic              True State of Reality
Hypothesis That ∆ = 0          ∆ = δ                        ∆ = 0

Reject                         True Positive                False Positive
                               Conclusion (1 − β)           Conclusion (α)

Concede                        False Negative               True Negative
                               Conclusion (β)               Conclusion (1 − α)

23.4.5 Analogy to Diagnostic Marker Tests
The statistical parlance does not use the language of diagnostic marker decisions,
but the concepts are almost identical. Suppose a pap smear is done as a diagnostic marker test for a cancer. If the pap smear result agrees with the definitive tissue biopsy, the pap smear conclusion is either a true positive or true negative. If the pap smear and biopsy disagree, the original conclusion is either falsely positive or falsely negative.
23.4.5.1 False Positive Conclusions — If the null hypothesis is rejected with Po < α, there is still a probability of Po that the rejection is wrong. The selected value of α is the upper boundary of risk for the false positive conclusion. Thus, if α is set at .05 and stochastic significance is proclaimed when Po = .049, the two groups may still be truly similar, and the probability is .049 that the decision is wrong. With α set at a higher level of .1, the quantitative range of false positive conclusions is expanded. The two groups might really be similar and the decision that they are different might be wrong in .03, .06,
.08, .09, or .099 of the occasions when the null hypothesis is rejected at the corresponding values of P < .1. When α is set as the boundary of “risk” for a false positive decision, the level of 1 − α is analogous to the specificity of a diagnostic test. In previous usage, 1 − α helped establish the boundaries of a
confidence interval. In the application here, 1 − α helps denote the relative “confidence” attached to a stochastic decision to concede the null hypothesis.
23.4.5.2 False Negative Conclusions — If the original null hypothesis is rejected as false, we
infer that the parent universe has a big distinction (at least as big as the observed do), rather than none. If the null hypothesis is conceded, however, and if the parent universe really does have a big distinction, the concession will be a false-negative conclusion. Because β is set as the permissible frequency of false-negative conclusions, the value of 1 − β is analogous to the sensitivity of a diagnostic test.
23.4.5.3 Role of Horizontal “Gap” — When “vertical” indexes of sensitivity and specificity are used in diagnostic decisions (see Chapter 21), we cannot immediately make “horizontal” appraisals of accuracy, because the prevalence of diseased cases will vary in different clinical situations. An analogous problem prevents horizontal conclusions in stochastic decisions, but the problem does not
arise from prevalence. The stochastic problem is caused by the numerical gap that separates ∆ = 0 from ∆ = δ in Table 23.1. If the observed value of do lies in the intermediate zone where 0 < do < δ, we might have to concede (or reject) both the original null and the alternative hypotheses.
23.4.6 Choice of β
Stated as ∆ ≥ δ, the alternative hypothesis obviously has a clear direction and could therefore be tested with a one-tailed choice of β. Accordingly, for a .05 level of rejection, Zβ could be set at Z.1 = 1.645.
The concept becomes important if confidence intervals are used to examine both the null and the alternative hypotheses. In previous examples, this examination was done with a “single” arrangement, constructed as
do ± Zα(SEDo )
A more accurate approach, however, would require two arrangements:
do – Zα(SEDo )
would be used to locate the lower border, and
do + Zβ(SEDH )
would indicate the upper border.
If the alternative hypothesis is ∆ ≥ δ, a one-tailed 95% confidence interval can be used to check the upper border, which will be enlarged if calculated with a two-tailed Zα, rather than with a one-tailed Zβ. Unless the original null hypothesis was clearly expressed in advance as ∆ > 0 or ∆ < 0, however, a one-tailed calculation is not appropriate for the lower border of the confidence interval.
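As an illustration with the Section 23.1.2.2.2 numbers (where SED0 and SEDH happened to be identical), the two arrangements could be sketched as follows (illustrative code):

```python
# An illustrative sketch of the two-arrangement interval: a two-tailed
# Z_alpha for the lower border, a one-tailed Z_beta for the upper border.
do = 0.054
sed0 = 0.132           # SED under the null hypothesis
sedH = 0.132           # SED under the alternative hypothesis (here identical)
z_alpha = 1.96         # two-tailed .05
z_beta = 1.645         # one-tailed .05

lower = do - z_alpha * sed0    # -.205: the value 0 is included
upper = do + z_beta * sedH     #  .271: delta = .15 is still included
print(f"lower = {lower:.3f}, upper = {upper:.3f}")
```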
[This distinction led to a major legal battle between the U.S. tobacco industry and the Environmental Protection Agency (EPA), which had done a meta-analysis of results for lung cancer attributed to environmental tobacco exposure (i.e., “passive smoking”). Certain crucial odds ratios that were not stochastically significant in two-tailed 95% confidence intervals, calculated with Zα = 1.96, became “significant” when the EPA’s 90% confidence intervals, calculated with Zα = 1.645, excluded the null value of 1 from the lower border. The tobacco industry contended that substituting 90% for the customary 95% criterion was a political rather than scientific decision. (The argument included other scientific disputes beyond the accusation of “rigged” confidence intervals.)]
Both the lower and upper margins of confidence intervals should be examined when investigators either claim stochastic significance in rejecting ∆ = 0, or argue that the upper level of “risk” might be much higher than what was found in the observed do. Rejection of ∆ = 0 is easier if the interval is calculated with a one-tailed Zα; the converse claims of a potentially larger do are facilitated with a two-tailed Zβ.
23.5 The Concept of “Power”
The unfamiliar term capacity was used in Section 23.2 to refer to group sizes that were too small to do the desired job of rejecting the original null hypothesis when do was “big.” The word is unfamiliar because statisticians regularly use the term power in reference to the adequacy of group (or sample) sizes. The idea of power, however, refers to the ability to reject the alternative, rather than the null, stochastic hypothesis.
23.5.1 Statistical Connotation of “Power”
In the customary “test of significance,” a big distinction has been observed, and the stochastic question is “How small might it have been?” If the Po value exceeds α, or if the lower end of a 1 − α confidence interval includes the null hypothesis value of 0, the distinction is not stable enough for its “quantitative significance” to be confirmed stochastically.
The “other side of statistical significance” is examined when the observed distinction is small, or obviously not big, i.e., do < δ. The stochastic question is then “How large might it have been?” This question can be answered with a direct counterpart of the former reasoning. If the PH value exceeds β, or if the upper end of a 1 − β confidence interval includes the alternative hypothesis value of δ, the quantitative “nonsignificance” is not confirmed. Although small, the observed distinction might really be big. Thus, rejection of the alternative stochastic hypothesis is intended to confirm that the observed small distinction is really small.
Although the ability of a group size to reject a stochastic hypothesis is often called “power,” the statistical definition of power is much more constrained. When δ and β are set in advance, 1 − β is called the statistical power to reject the alternative hypothesis that ∆ ≥ δ. This prospective concept of power is also sometimes applied in retrospect, after a study is completed. When ZH is determined from Formula [23.4] and converted to PH, the value of 1 − PH may be called “power.”
The latter usage of “power” has been vigorously disputed,10 however, because the strict definition requires a single boundary value of δ that was designated before the research began. This “prospective” boundary is often not established, however; and in its absence, the investigator or data analyst can
retrospectively make various choices of δ. Each choice would yield different results for ZH and for the 1 − PH value of “power.” For example, consider the clinical trial in Section 23.4.3, where pA = 9/18, pB = 8/17, do = .029, and SEDH = .169. When δ was chosen to be .15, ZH = .716, and the one-tailed value of 1 − PH was 1 − .235 = .765. If δ is set at .10, ZH = .420, 2PH = .674, and 1 − PH = .663. If δ is set at .20, ZH = 1.01, 2PH = .312, and 1 − PH = .844. To get an impressively high value of power, we could set δ at .32. ZH would then be 1.72, 2PH = .085, and 1 − PH = .9575.

With these arbitrary retrospective choices of δ, however, “power” would become a type of “variable,”
rather than a distinctive, fixed attribute of the study. The opponents of this retrospective manipulation of “power” argue that the best way to answer the retrospective question, “How big might it have been?” is with an appropriate confidence interval (or perhaps a form of Bayesian strategy). Thus, in the foregoing example, Zβ would be 1.645 for the upper end of a one-tailed 95% confidence interval, constructed around do as
.029 + (1.645)(.169) = .307.
With this result, we could “rule out” the possibility that the true difference was as large as .32, but not that it might be as large as .30.
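The instability of retrospective “power” is easy to demonstrate; the sketch below (assuming SciPy; illustrative names) recomputes 1 − PH for each arbitrary choice of δ:

```python
# A hedged sketch of how retrospective "power" (1 - PH) shifts with each
# arbitrary choice of delta, for the trial with do = .029 and SEDH = .169.
from scipy import stats

do, sedH = 0.029, 0.169
for delta in (0.10, 0.15, 0.20, 0.32):
    zH = (delta - do) / sedH
    pH = stats.norm.sf(zH)              # one-tailed PH
    print(f"delta = {delta:.2f}: ZH = {zH:.2f}, 1 - PH = {1 - pH:.3f}")
# approximately .66, .76, .84, .96, matching the text within rounding
```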
23.5.2 Comparison of “Capacity” and “Power”
Because power refers to the alternative hypothesis, the term capacity was introduced here for the ability of a group (or sample) size to achieve “single significance” by rejecting the original stochastic hypothesis. If the scientific goal of the research is to find something big, the original stochastic hypothesis is ∆ = 0;