14.4.3.2 Focus of Conclusions — In Chapter 10, we saw how an investigator can honestly and legitimately, but somewhat deceptively, make unimpressive results seem impressive. Instead of being cited as unimpressive increments, they can be cited as impressive ratios or proportionate increments. An analogous tactic is available when the investigator chooses Zα and emphasizes a directional focus for the confidence interval.
If the confidence interval includes the null hypothesis value of 0, the result is not stochastically significant. In this circumstance, many investigators (especially if the group sizes are small) will examine the ends of the interval to see how big pA − pB might really be. Such an examination is quite reasonable, but it may sometimes be used zealously by an investigator who is committed to advocating a “significance” that was not confirmed stochastically. The investigator may emphasize and focus on only the large potential difference shown at one end of the confidence interval. If Zα is enlarged, the difference can be made even larger. The other end of the interval, of course, will indicate a reverse possibility—but its existence may then be conveniently ignored. (The prevention of this type of stochastic abuse was one of the reasons cited by Fleiss8 for insisting on P values.)
In other situations, as discussed earlier in Chapter 13, when the null hypothesis parameter is excluded from the conventional 95% confidence interval, an investigator who is avidly seeking “significance” can ecstatically proclaim its presence. A cautious analyst, however, will examine both the range of the confidence interval and its smaller end. If the range is too wide and the smaller end is close to the null-hypothesis value, the relatively unstable result may have “just made it” across the stochastic border. A “fragility shift” of one or two persons might often drastically alter the result and even remove the “significance.”
14.4.4 Criteria for “Bulk” of Data
If a confidence interval is being used for descriptive decisions about how big a difference might be, and if Zα is regularly chosen to be the same value of 1.96 used for a two-tailed α = .05, the stochastic tail of the conventional α is allowed to wag the descriptive dog of quantitative significance.
Arguing that descriptive significance depends on bulk of the data, rather than arbitrary stochastic boundaries, some analysts may prefer to use Z values that reflect differences established by bulk of data, not by stochastic doctrines about level of α . For example, a two-tailed Z.75 = 1.15 will span 75% of the Gaussian increments, and Z.50 = .6745 will span 50%. The inner 50% of the data is the interquartile range spanned by a box plot. In the early days of statistical inference, before α was set at .05 by R. A. Fisher’s edict, the 50% zone of .6745 × standard error was regarded as the “probable error of a mean.” The investigators were satisfied with the idea that a mean might reasonably vary within this 50% zone, and they did not demand a larger zone that spanned 95% of the possibilities.
If this same idea is used for a descriptive possibility regarding the upper boundary of a “nonsignificant” increment, the confidence interval might be better calculated with Zα set at smaller values of .6745 (or perhaps Z.75 = 1.15), rather than with the arbitrary 1.96. For example, when the previous contrast of 18/24 = .75 vs. 10/15 = .67 failed to achieve stochastic significance, the potential interval ranged from −.21 to +.37 if calculated with the Gaussian Zα = 1.96. With Zα set at a descriptive .6745, however, the interval would be .08 ± (.6745)(.150) and would extend from −.02 to +.18. We are not concerned about the low end here, but the upper value of .18—which produces an interval spanning 50% of the potential increments—might be a more realistic estimate for the descriptive range of the upper boundary in denoting how much larger pA might be than pB.
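As an illustrative sketch (not part of the text), the two intervals for the 18/24 vs. 10/15 contrast can be reproduced with Python's standard library; the SED of about .15 comes from the common-proportion formula used earlier in the chapter:

```python
from math import sqrt
from statistics import NormalDist

# Observed proportions: 18/24 = .75 vs. 10/15 = .67
nA, nB = 24, 15
pA, pB = 18 / nA, 10 / nB
increment = pA - pB                           # ~.08

# Standard error of the difference, from the common proportion P = 28/39
P = (18 + 10) / (nA + nB)
sed = sqrt(P * (1 - P) * (1 / nA + 1 / nB))   # ~.15

def interval(z):
    return (increment - z * sed, increment + z * sed)

z95 = NormalDist().inv_cdf(0.975)   # 1.96: conventional 95% interval
z50 = NormalDist().inv_cdf(0.75)    # .6745: descriptive 50% "probable error" zone

lo95, hi95 = interval(z95)   # about -.21 to +.37
lo50, hi50 = interval(z50)   # about -.02 to +.18
```

Shrinking Zα from 1.96 to .6745 narrows the interval to the inner 50% zone, which is why the upper boundary drops from .37 to the more modest .18 discussed above.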
This type of estimate may become attractive when data analysts begin to give serious attention to descriptive rather than inferential decisions about “significance” in statistical comparisons.
14.5 Calculating Sample Size to Contrast Two Proportions
Advance sample sizes are usually calculated, before the research begins, to avoid the frustration of finding quantitatively impressive results in group sizes too small to achieve stochastic significance. The
© 2002 by Chapman & Hall/CRC
calculation seems to use a relatively simple formula noted later, but the procedure involves decisions, assignments, and directional estimates that have important but often overlooked consequences. In most situations, the investigator expects to observe do = pA − pB as an increment in two proportions, and also hopes that do will exceed δ, which is the lower level of quantitative significance. After choosing an α level of stochastic significance, the investigator wants a sample size whose capacity will produce P < α, or a suitably small 1 − α confidence interval.
14.5.1 “Single” vs. “Double” Stochastic Significance
In most trials, the investigators would like to get stochastic significance at the chosen level of α for an observed do having a “big” value that is ≥ δ . In the past two decades, however, statisticians have worried about a different problem that occurs when the anticipated do turns out to be smaller than δ . The stochastic conclusion that such results are nonsignificant may then be a “Type II” or “false negative” error. These conclusions differ from the “Type I” or “false positive” errors that are presumably avoided by a suitably small choice of α. Avoidance of Type II errors involves considerations of statistical “power” that will be discussed in Chapter 23. The main point to be noted now, however, is that an investigator who wants to avoid both false positive and false negative errors will enlarge the sample size substantially by seeking “double stochastic significance” at the level of α and also at a later-discussed level of β .
The rest of the discussion here describes the conventional methods of determining sample sizes that will achieve “single significance” at the chosen level of α. For the customary “single-significance” null-hypothesis contrast of two proportions, sample sizes are calculated with the Z procedure, which yields the same results when used with either a P-value or confidence-interval strategy. The “prospective” calculations are conceptually more difficult than “retrospective” appraisals after the research is completed, however, because no data are available in advance. Everything must be assigned and estimated. In addition to choosing assignments for δ and α, the investigator must estimate both πB and the common value of π.
14.5.2 Assignments for δ and α
Sample-size decisions begin with two assignments. The first is to set a boundary for δ, the magnitude of the impressive quantitative distinction desired (or expected) between pA and pB. As noted earlier (Chapter 10), this choice will depend on the conditions being investigated.
The second assignment sets a level for α. This decision is crucial in both prospective and retrospective stochastic activities because the corresponding level of Zα is used prospectively to calculate sample size, and retrospectively either to evaluate an observed Zo for “significance” or to calculate the magnitude of the confidence interval.
As noted in earlier discussions, the choice of α involves decisions for both magnitude and direction. The customary boundary, .05, can be made larger or smaller in special circumstances (such as the multiple-comparison problem mentioned in Chapter 11). The customary direction is two-tailed, so that the Gaussian value of Zα = Z.05 = 1.96. If the direction is one-tailed, however, the level of α is doubled, and Z2α = Z.10 = 1.645.
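As a quick sketch (illustrative code, not from the text), the correspondence between the level of α and Zα can be checked with Python's standard library (`statistics.NormalDist` is available from Python 3.8):

```python
from statistics import NormalDist

alpha = 0.05

# Two-tailed: alpha is split across both tails of the Gaussian curve
z_two_tailed = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96

# One-tailed: the whole alpha goes into a single tail,
# which is equivalent to a two-tailed boundary at 2*alpha
z_one_tailed = NormalDist().inv_cdf(1 - alpha)       # 1.645
```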
14.5.3 Estimates of πA, πB, and π
The next decisions involve three estimates, for which a plethora of symbols can be applied. We can use pA to represent an observed value, π̂A for an estimated value, and πA for a parametric assumption about pA. To avoid the confusion of probabilities and proportions that can be represented with p or P symbols, we shall here use “hatted Greeks,” such as π̂A, for estimated values. The main basic estimate is π̂B, the value anticipated for pB in the “reference” group, which can be receiving no treatment, placebo, or a compared active agent. After π̂B is chosen, pA can be immediately estimated as π̂A = π̂B ± δ. For reasons discussed throughout Section 14.6, however, the choice between +δ or −δ is usually avoided, and the increment is usually estimated in a noncommittal two-tailed manner as |π̂A − π̂B| = δ.
As the counterpart of the common proportion, P, the value of π is estimated as

π̂ = (nAπ̂A + nBπ̂B)/N [14.14]
The counterpart of Q = 1 − P cannot be easily written in Greek (which has no symbol for Q), and is therefore expressed as 1 − π̂. The requirements of Formula [14.14] bring in the values of nA, nB, and N for which sample size will eventually be calculated.
14.5.4 Decision about Sample Allocation
Although aimed at determining the total sample size, N, the Z formula cannot be applied until the standard error of the difference is estimated. For this estimate, we need to use π̂ = (nAπ̂A + nBπ̂B)/N, but, in addition to not knowing the value of N, we do not know the values of either nA or nB.
14.5.4.1 Equal Sample Sizes — The approximation of nA and nB is usually resolved by assuming that the two groups will have equal size, so that nA = nB = N/2. We can then use n = N/2 for the size of each group. With this decision, the value of π̂ becomes (nπ̂A + nπ̂B)/2n = (π̂A + π̂B)/2. The value of N/(nAnB) becomes 2n/[(n)(n)] = 2/n. The value of SEDo is then estimated as

√[2π̂(1 − π̂)/n] [14.15]

(With the alternative-hypothesis formula, the estimated standard error for SEDH would be √{[π̂A(1 − π̂A) + π̂B(1 − π̂B)]/n}.)
14.5.4.2 Unequal Sample Sizes — If for some reason the investigator wants unequal sized groups, the process is still relatively easy to manage. In this situation the samples will be divided as nB = kN and nA = (1 − k)N, where k is a prespecified value for the sample-allocation fraction. For example, if twice as many people are wanted in the treated group as in the control group, and if nB is the size of the control group, the sample-allocation fraction will be k = 1/3, with 1 − k = 2/3. The value of π̂ will be [kNπ̂B + (1 − k)Nπ̂A]/N = kπ̂B + (1 − k)π̂A. The value of N/(nAnB) will be N/[(kN)(1 − k)N] = 1/[(k)(1 − k)N]. The standard error would then be estimated from the √[NPQ/(nAnB)] formula as √{[kπ̂B + (1 − k)π̂A]/[(k)(1 − k)N]}. If k is specified as 1/3, the formula would be √{[(π̂B/3) + (2π̂A/3)]/[2N/9]}, which becomes √[(3π̂B + 6π̂A)/2N].
In all the estimates discussed here, the samples will be calculated, in the conventional manner, as having equal sizes.
14.5.4.3 Tactics in Calculation — With

Zo = do/SED [14.16]

as the customary formula for getting to a P value, we now seem to have everything needed to calculate sample size by suitably substituting into Formula [14.16], where do = observed difference in the two proportions, SED = standard error of the difference, and Zo is the value of Z for the observed results. The substituted values will be the anticipated δ = π̂A − π̂B for do; and the anticipated SED, for equal sample sizes, will be SED = √[2π̂(1 − π̂)/n]. When these values are substituted, we can solve Expression [14.16] for n (or N if an appropriate formula is used for unequal sample sizes).
The Z value at δ will be calculated as

Zδ = δ/√[2π̂(1 − π̂)/n] [14.17]
For stochastic significance we want Zδ to exceed the selected level of Zα. Therefore,

δ/√[2π̂(1 − π̂)/n] ≥ Zα [14.18]

To solve for n, we square both sides and determine that

n ≥ [2π̂(1 − π̂)(Zα2)]/δ2 [14.19]
14.6 Problems in Choosing π̂
Formula [14.19] is the standard recommendation for calculating sample size if the goal is to achieve results in which either P < α or the 1 − α confidence interval has a suitable lower end. The formula relies on the assumption that π̂ will be calculated for equal sized groups as (π̂A + π̂B)/2, but the formula does not indicate whether π̂A is expected to be larger or smaller than π̂B. If π̂A = π̂B + δ, π̂ becomes π̂B + (δ/2). If π̂A = π̂B − δ, π̂ becomes π̂B − (δ/2).
14.6.1 Directional Differences in Sample Size
The different directional possibilities for π̂ can substantially affect the variance calculated as π̂(1 − π̂) in the numerator of Formula [14.19]. For example, suppose δ = .15 is set as an impressive difference in two proportions and π̂B = .33 is estimated for Group B. If Group A has a larger value, π̂A will be .33 + .15 = .48, π̂ will be (.33 + .48)/2 = .405, and π̂(1 − π̂) will be (.405)(.595) = .2410. Conversely, if Group A has a smaller value, π̂A will be .33 − .15 = .18, π̂ will be (.18 + .33)/2 = .255, and π̂(1 − π̂) will be (.255)(.745) = .1900.
With Zα set at a two-tailed 1.96 for α = .05, the insertion of the foregoing values in Formula [14.19] will produce n ≥ [2(.2410)(1.96)2]/(.15)2 = 82.3 for the first estimate, and n ≥ [2(.1900)(1.96)2]/(.15)2 = 64.9 for the second. The first sample size will exceed the second by the ratio of the two variances, which is .2410/.1900 = 1.27, a proportionate increase of 27%.
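The two directional sample sizes can be verified with a short sketch of Formula [14.19] (illustrative code, not from the text):

```python
def n_per_group(pi_hat, delta, z_alpha=1.96):
    # Formula [14.19]: n >= 2 * pi * (1 - pi) * Z_alpha^2 / delta^2
    return 2 * pi_hat * (1 - pi_hat) * z_alpha ** 2 / delta ** 2

delta, pi_B = 0.15, 0.33

pi_up = ((pi_B + delta) + pi_B) / 2      # Group A larger: common pi = .405
pi_down = ((pi_B - delta) + pi_B) / 2    # Group A smaller: common pi = .255

n_up = n_per_group(pi_up, delta)         # ~82.3
n_down = n_per_group(pi_down, delta)     # ~64.9

# The two estimates differ by the ratio of the two variances
ratio = (pi_up * (1 - pi_up)) / (pi_down * (1 - pi_down))   # ~1.27
```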
Consequently, the direction of the anticipated increment in proportions does more than just affect the choice of a one- or two-tailed level of α. It extends into the “guts” of calculating SED and, thereby, sample size.
14.6.2 Compromise Choice of π̂
The different values of n occurred because the value of π̂A was estimated directionally as π̂B + δ in one instance and as π̂B − δ in the other. For a noncommittal two-tailed hypothesis, a compromise decision using δ/2 can produce estimates for a different π̂A′ and π̂B′ instead of π̂A and π̂B. With this compromise, the value π̂′(1 − π̂′) will always have the same result whether π̂A is larger or smaller than π̂B.
Thus, we can estimate π̂B′ as π̂B − (δ/2) and π̂A′ = π̂B + (δ/2). Alternatively, π̂B′ can be estimated as π̂B + (δ/2) and π̂A′ will be π̂B − (δ/2). In both situations, |π̂A′ − π̂B′| will be δ; the common proportion π̂′ will be (π̂A′ + π̂B′)/2 = π̂B; and the “compromise” value of π̂′(1 − π̂′) will be π̂B(1 − π̂B).
In the foregoing example, with δ set at .15, π̂A′ might be .33 + .075 = .405 and π̂B′ would be .33 − .075 = .255, with π̂′ = (.405 + .255)/2 = .330, which is the value of π̂B. Alternatively, π̂A′ could be .33 − .075 = .255 and π̂B′ would be .33 + .075 = .405, yielding the same result for π̂′.
In giving the same result for the substitute values of π̂′(1 − π̂′), regardless of direction, the compromise choice of π̂B(1 − π̂B) avoids the two conflicting sample-size estimates created by a directional choice, but the compromise uses π̂B for the single estimated value.
14.6.3 “Compromise” Calculation for Sample Size
With the compromise tactic, the “theoretical” Formula [14.19] becomes the “working” formula

n ≥ [2π̂B(1 − π̂B)(Zα2)]/δ2 [14.20]

For the data under discussion, Formula [14.20] will produce n ≥ [(2)(.33)(.67)(1.96)2]/(.15)2 = 75.5. As might have been expected, this value lies between the two previous estimates of 82.3 and 64.9. With the compromise choice, the results should be stochastically significant, at 2P ≤ .05, if the sample size is at least 76 persons in each group. The total sample size would be 152.
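The compromise arithmetic can be checked with a brief sketch of Formula [14.20] (illustrative code, not from the text):

```python
from math import ceil

pi_B, delta, z_alpha = 0.33, 0.15, 1.96

# Formula [14.20]: n >= 2 * pi_B * (1 - pi_B) * Z_alpha^2 / delta^2
n = 2 * pi_B * (1 - pi_B) * z_alpha ** 2 / delta ** 2   # ~75.5

per_group = ceil(n)        # 76 persons in each group
total = 2 * per_group      # 152 in all
```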
The compromise tactic has two potential hazards. One of them occurs in a really close situation where pA turns out to be pB + δ . The compromise sample size—which is smaller than the 166 needed for the larger estimate—may just fail to bring the calculated P value or confidence interval across the desired stochastic border. The second hazard is a scientific problem discussed in the next section.
14.6.4 Implications of Noncommittal (Two-Tailed) Direction
Although statistically comforting and conventional, a noncommittal two-tailed direction has important scientific implications. They suggest that the investigator is disinterested in the direction of the results, does not care which way they go, and is essentially working in a laissez-faire manner, without a scientific hypothesis. The indifference, of course, can be statistically proper and valuable if pA − pB might acceptably go in either direction, for which a two-tailed null hypothesis is necessary.
The scientific idea being tested in the research itself, however, is the counter-hypothesis of the stochastic null hypothesis, which is established as π̂A − π̂B = 0 so that it can be rejected. With the rejection, we accept the counter-hypothesis, but it can be either a directional π̂A − π̂B = ∆, with ∆ > 0 or ∆ < 0, or a noncommittal π̂A − π̂B = ∆ with ∆ ≠ 0.
Very few scientific investigators would be willing to invest the time and energy required to do the research, particularly in a full-scale randomized trial, without any idea about what they expect to happen and without caring which way it turns out. Furthermore, if the expectation were actually completely indifferent, very few agencies would be willing to fund the research proposal. A distinct scientific direction, therefore, is almost always desired and expected whenever an active treatment is compared against no treatment or a placebo; and even when two active treatments are compared, the investigator is almost never passively indifferent about an anticipated direction.
Nevertheless, the noncommittal two-tailed approach has been vigorously advocated, commonly employed, and generally accepted by many editors, reviewers, and readers. The one-tailed approach, however, is probably more important when the results are interpreted afterward than when sample size is calculated beforehand.
The reason for the latter statement is that the selection of π̂ often becomes a moot point after the augmentations discussed in Section 14.6.5, which usually increase the calculated sample size well beyond the largest value that emerges from different choices of π̂. Thus, the possibility of a one-tailed interpretation is most important after the study is completed, particularly if pA and pB have a relatively small increment.
14.6.5 Augmentation of Calculated Sample Size
Regardless of how π̂ is chosen, cautious investigators will always try to enroll groups whose size is much larger than the calculated value of n.
If everything turns out exactly as estimated, the observed value of pA – pB = do will be just ≥ δ , and do/SED should produce a value of Zo that exactly equals or barely exceeds Zα . The expected result can easily be impaired, however, by various human and quantitative idiosyncrasies. Some or many of the people enrolled in the trial may drop out, leaving inadequate data and a reduced group size. Some of the anticipated success rates may not turn out to be exactly as estimated.
For example, suppose the investigator enrolls 76 persons in each group of the foregoing trial, but data become inadequate or lost for two persons in each group. Suppose the available results come close to what was anticipated for the two groups, but the actual values are pA = 36/74 = .486 and pB = 25/74 = .338. When Zo is determined for these data, the result is [(36/74) − (25/74)]/√{[(61)(87)]/[(148)(74)(74)]} = .1486/.0809 = 1.84, which falls just below the Zα = 1.96 needed for two-tailed “significance.” (With everything already determined and calculated, it is too late for the investigator now to claim a unidirectional hypothesis and to try to salvage “significance” with a one-tailed P.)
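The fragile borderline result just described can be reproduced with a short sketch (illustrative code, not from the text):

```python
from math import sqrt

nA = nB = 74
a, b = 36, 25                     # observed "successes" in each group
pA, pB = a / nA, b / nB           # .486 and .338

N = nA + nB                       # 148
s, f = a + b, N - (a + b)         # 61 successes, 87 failures overall
P = s / N                         # common proportion

sed = sqrt(P * (1 - P) * (1 / nA + 1 / nB))   # ~.0809
Zo = (pA - pB) / sed                           # ~1.84, short of the 1.96 boundary
```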
For all these reasons, the calculated sample size is always augmented prophylactically. With the augmentation, some of the ambiguities in one/two-tailed directions and decisions for the estimated π̂ are eliminated, because the sample size usually becomes large enough to manage all the stochastic possibilities. On the other hand, if many of the augmented people really do drop out or produce unusable data, the results may eventually depend on the originally calculated sample size. In such circumstances, the decisions for magnitude and direction of α, and the estimates of π̂ and SED, can have important effects in either inflating the calculated sample size or reducing the eventual stochastic significance.
14.6.6 “Fighting City Hall”
This long account of problems in calculating sample size for two proportions is relatively unimportant if you are a reader of reported results, rather than a doer of research. As a reader, you will see what was observed and reported, together with the calculated P values and confidence intervals. You need to know the difference between one- and two-tailed decisions in the calculations and contentions, but the original sample-size strategies do not seem important. (They do become important, however, as noted later in Chapter 23, if the observed difference, do, was substantially smaller than the anticipated δ and became stochastically significant only because of an inflated sample size.)
As an investigator, however, your ability to carry out the research will directly depend on the estimated sample size. If it is too big, the project may be too costly to be funded, or too unfeasible to be carried out. Accordingly, you will want to get the smallest sample size that can achieve the desired goals of both quantitative and stochastic significance. For this purpose, you would use unidirectional scientific hypotheses, uncompromised choices for π , and one-tailed values of Zα . You may then need “ammunition” with which to “fight city hall,” if your statistical consultants or other authorities disagree.
The customary advice is to calculate sample size in a perfunctory manner by estimating π̂B, using the compromise δ/2 strategy for estimating π̂A′ and π̂B′, and then employing Formula [14.20]. The value of π̂A is almost never estimated, however, as π̂B − δ. Not fully aware of the numerical consequences of the bidirectional or unidirectional decisions, an investigator may not realize that the subsequent sample sizes can sometimes become much larger than they need to be. Fearing accusations of statistical malfeasance, an enlightened investigator may be reluctant, even when scientifically justified, to insist on declaring a direction, changing from a two-tailed to a one-tailed value of Zα, or estimating π̂A, when appropriate, as π̂B − δ rather than π̂B ± δ/2.
The great deterrent for most investigators, of course, is the risk of defeating “city hall” in the calculated sample-size battle, but then losing the “war” for statistical approval when the final results are submitted for publication. If the reviewers insist on a two-tailed Zα , the group sizes that so efficiently yielded a one-tailed P < .05 may not be big enough to succeed in rejecting the two-tailed null hypothesis. The results themselves may then be rejected by the reviewers.
In an era of cost–benefit analyses for almost everything, however, some of the exorbitant costs of huge sample sizes for large-scale randomized trials might be substantially reduced if appropriate re-eval- uation can be given to the rigid doctrine that always demands nondirectional, noncommittal two-tailed approaches for all research.
14.7 Versatility of X2 in Other Applications
Because stochastic contrasts for two proportions can readily be done with the Z or Fisher Exact procedures, chi-square is particularly valuable today for its role in other tasks that are briefly mentioned
now to keep you from leaving this chapter thinking that chi-square may soon become obsolete. Except for the following subsection on “goodness of fit,” all the other procedures will be discussed in later chapters.
14.7.1 “Goodness of Fit”
The “observed minus expected” tactic is regularly employed (see Section 14.1.1) to determine how well a particular model fits the observed categorical frequencies in a “one-way table,” for a single group of univariate data.
For example, in genetic work with certain organisms, we expect the ratios AB:Ab:aB:ab to be 9:3:3:1. A set of real data may not follow these ratios exactly, but their goodness of fit to the expected model can be determined using chi-square. The expected values in each category would correspond to the cited ratios multiplied by the total number of observations. Thus, if a group of genetic observations gave corresponding totals of 317, 94, 109, and 29, we could calculate X2 for the expected ratios as follows:
GROUP             AB       Ab       aB       ab       TOTAL
Observed Value    317      94       109      29       549
Expected Ratio    9/16     3/16     3/16     1/16     1
Expected Value    308.8    102.9    102.9    34.3     548.9 ≈ 549
(O − E)2/E        0.22     0.78     0.36     0.82     2.18 = X2
With 4 groups and 3 degrees of freedom, the value of P for ν = 3 at X2 = 2.18 is > .05 and so we cannot reject the idea that the observed data are well fitted to the 9:3:3:1 ratio.
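As a sketch of the goodness-of-fit arithmetic (illustrative code; the 7.815 critical value for ν = 3 at α = .05 comes from a standard chi-square table):

```python
observed = [317, 94, 109, 29]   # AB, Ab, aB, ab
ratios = [9, 3, 3, 1]           # the expected 9:3:3:1 model
total = sum(observed)           # 549

expected = [total * r / 16 for r in ratios]   # 308.8, 102.9, 102.9, 34.3
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# ~2.17 with unrounded expected values (the text's 2.18 uses rounded ones)

critical = 7.815      # chi-square boundary for alpha = .05 at nu = 3
fits = x2 < critical  # True: the 9:3:3:1 model cannot be rejected
```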
Some statisticians, such as J. V. Bradley,9 refer to this use of chi-square as a test for “poorness of fit.” The name seems appropriate because the stochastic test can “prove” only that the selected “model” fails to fit well, not that it is good or correct. The decision that the model is correct is an act of scientific judgment rather than statistical inference.
An alternative to the chi-square goodness-of-fit test for a one-way table is cited here mainly so you can savor the majestic sonority of its eponymic designation. Called the Kolmogorov–Smirnov test,10,11 it compares the cumulative frequencies of the observed distribution against the cumulative frequencies of the expected distribution. (The chi-square test compares the observed and expected frequencies for individual rather than for cumulative categories.)
14.7.2 McNemar Test for “Correlated Proportions”
The X2 test was adapted by Quinn McNemar, a psychologist, for examining “correlated proportions” in a manner that reduces two sets of results to one set, somewhat like the one-sample Z (or t) procedure for dimensional data. Being often used in medical research to test agreement among observers, the McNemar procedure will be discussed among the indexes of concordance in Chapter 20.
14.7.3 Additional Applications
Additional applications of X2, all to be discussed later, can be listed as follows: getting a Miettinen-type of confidence interval12 for an odds ratio, testing for stochastic significance in 2-way tables where either the rows or columns (or both) have more than 2 categories, evaluating linear trend for binary proportions in an ordinal array, doing the Mantel–Haenszel test for multiple categories in stratified odds ratios, and preparing the log-rank test calculations to compare two survival curves.
14.8 Problems in Computer Printouts for 2 × 2 Tables
If you do a chi-square test yourself, the formulas cited earlier are easy to use with a hand calculator. If you do not have a printing calculator to show that you entered the correct numbers, always do the
calculation twice, preferably with different formulas. If you use a “packaged” computer program for a 2 × 2 table, always try to get the results of the Fisher Test as well as chi-square.
The main problems in the computer programs are that a confusing printout may appear for the Fisher Test and that a deluge of additional indexes may appear as unsolicited (or requested) options. Because 2 × 2 tables are so common, their analytic challenges have evoked a plethora of statistical indexes, which can appear in the printouts. Many of the additional expressions are used so rarely today that they are essentially obsolete. Other indexes were developed for tables larger than 2 × 2, but are available if someone thinks they might be “nice to know about” in the 2 × 2 situation. Still other indexes represent alternative (but seldom used) approaches to evaluating stochastic significance.
The diverse additional indexes are briefly mentioned here so that you need not be intimidated by their unfamiliarity when deciding to ignore them.
14.8.1 Sources of Confusion
The Fisher Test results can be presented in a confusing manner. In one packaged program (SAS system), the results are called “left,” “right,” and “2-tail.” The sum of the “left” and “right” probability values, however, exceeds 1 because the original (observed) table was included in the calculation for each side. This confusion is avoided in the BMDP system.
The BMDP system, however, offers results for a chi-square test of “row relative symmetry” and “col relative symmetry.” The calculation and meaning of these results, however, is not adequately described in the BMDP manual. Communication with the BMDP staff indicates that the test is used to assess results in 2 × 2 tables that are agreement matrixes (see Chapter 20). For most practical purposes, and even for agreement matrixes, the “symmetry” chi-square tests can be ignored.
14.8.2 Additional Stochastic Indexes
Beyond the results for Fisher exact test and for X2 with and without the Yates correction, X2 can be presented with two other variations. One of them, called Mantel–Haenszel X2, is calculated with N − 1 replacing N in the customary formula of [(ad − bc)2N]/(n1n2f1f2) for a 2 × 2 table. The reason for the change, discussed in Chapter 26, is that Mantel and Haenszel calculate variance with a finite-sampling correction.
The other stochastic test, sometimes called the log likelihood chi-square, was first proposed by Woolf.13 It is usually symbolized as G, and calculated as 2Σ[oi × ln(oi/ei)]. In this formula, the symbols oi and ei refer respectively to the observed and expected values in each of the four cells, and ln is the natural logarithm of the cited values. G is interpreted as though it were an X2 value, which it closely approximates. For a 2 × 2 table having the conventional format shown in Section 14.2.4, an easy calculational formula is G = 2[a(ln a) + b(ln b) + c(ln c) + d(ln d) + N(ln N) − n1(ln n1) − n2(ln n2) − f1(ln f1) − f2(ln f2)]. When applied to the data in Table 14.1, this calculation produces G = 2.81, whereas the previous X2 was 2.73.
Although some statistical writers14,15 believe that G is routinely preferable to X2, both of the extra approaches just cited for X2 add little beyond the customary stochastic tests used for 2 × 2 tables.
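As a sketch (with made-up cell values, not the data of the text's Table 14.1), G can be computed either from the cell-by-cell definition or from the shortcut formula, and the two agree:

```python
from math import log

# Hypothetical 2x2 table; a, b, c, d are illustrative values only
a, b, c, d = 30, 20, 20, 30
n1, n2 = a + b, c + d            # row totals
f1, f2 = a + c, b + d            # column totals
N = a + b + c + d

# Definition: G = 2 * sum(o * ln(o/e)), with e = (row total)(column total)/N
cells = [(a, n1, f1), (b, n1, f2), (c, n2, f1), (d, n2, f2)]
G = 2 * sum(o * log(o / (r * col / N)) for o, r, col in cells)

# Shortcut formula cited in the text
G_shortcut = 2 * (a * log(a) + b * log(b) + c * log(c) + d * log(d)
                  + N * log(N) - n1 * log(n1) - n2 * log(n2)
                  - f1 * log(f1) - f2 * log(f2))
```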
14.8.3 Additional Descriptive Indexes
As noted in subsequent chapters, a 2 × 2 table can also be analyzed with diverse indexes of an association between two variables. The indexes are mentioned here as a brief introduction before you meet them, perhaps uninvitedly, in computer printouts. Although sometimes valuable for larger tables (as discussed in Chapter 27), the indexes of association for 2 × 2 tables usually add nothing beyond confusion to what is already well conveyed by the increments and ratios (see Chapter 10) used as indexes of contrast.
14.8.3.1 φ Coefficient of “Correlation” — The φ index, calculated as √(X2/N), is the “correlation coefficient” for a 2 × 2 table. φ is mentioned later in Chapter 20 as a possible (but unsatisfactory) index of concordance and has a useful role, discussed in Chapter 27, in indexing associations for larger tables.
14.8.3.2 Obsolete Indexes of 2 × 2 Association — The five additional indexes cited in this section were once fashionable, but almost never appear in medical literature today. The contingency coefficient, C, is calculated as √[X2/(N + X2)]. Cramér’s V, which is a transformation of X2, reduces to φ for a 2 × 2 table. In a table having the form

a  b
c  d

Yule’s Q is (ad − bc)/(ad + bc). Yule also proposed an index of colligation, Y = (√(ad) − √(bc))/(√(ad) + √(bc)). The tetrachoric correlation is a complex coefficient involving a correlation of two bivariate Gaussian distributions derived from the probabilities noted in the 2 × 2 table.
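These indexes are simple to compute. A hedged sketch follows, with illustrative cell values; the uncorrected shortcut X2 formula for a 2 × 2 table (without the Yates correction) is assumed, and the function name is not from the text:

```python
import math

def association_indexes(a, b, c, d):
    """phi, contingency coefficient C, Yule's Q, and Yule's index of
    colligation Y, for a 2 x 2 table with cells a, b / c, d."""
    n1, n2 = a + b, c + d              # row totals
    f1, f2 = a + c, b + d              # column totals
    N = a + b + c + d
    # uncorrected Pearson chi-square, shortcut formula for a 2 x 2 table
    x2 = N * (a * d - b * c) ** 2 / (n1 * n2 * f1 * f2)
    phi = math.sqrt(x2 / N)            # phi = sqrt(X2/N)
    C = math.sqrt(x2 / (N + x2))       # contingency coefficient
    Q = (a * d - b * c) / (a * d + b * c)
    Y = ((math.sqrt(a * d) - math.sqrt(b * c))
         / (math.sqrt(a * d) + math.sqrt(b * c)))
    return phi, C, Q, Y
```

A useful algebraic check is the identity Q = 2Y/(1 + Y²), which holds for any 2 × 2 table; note also that C is always smaller than φ, since N + X2 exceeds N.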
14.8.3.3 Indexes of Association in Larger Tables — Several other indexes of association, developed for application in tables larger than 2 × 2, sometimes seek prestige and fame by appearing in computer printouts of analyses for 2 × 2 tables. Those indexes, discussed later in Chapter 27, are
Gamma, Kendall’s tau-b, Stuart’s tau-c, Somers’ D, and several variations of lambda and of an “uncertainty coefficient.” The results of all these indexes can safely be disregarded for 2 × 2 tables.
14.8.3.4 Indexes of Concordance — Tests of concordance, as discussed later in Chapter 20, can be done when each variable in a two-way table offers a measurement of the same entity. The agreement (or disagreement) in these measurements is expressed with indexes such as McNemar symmetry and Kappa. They would pertain only to the special type of “agreement matrix” discussed in Chapter 20, not to an ordinary two-way table.
References
1. Abbe, 1906.
2. Wallis, 1956, p. 435.
3. Yates, 1934.
4. Cochran, 1954.
5. Gart, 1971.
6. Yates, 1984.
7. Cormack, 1991.
8. Fleiss, 1986.
9. Bradley, 1968.
10. Kolmogorov, 1941.
11. Smirnov, 1948.
12. Miettinen, 1976.
13. Woolf, 1957.
14. Everitt, 1977.
15. Williams, 1976.
16. Norwegian Multicenter Study Group, 1981.
17. Petersen, 1980.
18. Steering Committee of the Physicians’ Health Study Research Group, 1989.
Exercises
14.1. In a randomized clinical trial of timolol vs. placebo in patients surviving acute myocardial infarction,16 the total deaths at an average of 17 months after treatment were reported as 98/945 in the timolol group and 152/939 in the placebo group. The authors claimed that this difference was both
quantitatively and statistically significant. What calculations would you perform to check each of these claims? What are your conclusions?
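One hedged sketch of calculations that could be used for such a check (the function name is illustrative, and the uncorrected shortcut X2 formula for a 2 × 2 table is assumed):

```python
def contrast_and_x2(a, n1, c, n2):
    """Indexes of contrast (increment and risk ratio) plus the uncorrected
    X2 for two proportions a/n1 vs. c/n2, as in a two-group trial."""
    b, d = n1 - a, n2 - c              # non-events in each group
    p1, p2 = a / n1, c / n2
    increment = p1 - p2                # direct quantitative contrast
    ratio = p1 / p2                    # risk ratio
    N = n1 + n2
    f1, f2 = a + c, b + d              # column (outcome) totals
    x2 = N * (a * d - b * c) ** 2 / (n1 * n2 * f1 * f2)
    return increment, ratio, x2
```

The increment and ratio address the claim of quantitative significance; comparing X2 against 3.84 (two-tailed α = .05, one degree of freedom) addresses the stochastic claim.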
14.2. Here is the verbatim summary of a clinical report:17 “Twelve patients with chronic mucocutaneous candidiasis were assigned by random allocation to a 6-month course of treatment with Ketoconazole or placebo in a double-blind trial. All six recipients of Ketoconazole had remission of symptoms and virtually complete regression of mucosal, skin, and nail lesions, whereas only two of the six receiving placebo had even temporary mucosal clearing, and none had improvement of skin or nail disease. The clinical outcome in the Ketoconazole-treated group was significantly more favorable (p = 0.001) than in the placebo-treated group.”
What procedure(s) do you think the authors used to arrive at the statistical inference contained in the last sentence of this summary? How would you check your guess?
14.3. In Mendelian genetics theory, the crossing of a tall race of peas with a dwarf race should yield a ratio of 3:1 for tall vs. dwarf peas in the second generation. In his original experiment, J. G. Mendel got 787 tall and 277 dwarf plants in a total of 1064 crosses. Is this result consistent with the 3:1 hypothesis? If so, why? If not, why not?
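The standard X2 goodness-of-fit arithmetic for checking an observed split against any hypothesized ratio can be sketched as follows (the function name and layout are illustrative, not from the text):

```python
def x2_goodness_of_fit(observed, ratio):
    """Pearson X2 for observed category counts against a hypothesized
    ratio, e.g. a 3:1 Mendelian split into tall vs. dwarf."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

With k categories, the result is interpreted against a chi-square distribution with k − 1 degrees of freedom.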
14.4. In the famous Physicians’ Health Study,18 the investigators reported a “44% reduction in risk” of myocardial infarction for recipients of aspirin vs. placebo. The results had “P < .00001” for 139
myocardial infarctions among 11,037 physicians in the aspirin group, compared with 239 infarctions among 11,034 physicians assigned to placebo. For purposes of this exercise, assume that the “attack rate” in the placebo group was previously estimated to be 2%.
14.4.1. What feature of the reported results suggests that the sample size in this trial was much larger than needed?
14.4.2. Before the trial, what sample size would have been calculated to prove stochastic significance if the estimated attack rate were reduced by 50% in the aspirin group?
14.4.3. The authors report that they enrolled “22,071” physicians, who were then divided almost equally by the randomization. How do you account for the difference between this number and what you calculated in 14.4.2?
14.5. Exercise 10.6 contained the statement that a comparison of two “baseball batting averages, .333 vs. .332, can become stochastically significant if the group sizes are large enough.” What should those group sizes be to attain this goal? Do a quick mental estimate and also a formal calculation.
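For formal sample-size calculations of this kind, one common single-Z sketch of the required group size for comparing two proportions is shown below. The assumptions are labeled plainly: equal group sizes, a standard error based on the common proportion, and provision only for Zα (no β-error allowance), so this gives the size at which the observed difference would just reach the stochastic border:

```python
import math

def n_per_group(p1, p2, z_alpha=1.96):
    """Group size at which Z = |p1 - p2| / SE just reaches z_alpha,
    with SE = sqrt(2 * pbar * qbar / n) under equal group sizes
    and pbar the mean of the two proportions (no beta-error provision)."""
    pbar = (p1 + p2) / 2
    qbar = 1 - pbar
    delta = abs(p1 - p2)
    return math.ceil(2 * pbar * qbar * (z_alpha / delta) ** 2)
```

Because n grows with the inverse square of the increment δ, a tiny difference such as .333 vs. .332 demands enormous groups, while a modest difference in proportions needs only a few hundred per group.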