two-tailed test. (All these directional decisions should be made before the value of t is calculated, and preferably before any of the statistical analysis begins.)
5. Calculate t as:

t = (XA − XB)/[sp √((1/nA) + (1/nB))]
6. Using nA + nB − 2 degrees of freedom, look up the corresponding P value in Table 7.3 or in a more complete table.
7. Using the preselected value of α, decide whether the P value allows you to reject the null hypothesis and to proclaim stochastic significance.
13.3.3.2 Getting a Confidence Interval — If you prefer to use the confidence interval technique for checking stability, the steps are as follows:
1. Carry out the previous steps 1 to 4 to find the estimated standard error of the difference, sp √((1/nA) + (1/nB)).
2. Select the 1 − α level of confidence and find the value of tν,α that is associated with this value of α, with ν = nA + nB − 2.
3. Multiply the selected t value by sp √((1/nA) + (1/nB)).
4. Obtain the 1 − α confidence interval by adding and subtracting the result of step 3 to the value of XA − XB.
5. Concede or reject the null hypothesis according to whether this confidence interval includes or excludes the value of 0.
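The t-test steps and their confidence-interval counterpart can be sketched in a few lines of code. This is an illustrative sketch only: the two groups of dimensional data are invented, and scipy's t-distribution routines stand in for the printed table.

```python
# Illustrative sketch of the pooled two-group t test (steps 5-7) and the
# matching confidence interval. The data are invented; scipy replaces the
# printed t table.
from math import sqrt
from statistics import mean, stdev
from scipy import stats

group_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7]
group_b = [4.2, 4.9, 4.4, 5.0, 4.1, 4.6]
n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2                      # degrees of freedom (step 6)

# Pooled variance and estimated standard error of the difference
sp2 = ((n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2) / df
se_diff = sqrt(sp2) * sqrt(1 / n_a + 1 / n_b)

d_o = mean(group_a) - mean(group_b)     # observed increment
t = d_o / se_diff                       # step 5: the critical ratio
p = 2 * stats.t.sf(abs(t), df)          # step 6: exact two-tailed P value

# Confidence-interval counterpart: d_o ± t(ν,α) × SE, with α = .05
t_crit = stats.t.ppf(0.975, df)
ci = (d_o - t_crit * se_diff, d_o + t_crit * se_diff)

# Cross-check against the packaged routine
t_pkg, p_pkg = stats.ttest_ind(group_a, group_b)
```

The manual calculation and the packaged routine should agree to machine precision, and the interval excludes 0 exactly when P falls below the chosen α.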
After you have finished the procedure, particularly if the conclusion is what you had hoped for, obtain a Guinness product and drink a toast to the memory of William S. Gosset. If you prefer, get a German beer and say “prosit” for Carl F. Gauss.
13.3.4 Electronic Computation
With modern electronic computation, you may never actually have to do the calculations of t or Z. If properly commanded, an appropriate program will do the calculations and promptly give you the results.
You may also never have to use tabular arrays, such as Table 7.3, either to find the associated P values or to guess about interpolated results when the calculated t or Z lies between the values tabulated at P levels of .2, .1, .05, .025, .01, etc. Many “packaged” computer programs today will use the mathematical formulas of standard t and Gaussian distributions to obtain the exact probabilities associated with any value of t and Z. The printout will show you both the t or Z value and a precisely calculated external probability such as P = .063.
The only potential problem in the printouts is that they may not show the actual magnitude of very tiny P values. If P is below .00005, the result may be printed as P = .0000. At even tinier P values, below 1 × 10−16, the printout may show P = •. Unless you know about these symbolic conventions, the printout may initially be confusing.
The calculational formulas are necessary if you have to do the computations yourself, if you want to check whether an electronic program has done them correctly, or if you occasionally do a mental screening of presented results. You also need to know about the underlying strategy when you decide about 1- or 2-tailed interpretations of P.
13.3.5 Robustness of Z/t Tests
A statistical test is called robust if it is “insensitive” to violations of the assumptions on which the test is based. For example, because the means of small-sized groups do not have a Gaussian distribution,
© 2002 by Chapman & Hall/CRC
the Z procedure is not robust and will give inaccurate results at small group sizes. The t distribution is therefore used for small groups when P values are interpreted for the “critical ratio” in Formula [13.7].
For the two-group t or Z procedures, the underlying assumption that is most likely to be violated is the idea that the observed variances, sA2 and sB2 , come from the same parametric population, with variance σ 2, estimated by the pooled variance, sp2 . If sA2 and sB2 , however, are highly heteroscedastic (which is a splendid word for having dissimilar variances), the invalid assumption might lead to defective values for the critical ratio and its interpretation. To avoid this potential problem, some statisticians will do a “screening” check that the ratio of variances, sA2 /sB2 , does not greatly depart from 1. This screening is seldom used, however, because most of the concerns were eliminated after a classical study by Boneau.1 He did a series of Monte Carlo samplings showing that the t test was “remarkably robust” except in the uncommon situation where both the group sizes and the variances in the two groups are substantially unequal. In the latter situation, the best approach is to use a Pitman-Welch test.
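A Monte Carlo check in the spirit of Boneau's samplings can be imitated with a short simulation. The group sizes, σ values, and trial counts below are arbitrary choices for illustration; under a true null hypothesis, the pooled t test stays near its nominal α = .05 with equal group sizes but rejects far too often when the smaller group also has the larger variance.

```python
# Monte Carlo sketch: under a true null hypothesis (both groups drawn from
# distributions with the same mean), how often does the pooled t test
# reject at nominal alpha = .05? All parameters are arbitrary illustrations.
import random
from math import sqrt
from statistics import mean, stdev
from scipy import stats

def pooled_t_p(a, b):
    """Two-tailed P value from the pooled two-group t test."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    sp2 = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / df
    t = (mean(a) - mean(b)) / (sqrt(sp2) * sqrt(1 / na + 1 / nb))
    return 2 * stats.t.sf(abs(t), df)

def rejection_rate(na, nb, sigma_a, sigma_b, trials=3000, seed=1):
    rng = random.Random(seed)
    hits = sum(
        pooled_t_p([rng.gauss(0, sigma_a) for _ in range(na)],
                   [rng.gauss(0, sigma_b) for _ in range(nb)]) < 0.05
        for _ in range(trials))
    return hits / trials

# Heteroscedastic but equal sizes: close to the nominal .05 ("robust")
equal_sizes = rejection_rate(15, 15, 1.0, 3.0)
# Both sizes and variances unequal: the false-positive rate inflates badly
unequal = rejection_rate(5, 45, 3.0, 1.0)
```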
13.3.6 Similarity of F Test
The published literature sometimes contains a report in which the means of the two groups have been stochastically contrasted with an F test. The F test strategy, which is customarily used in a procedure called analysis of variance (discussed later in Chapter 29), was originally developed for contrasting the means of three or more groups, rather than two. For two-group contrasts, however, the calculated value of F is the exact square of the critical ratio used for t, and the squared ratio is interpreted with an F, rather than t, distribution. Nevertheless, the P values that emerge from the F test in a two-group contrast are identical to those obtained with a t test.
The F test is also used when heteroscedasticity is checked for the ratio of the two variances, sA2 and sB2 .
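The two-group F/t equivalence is easy to verify numerically. The data values below are invented; scipy's packaged routines supply both tests.

```python
# Numerical check of the two-group F/t relation: F equals the square of t,
# and the two P values coincide. The data values are invented.
from scipy import stats

a = [12.1, 13.4, 11.8, 12.9, 13.0]
b = [10.9, 11.5, 12.2, 11.0, 11.7]

t_res = stats.ttest_ind(a, b)       # pooled two-group t test
f_res = stats.f_oneway(a, b)        # one-way analysis of variance
```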
13.4 Components of “Stochastic Significance”
Formula [13.7] indicates that the stochastic calculation of t or Z is a product of several components, which produce the conversion of quantitative into stochastic indexes. One of these components is the standardized increment, which you may recall from Section 10.5.1. The other two components reflect the size and partition of the total group under examination. The diverse components become apparent if we rewrite Formula [13.7] as
t (or Z) = [(XA − XB)/sp] × [1/√((1/nA) + (1/nB))]   [13.10]
13.4.1 Role of Standardized Increment
In Chapter 10, the standardized increment was briefly introduced as an index of contrast to describe “effect size” in a comparison of two central indexes. With sp as a common standard deviation for two
groups of dimensional data, having means XA and XB , the standardized increment for the contrast is
(XA − XB)/sp
The standardized increment is a descriptive index of contrast because it involves no act of inference (other than the use of degrees of freedom in the calculation of sp). The sp factor, however, makes the standardized increment an index of contrast that uniquely provides for variability in the data. This provision is absent in all of the other indexes of contrast discussed in Chapter 10.
The provision for variability gives the standardized increment another vital role (noted earlier) as the bridge that connects descriptive and inferential statistics for a contrast of two groups. The t or Z (and also later the chi-square) test statistics are all constructed when the standardized increment is multiplied
by factors indicating the magnitude (and partition) of group sizes. The stochastic statistics of t, Z, and chi-square are thus a product of factors for the standardized increment and group sizes.
13.4.2 Role of Group Sizes
The group sizes are reflected by 1/√((1/nA) + (1/nB)), the other factor in Formula [13.10]. Because (1/nA) + (1/nB) = N/(nA nB), division by this factor can be rewritten as the multiplicative entity √(nA nB/N), which is the group size factor. Thus,
t (or Z) = Standardized Increment × Group Size Factor   [13.11]
The group size factor is decomposed into two parts when the total group, N, is divided into the two groups, nA and nB. If we let k = nA/N, then nA = kN and nB = N − nA = (1 − k)N, and the group size factor will become √(kN(1 − k)N/N) = √(k(1 − k)) √N.
The value of (k)(1 − k) will depend on the partition of nA and nB. If the group sizes are equal, nA = nB = N/2 and (k)(1 − k) = 1/4, so that √((k)(1 − k)) = .5. As the partition becomes more unequal, √((k)(1 − k)) will become smaller. Thus, if nA = 20 and nB = 80, √((.2)(.8)) = .4. If nA = 10 and nB = 190, √((.05)(.95)) = .22. If we call √((k)(1 − k)) the group partition factor, the stochastic test result becomes a product of three factors:
t (or Z) = Standardized Increment × Group Partition Factor × Total Size Factor

or

t (or Z) = [(XA − XB)/sp] × √(k(1 − k)) × √N   [13.12]
Formula [13.12] offers a striking demonstration of the role of group sizes in achieving stochastic significance. The values of t (or Z) will always increase with larger groups, because √N is a direct multiplicative factor. The values will also increase with equally partitioned groups, because √(k(1 − k)) achieves its maximum value of .5 when nA = nB. The values get smaller as the proportion for k departs from the meridian of .5 toward values of 1 or 0.
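A quick numerical check of the three-factor decomposition, using invented data: the product of the standardized increment, the group partition factor, and √N reproduces the directly calculated critical ratio.

```python
# Check of Formula [13.12] on invented data: the critical ratio equals
# (standardized increment) x sqrt(k(1-k)) x sqrt(N).
from math import sqrt
from statistics import mean, stdev

a = [7.2, 8.1, 7.7, 8.4, 7.9, 8.8]      # n_A = 6
b = [6.9, 7.3, 7.0, 7.8]                # n_B = 4
na, nb = len(a), len(b)
N = na + nb
k = na / N                              # partition proportion

sp = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (N - 2))
std_increment = (mean(a) - mean(b)) / sp

t_factored = std_increment * sqrt(k * (1 - k)) * sqrt(N)
t_direct = (mean(a) - mean(b)) / (sp * sqrt(1 / na + 1 / nb))
```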
13.4.3 Effects of Total Size
Of the various components of Formula [13.12], the most dramatic is the total group size, N. No matter how small the magnitudes may be for the standardized increment and the group partition factor, a sufficiently large value of √N can multiply their product into a value that exceeds the Zα (or tν,α) needed for stochastic significance. Conversely, even if the standardized increment is quantitatively impressive and the group partition factor is a perfect .5, the result for the test statistic will still fail to achieve stochastic significance if √N is too small.
For example, suppose the standardized increment is an extraordinarily low value of .001 and the group partition factor is a greatly unbalanced .04. To exceed the value of Zα = 2 for stochastic significance, all we need is

2 ≤ (.001)(.04)√N

and so

√N ≥ 2/[(.001)(.04)] = 50,000
The necessary value of N will be the square of this value, which is 2.5 × 109 or 2.5 billion. Archimedes is said to have remarked, “If you give me a lever long enough and strong enough, I can move the earth.” If the total group size is large enough, any observed distinction can be stochastically significant.
Conversely, suppose the standardized increment is an impressively high value of .75 and the group partition factor is a splendid 0.5. The total product will fail to exceed a value of Z = 2 if

2 ≥ (.75)(.5)√N

which occurs when

√N ≤ 2/[(.75)(.5)] = 5.33

This situation would arise with a total group size of (5.33)2 = 28.4 or anything smaller than 28.
13.4.4 Simplified Formulas for Calculation or Interpretation
Formula [13.10] can be converted to

t (or Z) = [(XA − XB)/sp] × √(nA nB/N)   [13.13]

In addition to its use in computing the value of t or Z, the formula offers a way of promptly interpreting both quantitative and stochastic significance. The standardized increment, (XA − XB)/sp, can be used as a guide to interpreting quantitative significance. The group size factor, √(nA nB/N), will indicate the role played by the numerical sizes of the groups in converting the standardized increment to stochastic significance.
13.4.4.1 Equal Group Sizes — When the groups have equal sizes, so that nA = nB = n = N/2, Formula [13.13] can receive two simplifications. The factor √(nA nB/N) will become √((n)(n)/(2n)) = √(n/2). The value of sp2 earlier in Formula [13.3] will become

[(n − 1)sA2 + (n − 1)sB2]/[2(n − 1)] = (sA2 + sB2)/2

so that the values of sC and sC′ are identical. Formula [13.13] will then become

t (or Z) = [(XA − XB)/√((sA2 + sB2)/2)] × √(n/2)

which can be further reduced to

t (or Z) = [(XA − XB)/√(sA2 + sB2)] × √n   [13.14]

(Remember that n here is the size of one group.)
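The equal-size shortcut can be checked against the full pooled computation. The data below are invented; with equal group sizes the two results coincide exactly.

```python
# Equal-size shortcut of Formula [13.14] on invented data: with n members
# per group, t = (mean_a - mean_b) * sqrt(n) / sqrt(s_a^2 + s_b^2), which
# matches the full pooled computation exactly.
from math import sqrt
from statistics import mean, stdev
from scipy import stats

a = [15.2, 14.8, 16.1, 15.5, 14.9, 15.8]
b = [14.1, 14.6, 13.9, 14.8, 14.2, 14.4]
n = len(a)                              # size of ONE group

t_short = (mean(a) - mean(b)) * sqrt(n) / sqrt(stdev(a) ** 2 + stdev(b) ** 2)
t_full = stats.ttest_ind(a, b).statistic
```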
13.4.4.2 Use of Cited Standard Errors — In many instances, the investigators report the results of the two means, the two group sizes, and the two standard errors as SEA and SEB. If you want to check the claims for t (or Z), the SE values would have to be converted to standard deviations as √nA(SEA) and √nB(SEB). To avoid this work, you can use the “alternative-hypothesis” Formula [13.8] and estimate sC2 directly as (SEA)2 + (SEB)2. Because the group size factor is incorporated when standard errors are determined, as √(sA2/nA + sB2/nB), Formula [13.8] becomes simply

t (or Z) = (XA − XB)/√((SEA)2 + (SEB)2)   [13.15]

This formula will be slightly inaccurate, because the variance was not pooled, but the results are usually close enough for a quick “screening” check of reported values.

If the groups have equal size, Formulas [13.14] and [13.15] produce identical results, because the denominator of [13.15] becomes √((sA2 + sB2)/n).
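As a sketch of this “screening” check, suppose a report cites only the two means and their standard errors (the summary numbers below are hypothetical):

```python
# Quick screening check per Formula [13.15], using hypothetical reported
# summary values (two means and their standard errors).
from math import sqrt

mean_a, se_a = 24.6, 1.1
mean_b, se_b = 21.3, 0.9

z_screen = (mean_a - mean_b) / sqrt(se_a ** 2 + se_b ** 2)
# z_screen is about 2.32, just past the usual boundary of 2
```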
13.5 Dissidence and Concordance in Significance
The formulas cited in Sections 13.4.1 through 13.4.4 indicate why separate evaluations are needed for quantitative and stochastic significance, and why the two evaluations may sometimes disagree. Because of small group size, a quantitatively significant difference may fail to achieve stochastic significance. Because of huge group size, a quantitatively trivial difference may become stochastically significant.
The distinctions also help explain some of the problems we encountered in Chapter 10 when trying to choose the magnitude of a quantitatively significant increment. Suppose we decide that we want concordant results for both stochastic and quantitative significance. Because Z ≥ 2 (or 1.96) is the minimum boundary for stochastic significance at two-tailed α = .05, the magnitude needed for a quantitatively significant increment will vary with the group size. If the total group size is 40, equally divided into two groups of 20, the formula will be

2 ≤ (S.I.)(.5)√40

and so the required standardized increment (S.I.) must exceed 2/[(.5)√40] = .63. If the sample size is larger, perhaps 100, and still equally divided, stochastic significance will occur when the standardized increment exceeds 2/[(.5)√100] = .40. With a larger, equally divided total group of 300, the required standardized increment must exceed 2/[(.5)√300] = .23.
Unless we carefully separate what is quantitatively important from what is stochastically significant, decisions about quantitative importance will be dominated by mathematical calculations of probability. The scientific boundaries set for magnitudes that are quantitatively important will frequently have to change—as noted earlier—according to the clinical, populational, or other biologic implications of the observed results. A much less scientifically acceptable source of varying boundaries, however, is the stochastic caprice caused by changes in group size.
In a particular biologic setting, a standardized increment of .4 either is or is not quantitatively significant; and it may or may not be derived from numbers large enough to be stochastically significant. The decision about quantitative significance, however, should depend on its magnitude and connotations in a biologic setting, not on the size of the groups under study. This principle will be further discussed and illustrated with many examples in Chapter 14.
13.6 Controversy about P Values vs. Confidence Intervals
During the past few years, the stochastic dominance of P values has been threatened by advocates of confidence intervals; and a lively dispute has developed between the “confidence couriers” and the “P-value partisans.” The advocates of confidence intervals want to supplement or replace P values because the single value of P does not give the scope of information offered by a confidence interval. The defenders want the P value preserved because it indicates the important distance from the stochastic boundary of α. Many highly respected statisticians and statistical epidemiologists have lined up on both sides of the controversy, urging either the use of confidence intervals or the retention of P values.
The confidence couriers seem to be winning the battle in prominent clinical journals. After presenting “tutorial” discussions to educate readers about the use of confidence intervals, the editors proudly announce their enlightened new policies demanding confidence intervals instead of (or in addition to) P values. The enlightened new policies are presumably expected to bring medical literature out of the statistical “dark ages” induced by previous editorial policies that set the demand for P values.
13.6.1 Fundamental Problems in Both Methods
An even-handed reaction to this controversy would be to agree with both sides, since both sets of contentions are correct and both sets of approaches are useful. In fact, as suggested by some of the disputing participants, truly enlightened editors could ask for both types of information: confidence intervals (to show the potential range of results) and P values (to keep authors and readers from being infatuated or deceived by an arbitrarily chosen interval).
Regardless of the opposing viewpoints, however, the argument itself resembles a debate about whether to give an inadequate medication orally or parenterally.2 The argument ignores the reciprocal similarity of P values and confidence intervals, while also overlooking four fundamental problems: (1) the absence of clearly stated goals for the statistical procedures, (2) the arbitrariness of the α and Zα boundaries that determine both the interpretation of P values and the magnitude of confidence intervals, (3) the continuing absence of standards for descriptive boundaries of δ for “big” and ζ for “small” distinctions in quantitative significance, and (4) the distraction of both P values and confidence intervals from the challenge of setting scientific standards for those quantitative boundaries.
13.6.1.1 Goals of the Statistical Procedures — What do investigators (and readers) really want in a stochastic test? What questions would they really ask if the answers, somewhat in the format of a game of Jeopardy, were not previously supplied by P values and confidence intervals?
Probably the main questions would refer to the stability of the results and their range of potential variation. When two central indexes are contrasted, a first question would ask whether they arise from groups containing enough members to make the contrast stable. Although this question requires a definition or demarcation of stability, the idea of stability is not a part of conventional statistical discourse, and no quantitative boundaries have been considered or established. Consequently, the currently unanswerable question about stability is usually transferred to an answerable question about probability. An α boundary is established for the level of probability, and the results are regarded as “statistically significant” when P ≤ α or a 1 – α confidence interval has suitable borders. The conclusion about statistical significance is then used to infer stability.
If boundaries were directly demarcated for stability, however, questions about a range of potential variation could be referred to those boundaries. For example, an observed distinction that seems to be “big” could be accepted as stable if the potential variations do not make it become “small.” A distinction that is “small” could be accepted as stable if it does not potentially become“big.” For these key decisions, however, investigators would have to set quantitative boundaries for big and small. The boundaries might vary for the context of different comparisons but could be established for appropriate circumstances. If these boundaries were given priority of attention, the investigators could then choose a method—fragility relocations, parametric theory, bootstrap resampling, etc.—for determining a range of potential variation.
Probably the greatest appeal of the purely probabilistic approach is that it avoids the scientific challenge of setting descriptive boundaries. The stochastic α boundary, which refers to probability, not to substantive magnitudes, becomes the basis for evaluating all quantitative distinctions, regardless of their scientific components and contexts. With a generally accepted α level, such as .05, the stochastic decision can promptly be made from a P value or from a 1 − α confidence interval. As noted earlier and later, the .05 boundary may be too inflexible, but it serves the important role of providing a “standard criterion.”
When the range of potential variation is examined with confidence intervals, however, the data analyst may sometimes make an ad hoc “dealer’s choice” for the selected level of α . The extremes of potential variation (denoted by the ends of the constructed confidence intervals) can then be interpreted with individual ad hoc judgments about what is big or small. Without established standard criteria for either a stochastic or quantitative boundary, confidence intervals can then be used, as discussed in Section 13.6.1.3, to achieve almost any conclusion that the data analyst would like to support.
These problems will doubtlessly persist until investigators reappraise the current statistical paradigm for appraising probability rather than stability. If the basic question becomes “What do I want to know?” rather than “What can I do with what is available?” the answer will require new forms of inferential methods aimed at quantitative rather than probabilistic boundaries.
13.6.1.2 Reciprocal Similarity of P Values and Confidence Intervals — When α is used as a stochastic boundary, the reciprocal similarity of P values and confidence intervals can be shown from the construction and “location” of the concomitant stochastic hypothesis. In ordinary stochastic tests of a two-group contrast, the “location” of the tested hypothesis can be cited as ∆. For the conventional null hypothesis, symbolized as H0, the value assigned to ∆ is 0. Thus, in a contrast of two means XA and XB, with parameters µA and µB, the hypothesis is H0: ∆ = µA − µB = 0.
In the conventional 2-group Z test, after a boundary is chosen for α, a “decision interval” for two-tailed P values is constructed around the null hypothesis value of 0, extending from −Zα to +Zα. If do is the observed difference for XA − XB, and if SED is its standard error, the critical ratio for the observed Zo is calculated as Zo = do/(SED). The null hypothesis is conceded if Zo falls inside the interval of 0 ± Zα, and is rejected if Zo is outside. Thus, a two-tailed P value will achieve stochastic significance if |Zo| > Zα.
The confidence interval for comparing two groups is constructed not around ∆ = 0, but around the
observed value of do. After α is chosen and Zα is determined, the 1 – α interval is calculated as do ± Zα (SED). The null hypothesis is conceded if its value of 0 falls inside this interval and is rejected if 0 is outside. Thus, the stochastic requirement for “significance” is |do| − Zα (SED) > 0. When both sides of the latter expression are divided by SED to express the result in standard error units, the symbolic formulation becomes |do /SED| − Zα > 0, which is |Zo| − Zα > 0 or |Zo| > Zα — the same demand as previously.
Because the requirement for stochastic significance is exactly the same whether Zo is converted to a confidence interval or to a P value, confidence intervals are essentially a type of “reciprocal” for P values. For the P value, we calculate |do|/SED = Zo and compare Zo directly with Zα . For the confidence interval, we calculate |do| − Zα (SED) and compare the result against 0.
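This reciprocity is easy to demonstrate numerically. The (do, SED) pairs below are arbitrary; for each, the 1 − α confidence interval excludes 0 exactly when the two-tailed Gaussian P value falls below α.

```python
# Sketch of the reciprocity: the 1 - alpha confidence interval around d_o
# excludes 0 exactly when the two-tailed Gaussian P value is below alpha.
# The (d_o, SED) pairs are arbitrary.
from statistics import NormalDist

def ci_excludes_zero(d_o, sed, alpha=0.05):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(d_o) - z_alpha * sed > 0

def p_two_tailed(d_o, sed):
    return 2 * (1 - NormalDist().cdf(abs(d_o) / sed))

for d_o, sed in [(8.0, 4.0), (8.0, 3.5), (1.0, 2.0)]:
    assert ci_excludes_zero(d_o, sed) == (p_two_tailed(d_o, sed) < 0.05)
```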
13.6.1.3 Scientific Standards for Descriptive Boundaries — When extending from |do| −
Zα (SED) to |do| + Zα (SED), the confidence interval demonstrates how large |do| might really be. Beyond their stochastic roles, the two ends of the confidence interval might therefore make an extra contribution
to quantitative decisions about importance or “clinical” significance for the observed results. During most of the latter half of the 20th century, however, this type of quantitative significance has been restricted to stochastic rather than substantive decisions. When scientists begin to exhume or reclaim significance from its mathematical abduction, they will realize, as discussed in Chapter 10, that quantitative decisions depend on purely descriptive rather than inferential boundaries. Therefore, a descriptive
scientific question cannot be suitably answered when the product of an inferential choice, Zα , multiplies an inferential estimate of SED, to form an arbitrary mathematical entity, Z α (SED). Aside from obvious problems when the Gaussian Zα and SED do not adequately represent “eccentric” data, the doubly inferential product gives no attention to the investigator’s descriptive concepts and goals.
As crucial scientific boundaries for quantitatively significant and insignificant distinctions, respectively, δ and ζ must be chosen with criteria of scientific comparison, not mathematical inference.
Consequently, the stochastic intervals constructed with Zα and SED calculations do not address the fundamental substantive issues, and attention to the stochastic results may disguise the absence of truly
scientific standards for the quantitative boundaries.
13.6.1.4 Potential Deception of Confidence Intervals — One immediate problem produced by the absence of standards is that an investigator who wants to advocate a “big” difference may focus on the upper end of the confidence interval, ignoring the lower end, which might extend below the null hypothesis level of 0 (or 1, for a ratio). Although the observed distinction would not be stochastically significant and might be dismissed as too unstable for serious attention, the investigator may nevertheless claim stochastic support for the “big” difference. This type of abuse evoked a protest by Fleiss,3 who said the problem would be prevented if P values were always demanded. If both ends of the confidence interval are always reported, however, and if the lower end goes beneath the null-hypothesis boundary, readers of the published results can readily discern the potential deception of a unilateral focus.
A more substantial deception, against which the reader has no overt protection, occurs when the confidence interval itself has been “rigged” to give the result desired by the investigator. This type of “rigging” is readily achieved if α is abandoned as a stochastic criterion, so that a fixed level of α is no longer demanded for a 1 − α confidence interval. If nonspecific “confidence intervals” can be formed arbitrarily, a zealous analyst can choose whatever level of α is needed to make the result as large or small as desired.
To illustrate this point, suppose the observed increment is do = 8.0, with a standard error of SED = 4.0. To make the confidence interval exclude 0, we can choose a relatively large α such as .1. With Z.1 = 1.645, the component for the two-tailed 90% confidence interval will be 1.645 × 4.0 = 6.58, and the interval, constructed as 8.0 ± 6.58, will extend from 1.42 to 14.58, thereby excluding 0. To make the interval include 0, however, we can choose a smaller α, such as .02. With Z.02 = 2.32, the two-tailed 98% component will be 2.32 × 4 = 9.28. Constructed as 8.0 ± 9.28, the confidence interval will include 0 by extending from −1.28 to 17.28. To give the interval an unimpressive upper end, let α = .5. The 50% confidence interval, constructed with Z.5 = .674, will now be 8.0 ± 2.7, extending only to 10.7. To give the interval a much higher upper end, let α = .001. The 99.9% confidence interval, with Z.001 = 3.29, will then be 8.0 ± 13.16, and its upper end will exceed 21.
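The arithmetic of this “rigging” can be reproduced directly. The Z values below come from the Gaussian inverse cumulative function rather than a table, so they may differ from the text's in the third decimal.

```python
# The "rigging" example above, reproduced: with d_o = 8.0 and SED = 4.0,
# different choices of alpha stretch or shrink the interval at will.
from statistics import NormalDist

d_o, sed = 8.0, 4.0
intervals = {}
for alpha in (0.1, 0.02, 0.5, 0.001):
    z = NormalDist().inv_cdf(1 - alpha / 2)
    intervals[alpha] = (d_o - z * sed, d_o + z * sed)

# 90% excludes 0; 98% includes 0; 50% has a modest upper end; 99.9% soars
```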
As a customary standard for stochastic decisions, α = .05 may have all the invidious features of any arbitrary choice, but it has the laudable virtue of being a generally accepted criterion that cannot be altered whenever, and in whatever direction, the data analyst would like. Unless the proponents of confidence intervals agree on using a fixed (although suitably flexible) boundary for α , stochastic decisions will become an idiosyncratic subjective process, based on the analyst’s personal ad hoc goals for each occasion.
Until editors and investigators reach a consensus on this matter, you should be wary of any confidence interval that is reported as merely a “confidence interval.” If not accompanied by an indication of 1 – α (such as 95%, 90%, 50%), the uninformative result may be deceptive. If the 1 − α level is reported but departs from the customary 95%, a suitable (and convincing) explanation should be provided for the choice.
13.6.2 Coefficient of Stability
A commonsense approach, which avoids all of the mathematical manipulations, is to make up your own mind by examining the ratio, SED/do, which is the two-group counterpart of the coefficient of stability for the central index of one group. In previous discussions, the coefficient for one group seemed “stable” if its value was ≤ .1. For the greater variability that might be expected in two groups, this criterion level might be raised to a higher value such as .3, .4, or .5. The criterion can then be used to decide whether the increment (or other distinction) in central indexes for the two compared groups is relatively stable or unstable. As you evaluate the ratio, SED/do, for the coefficient of stability in two groups, however, remember that its inverse, do/SED, produces the value of Zo used for determining P values, and that P < .05 when Zo ≥ 2. If you feel assured by α = .05, then “stability” seems likely when SED/do is ≤ .5. If you want a much smaller SED/do ratio to persuade you of stability, the corresponding Zo values will be much larger than 2, and the corresponding P values will become .01 or .001. In the example just described, with do = 8 and SED = 4, the coefficient of potential variation is .5, which is just at the border of “stochastic significance” or “stability.”
To do this type of assessment, however, the SED, and not just the confidence interval, would always have to be cited as a necessary part of truth in reporting.
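As a sketch, the coefficient of stability for the earlier example (do = 8, SED = 4) and its relation to Zo and P can be computed as:

```python
# The coefficient of stability for the earlier example (d_o = 8, SED = 4):
# SED/d_o is the inverse of Z_o, so a stability criterion of <= .5 maps
# onto the familiar Z_o >= 2 and P < .05 boundary.
from statistics import NormalDist

d_o, sed = 8.0, 4.0
coefficient = sed / d_o                       # .5, right at the border
z_o = d_o / sed                               # its inverse, 2.0
p_two_tailed = 2 * (1 - NormalDist().cdf(z_o))
```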
13.7 Z or t Tests for Paired Groups
Section 7.8.2.2 was concerned with the “paired one-group t test” that can be done when individual members of the two groups are paired, usually in a before-and-after arrangement of data values. Sometimes, however, the data may represent results of a “cross-over” study where each person received
treatment A and then treatment B. In yet other situations, two parts of the same person may be paired for comparing therapeutic responses. (Suncream A may be tested on the right arm and Suncream B on the left arm.) In all of these “paired” circumstances, two groups of data are collected, but they can readily be converted to a single group and tested with the one-group (or “one-sample”) rather than two-group test.
For this procedure in Section 7.8.1, the pairs of individual data for variable V and variable W were converted to a single group of differences, di = Vi − Wi, for which we found the mean difference (d̄), the group variance (Sdd), the standard deviation (sd), and the standard error (sd/√n). We then calculated and interpreted

t (or Z) = d̄/(sd/√n)
13.7.1 Effect of Correlated Values
The main reason for converting the data and doing a one-group rather than two-group t (or Z) test is that the two-group test makes the assumption that the two sets of data are independent for each group. In a crossover, before-and-after, or analogous “paired” arrangement, however, each person’s second value is not independent; it will usually be correlated with the first value. The explanation for this effect was shown in Appendixes 7.1 and 7.2. For a difference in two paired variables, {Wi} and {Vi}, the group variance is SWW − 2SWV + SVV. If the two variables are independent, the covariance term SWV ~ 0. If the variables are related, however, SWV has a definite value and its subtraction will produce a substantial reduction in group variance for the paired values.
For example, consider the following before-and-after values for three people:
Person    “Before” Value    “After” Value    Increment
  1            160               156              4
  2            172               166              6
  3            166               161              5
The values of the mean and standard deviation for the Group B (“before”) values will be X̄B = (160 + 172 + 166)/3 = 166 and sB = 6. For the Group A (“after”) values, the corresponding results will be X̄A = (156 + 166 + 161)/3 = 161 and sA = 5. The values of group variance will be Sbb = (nB − 1)sB² = 72 and Saa = (nA − 1)sA² = 50. The increment in the two means will be 166 − 161 = 5; the pooled group variance will be 72 + 50 = 122; and its stochastic average will be 122/[(3 − 1) + (3 − 1)] = 30.5, so that the pooled standard deviation, sp, is √30.5 = 5.52.

As a separate single group of data, however, the paired increments have the same incremental mean, d̄ = (4 + 6 + 5)/3 = 5, but much lower values for variance: Sdd = 2 and sd = 1. In this instance Sdd = Saa − 2Sab + Sbb, and Sab can be calculated as ΣXAiXBi − N X̄A X̄B = 80238 − 3(166)(161) = 60. The value for Sdd becomes 50 − 2(60) + 72 = 2.

Because sd is much smaller than sp, the paired t (or Z) test is more likely to achieve stochastic significance than the “unpaired” or two-group test. In the foregoing example, the paired calculation is t = 5/(1/√3) = 5/.58 = 8.66. The unpaired or two-group calculation is t = (5/5.52)√[(3)(3)/6] = 1.11. The paired result would be stochastically significant at 2P < .05, but the unpaired two-group result would not.
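The foregoing arithmetic can be verified with a short script. This is a minimal sketch using only the standard library; the variable names are illustrative, not from the text:

```python
import math

before = [160, 172, 166]   # Group B ("before") values
after = [156, 166, 161]    # Group A ("after") values
n = len(before)

# Paired (one-group) test on the increments d_i = before_i - after_i
d = [b - a for b, a in zip(before, after)]                     # [4, 6, 5]
d_mean = sum(d) / n                                            # 5.0
s_d = math.sqrt(sum((x - d_mean) ** 2 for x in d) / (n - 1))   # 1.0
t_paired = d_mean / (s_d / math.sqrt(n))                       # 8.66

# Unpaired (two-group) test with the pooled standard deviation
mean_b = sum(before) / n                                       # 166.0
mean_a = sum(after) / n                                        # 161.0
S_bb = sum((x - mean_b) ** 2 for x in before)                  # 72.0
S_aa = sum((x - mean_a) ** 2 for x in after)                   # 50.0
s_p = math.sqrt((S_bb + S_aa) / (2 * n - 2))                   # 5.52
t_unpaired = (mean_b - mean_a) / (s_p * math.sqrt(1 / n + 1 / n))  # 1.11

print(round(t_paired, 2), round(t_unpaired, 2))
```

The same data thus yield t = 8.66 when the pairing is exploited, but only t = 1.11 when the two groups are treated as independent.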
13.7.2 Paired vs. Two-Group Arrangements
Because the paired one-group test is more likely to produce stochastic significance than the two-group test, you may wonder why investigators seeking the accolade of stochastic “significance” do not always use the paired test. Why bother with the two-group test?
The answer to this question is that most sets of data for two groups cannot be individually paired. In most studies, the members of the two groups were obtained or treated separately, i.e., “independently,”
and the two group sizes are usually unequal, i.e., nA ≠ nB. A pairing can be done, however, only when the research has been specifically designed for a paired arrangement of data in the individual persons under investigation.
The reduction in variance that allows stochastic significance to be obtained more easily with the paired individual arrangement also allows the use of smaller groups. This distinction is responsible for the avidity with which investigators try to do “crossover studies” whenever possible. Unfortunately, for reasons noted elsewhere,4 such studies are seldom possible — and even when they are, a formidable scientific problem can arise if the effects of Treatment A are “carried over” when the patient transfers to Treatment B.
13.7.3 Arbitrary Analytical Pairings
The pairing tactic is sometimes applied to check or confirm results noted with a two-group comparison. Using a type of resampling, the analyst randomly selects members of Group A to be paired with randomly selected members of Group B. Results are then calculated for the randomly selected “paired sample,” and many such samples can be constructed to see if their results verify what was previously found for the two groups. The procedure, which keeps each group intact, differs from the combination of both groups that precedes a Pitman-Welch type of permutation resampling.
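The random-pairing check described above can be sketched as follows. The data and the summary statistic (the mean within-pair difference) are hypothetical illustrations, not taken from the text:

```python
import random
import statistics

# Hypothetical two-group data; the group sizes need not be equal
group_a = [160, 172, 166, 158, 170, 164]
group_b = [156, 166, 161, 155, 168]

def one_paired_sample(a, b, rng):
    """Randomly pair members of A with members of B (each group kept
    intact) and return the mean within-pair difference."""
    k = min(len(a), len(b))
    pairs = zip(rng.sample(a, k), rng.sample(b, k))
    return statistics.mean(x - y for x, y in pairs)

rng = random.Random(0)  # fixed seed so the resampling is reproducible
resampled_means = [one_paired_sample(group_a, group_b, rng)
                   for _ in range(1000)]

# Compare the resampled results with the original two-group difference
observed = statistics.mean(group_a) - statistics.mean(group_b)
print(observed, statistics.mean(resampled_means))
```

Because each group stays intact, the spread of the resampled paired results indicates whether the original two-group finding is verified, unlike a Pitman–Welch permutation, which first combines the two groups.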
In the days before modern forms of multivariable analysis, a paired arrangement of members of two groups could be used to “match” for other pertinent variables beyond the main variable under analysis. For example, as part of the original research used to indict cigarette smoking as a cause of early mortality, the Surgeon General’s Committee5 asked Cuyler Hammond to reanalyze the original data of his long-term cohort study,6 and to compare mortality rates for “a ‘matched pair’ analysis, in which pairs of cigarette smokers and non-smokers were matched on height, education, religion, drinking habits, urban-rural residence and occupational exposure.”5
13.8 Z Test for Two Proportions
Although commonly used for two means, the Z test can also be applied to evaluate a difference in two proportions, pA = rA/nA and pB = rB/nB. Because the variance of a proportion is pq, we can use Formula [13.1] to substitute appropriately and find that the variance of the increment, pA − pB, will be (pAqA/nA) + (pBqB/nB). Under the null hypothesis that π A = π B, we estimate a common value for π as P = (nApA + nBpB)/N = (rA + rB)/(nA + nB). The estimated variance (or squared standard error) of the difference will then be (PQ/nA) + (PQ/nB) = NPQ/nAnB. The test statistic will be calculated as
Z = (pA − pB)/√(NPQ/nAnB) = [(pA − pB)/√(PQ)] × √(nAnB/N)          [13.16]
You may already have recognized that this formula is an exact counterpart of [13.13] for dimensional data. The standardized increment, which is (pA − pB)/√(PQ), is multiplied by the group-size factor, which is √(nAnB/N).
To illustrate the calculations, consider the contrast of 18/24 (75%) vs. 10/15 (67%). The common value for P would be (18 + 10)/(24 + 15) = .718 and Q = 1 − P = .282. The standardized increment will be (.75 − .67)/√[(.718)(.282)] = .178. The group-size factor will be √[(15 × 24)/39] = 3.04, and the product for Z will be (.178)(3.04) = .54. The result will not be stochastically significant if α = .05.
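Formula [13.16] and this worked example can be reproduced directly. A minimal sketch follows; the small discrepancy from .54 arises because 10/15 is carried at full precision here rather than rounded to .67:

```python
import math

r_a, n_a = 18, 24
r_b, n_b = 10, 15
p_a, p_b = r_a / n_a, r_b / n_b
N = n_a + n_b

# Common proportion under the null hypothesis
P = (r_a + r_b) / N   # .718
Q = 1 - P             # .282

standardized_increment = (p_a - p_b) / math.sqrt(P * Q)
group_size_factor = math.sqrt(n_a * n_b / N)     # 3.04
Z = standardized_increment * group_size_factor   # about .56 unrounded

# Well below 1.96, so not stochastically significant at alpha = .05
print(round(Z, 2))
```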
13.8.1 Similarity to Chi-Square Test
As noted later in Chapter 14, the value of Z in Formula [13.16] is the exact square root of the test statistic obtained when the chi-square test is applied to a contrast of two proportions. Accordingly, although the chi-square test is more conventional, the Z test can be used just as effectively for comparing two proportions.
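The equivalence can be checked numerically for the 18/24 vs. 10/15 contrast of Section 13.8. This sketch computes the chi-square statistic from the conventional 2 × 2 shortcut formula, N(ad − bc)²/[(a + b)(c + d)(a + c)(b + d)] (without continuity correction; the shortcut formula is standard for a fourfold table but is not derived in this chapter), and compares it with Z²:

```python
import math

# Fourfold table for 18/24 vs. 10/15
a, b = 18, 6   # group A: successes, failures
c, d = 10, 5   # group B: successes, failures
N = a + b + c + d

# Chi-square via the 2x2 shortcut formula (no continuity correction)
chi_square = (N * (a * d - b * c) ** 2
              / ((a + b) * (c + d) * (a + c) * (b + d)))

# Z via Formula [13.16]
p_a, p_b = a / (a + b), c / (c + d)
P = (a + c) / N
Q = 1 - P
Z = (p_a - p_b) / math.sqrt(N * P * Q / ((a + b) * (c + d)))

print(round(chi_square, 4), round(Z ** 2, 4))  # the two values agree
```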