
British Journal of Mathematical and Statistical Psychology (2002), 55, 169–175
© 2002 The British Psychological Society
www.bps.org.uk
Comparing the variances of two independent groups
Rand R. Wilcox*
University of Southern California, USA
Comparing the variances of two independent groups is a classic problem. Methods for comparing alternative measures of dispersion have been derived that perform well in simulations, but all methods for comparing variances have been found to be unsatisfactory in terms of Type I error probabilities or power. Currently, it appears that the most successful method, in terms of controlling the probability of a Type I error, is based on the so-called mean half-square successive difference statistic, where a bootstrap is used to estimate its standard error, and a calibrated bootstrap is used to control the probability of a Type I error. But the mean half-square successive difference statistic is not permutation-invariant, and it is fairly evident that power can be relatively poor. A slight modification of the percentile bootstrap method for computing a 1 − α confidence interval for the difference between the variances is considered that corrects these problems.
1. Introduction
Let σ₁² and σ₂² be the population variances corresponding to two independent groups. A classic problem is testing

H₀: σ₁² = σ₂²    (1)

versus H₁: σ₁² ≠ σ₂², or, if one prefers, computing a 1 − α confidence interval for σ₁² − σ₂² or σ₁²/σ₂². Dozens of methods have been proposed, but all of them are known to be unsatisfactory in terms of controlling Type I error probabilities or power. Many methods based on reasonable alternative measures of dispersion have been derived that perform well in simulations. It is not being argued that they have no practical value. But when there is an explicit interest in comparing variances, it is readily verified that these
*Requests for reprints should be addressed to Rand Wilcox, Department of Psychology, University of Southern California, Los Angeles, CA 90089-1061, USA.

techniques are inadequate because they are not based on estimates of the population variances.
For example, Conover, Johnson, and Johnson (1981) compared 56 tests for the more general situation where the goal is to test the hypothesis of equal variances among J ≥ 2 independent groups. The only procedures that provided satisfactory control over the probability of a Type I error were based on measuring distances from the median rather than the mean. To elaborate, let X_ij represent the ith observation randomly sampled from the jth group (i = 1, …, n_j; j = 1, 2). One example is the Brown and Forsythe (1974) test, which applies Student's t test to
Z_ij = |X_ij − M_j|,

where M_j is the median associated with the jth group. (For more than two groups they use the ANOVA F test.) Although E(|X_ij − M_j|) is a reasonable measure of variation, it is evident that Z̄_j = (1/n_j) Σᵢ Z_ij does not estimate σ_j² in general. So if there is explicit interest in comparing variances, the Brown–Forsythe test can be biased, yield poor control over the probability of a Type I error, and have unsatisfactory probability coverage (illustrations can be found in Wilcox, 1990). This is not to say that the Brown–Forsythe test has no value, only that if there is explicit interest in comparing variances, it is unsatisfactory. For a recent modification of this method where M_j is replaced by a weighted likelihood estimate of the means, see Sarkar, Kim, and Basu (1999). For a recently proposed method for comparing a measure of scale based on trimming, see Hall and Padmanabhan (1997).
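As a sketch, the Brown–Forsythe statistic for two groups can be computed as follows (my own minimal implementation, applying a pooled-variance Student's t to the Z_ij scores; degrees of freedom and the p-value computation are omitted):

```python
import numpy as np

def brown_forsythe_t(x1, x2):
    """Student's t applied to Z_ij = |X_ij - M_j|, the absolute
    distances from each group's median (Brown & Forsythe, 1974)."""
    z1 = np.abs(np.asarray(x1, dtype=float) - np.median(x1))
    z2 = np.abs(np.asarray(x2, dtype=float) - np.median(x2))
    n1, n2 = len(z1), len(z2)
    # Pooled variance of the Z scores, then the usual two-sample t.
    sp2 = ((n1 - 1) * np.var(z1, ddof=1)
           + (n2 - 1) * np.var(z2, ddof=1)) / (n1 + n2 - 2)
    return (np.mean(z1) - np.mean(z2)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
```

A group with larger spread yields larger Z scores, so a group 2 with twice the spread of group 1 produces a negative statistic.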
Additional methods for comparing variances are reviewed and compared by O'Brien (1978, 1979) and Wilcox (1990). Included in Wilcox's study are the methods proposed by O'Brien (1979) and Tiku and Balakrishnan (1984), as well as an approach based on Gini's measure of dispersion (see David, 1968). In simulations, some methods perform well in terms of Type I errors when distributions are identical, but they are unsatisfactory when distributions differ in shape. So any confidence interval based on these methods can be unsatisfactory. In fact, among the methods considered by Wilcox, the only one that performed well when distributions differ is based on a Box–Scheffé test (Box, 1953; Scheffé, 1959) but with Student's t replaced by Welch's (1938) heteroscedastic test for equal means. (Student's t is not asymptotically correct under general conditions described by Cressie & Whitford, 1986.) This method performed well with equal sample sizes, but with unequal sample sizes even it provides unsatisfactory probability coverage.
More recently, Wilcox (1992) considered several new methods for comparing variances. Two were based on Welch’s (1938) test used in conjunction with results in Conover et al. (1981) and O’Brien (1978, 1979), but control over the probability of a Type I error was unsatisfactory in simulations. Several bootstrap methods were used in an attempt to obtain better results, but none performed well.
Wilcox (1992) found only one method that proved to be successful in simulations in terms of Type I errors and probability coverage. Labelled method SD, it is based on the so-called mean half-square successive difference statistic given by
\[ r_j = \frac{\sum_{i=1}^{n_j - 1} (X_{i+1,j} - X_{ij})^2}{2(n_j - 1)}. \]
A bootstrap estimate of V_j = Var(r_j) was used, say V̂_j, after which a bootstrap estimate

of the distribution of
\[ H = \frac{r_1 - r_2}{\sqrt{\hat{V}_1 + \hat{V}_2}} \]
was used to compute a confidence interval for σ₁² − σ₂². Because E(r_j) = σ_j², method SD satisfies the basic goal of being based on a statistic that estimates σ_j². But it is evident that r_j is not permutation-invariant. Although r_j is more efficient than the estimate of σ_j² used by the Box–Scheffé test, it is also evident that r_j is relatively inefficient.
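The ingredients of method SD can be sketched as follows, under my reading of the description above (the function names, the number of bootstrap replications, and the use of naive resampling with replacement are my assumptions, not details from Wilcox, 1992; the outer calibrated bootstrap on the distribution of H is omitted):

```python
import numpy as np

def r_stat(x):
    """Mean half-square successive difference statistic r_j: the sum of
    squared successive differences over 2(n - 1).  It estimates the
    population variance, but it depends on the order of the
    observations, i.e. it is not permutation-invariant."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.diff(x) ** 2) / (2.0 * (len(x) - 1))

def boot_var_of_r(x, b=200, rng=None):
    """Bootstrap estimate V_hat of V = Var(r): recompute r on b
    resamples drawn with replacement and take their variance."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    r = np.array([r_stat(rng.choice(x, size=len(x))) for _ in range(b)])
    return np.var(r, ddof=1)

def sd_statistic(x1, x2, rng=None):
    """The statistic H = (r_1 - r_2) / sqrt(V_hat_1 + V_hat_2)."""
    rng = np.random.default_rng(rng)
    v1 = boot_var_of_r(x1, rng=rng)
    v2 = boot_var_of_r(x2, rng=rng)
    return (r_stat(x1) - r_stat(x2)) / np.sqrt(v1 + v2)
```

Note that reordering a sample changes r_stat, which is exactly the lack of permutation invariance noted in the text.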
Let s_j² be the usual sample variance associated with the jth group (j = 1, 2). The moments of s_j² can be estimated using well-known results (see Srivastava & Chan, 1989). However, based on these estimates, it seems that no satisfactory method has been found for testing (1), or computing a confidence interval for σ₁² − σ₂² or σ₁²/σ₂², and many attempts by the author have failed to find a satisfactory method based on this approach.
2. Description of the new method
A simple approach to testing (1) is with the percentile bootstrap method, but Srivastava and Chan (1989) have shown that it can be unsatisfactory when sampling from a heavy-tailed distribution. However, for the problem of comparing the variances of dependent groups, a modified percentile bootstrap method, used in conjunction with the Morgan–Pitman test (Morgan, 1939; Pitman, 1939), has been found to perform relatively well in simulations (Piepho, 1997). This suggests that a similar modification might prove to be useful for the problem at hand, and the simulation results in Section 3 support this speculation.
Let X*_ij (i = 1, …, n_j; j = 1, 2) be a bootstrap sample obtained from the jth group. That is, X*_1j, …, X*_{n_j j} are obtained by resampling with replacement n_j observations from X_1j, …, X_{n_j j}. Compute the sample variance for each of these two bootstrap samples and label the difference d. Repeat this B times, yielding d₁, …, d_B. The conventional percentile bootstrap method uses the middle 95% of these B values as a 0.95 confidence interval for σ₁² − σ₂². To illustrate the results in Srivastava and Chan (1989), suppose observations are sampled from the contaminated normal distribution
H(x) = 0.9Φ(x) + 0.1Φ(x/10),

where Φ(x) is the standard normal distribution. Then with n₁ = n₂ = 20, B = 600 and α = 0.05, the actual Type I error probability is approximately 0.118, and using B = 1000 makes a negligible difference. Increasing n to 40, the actual Type I error probability is 0.086.
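This illustration can be reproduced in outline. A minimal sketch (the function names are mine, and any rejection rate estimated with it will vary with the seed and the number of replications):

```python
import numpy as np

def contaminated_normal(n, eps=0.1, k=10.0, rng=None):
    """Draw n observations from H(x) = (1 - eps)Phi(x) + eps*Phi(x/k):
    with probability eps, a standard normal draw is inflated by k."""
    rng = np.random.default_rng(rng)
    z = rng.standard_normal(n)
    z[rng.uniform(size=n) < eps] *= k
    return z

def percentile_boot_ci(x1, x2, b=600, alpha=0.05, rng=None):
    """Conventional percentile bootstrap CI for sigma_1^2 - sigma_2^2:
    the middle (1 - alpha) portion of b bootstrap differences of
    sample variances (resampling each group with replacement)."""
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    d = np.sort([
        np.var(rng.choice(x1, size=len(x1)), ddof=1)
        - np.var(rng.choice(x2, size=len(x2)), ddof=1)
        for _ in range(b)
    ])
    lo = int(np.floor(alpha / 2 * b))
    hi = int(np.ceil((1 - alpha / 2) * b)) - 1
    return d[lo], d[hi]
```

A Type I error is recorded whenever the interval excludes zero even though the two groups have equal variances; repeating this over many replications estimates the actual level.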
However, the results in Piepho (1997) suggest that a simple modification of the percentile bootstrap method might provide fairly stable control over the probability of a Type I error across a reasonably wide range of distributions when sample sizes are small. First consider the equal sample size case, and for convenience let n represent the common sample size. The adjustment used here is the same as the adjustment considered by Piepho (1997). In particular, set B = 599 and choose values for L and U in the following manner: for n < 40, L = 7 and U = 593; for 40 ≤ n < 80, L = 8 and U = 592; for 80 ≤ n < 180, L = 11 and U = 588; for 180 ≤ n < 250, L = 14 and U = 585; and for n ≥ 250, L = 15 and U = 584. Then
(d_(L), d_(U))    (2)

is taken to be an approximate 0.95 confidence interval for σ₁² − σ₂², where d_(1) ≤ … ≤ d_(B) denote the B values written in ascending order. This will be called method MPB. Note that this method becomes the standard percentile bootstrap procedure when n ≥ 250. (The reason for using B = 599 rather than 600 stems from results in Hall, 1986, showing that it is advantageous to choose B such that α is a multiple of (B + 1)⁻¹, and here attention is focused on α = 0.05.) Some consideration was given to B = 1199 (with appropriate adjustments to L and U), but in terms of probability coverage, this has not been found to offer any advantage (cf. Booth & Sarkar, 1998).
As a partial check on method MPB, it is noted for the contaminated normal distribution described above, and n = 20, that the Type I error probability drops from 0.118 to 0.071 when switching from the percentile bootstrap method to the adjusted method just described. For n = 40 it drops from 0.086 to 0.050. (More extensive simulation results are given in Section 3.)
There remains the problem of how to deal with unequal sample sizes. A natural strategy is to take some value m between n1 and n2, such as m = (n1 + n2)/2, and use m as the sample size when choosing L and U as previously described. It soon became evident, however, that this approach is unsatisfactory in simulations. Setting m = min(n1, n2) improved matters considerably, but for certain pairs of distributions again unsatisfactory results were obtained in simulations. For example, if n1 = 20, n2 = 80, and sampling is from the contaminated normal distribution, the actual probability of a Type I error is approximately 0.168 when testing at the 0.05 level. Only one method has been found to be remotely satisfactory in terms of Type I error probabilities and probability coverage. Generally the method is better than method SD, but it will become clear that it is not optimal. The strategy is to exploit the fact that method MPB performs well with equal sample sizes. When taking a bootstrap sample, rather than resampling n1 observations from group 1 and n2 observations from group 2, simply resample m = min(n1, n2) observations from both groups and use m as the sample size when choosing L and U. For n1 = 20, n2 = 80 and when sampling from the contaminated normal, the actual Type I error probability drops from 0.168 to approximately 0.067.
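Combining the pieces, method MPB might be sketched as follows (the function name is mine; the L and U brackets are those listed in the text, applied with m = min(n1, n2) as just described):

```python
import numpy as np

def mpb_ci(x1, x2, b=599, rng=None):
    """Method MPB: approximate 0.95 confidence interval for
    sigma_1^2 - sigma_2^2.  Resample m = min(n1, n2) observations
    with replacement from BOTH groups, form b differences of
    bootstrap sample variances, sort them, and return the Lth and
    Uth order statistics, with L and U chosen according to m."""
    rng = np.random.default_rng(rng)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    m = min(len(x1), len(x2))
    d = np.sort([
        np.var(rng.choice(x1, size=m), ddof=1)
        - np.var(rng.choice(x2, size=m), ddof=1)
        for _ in range(b)
    ])
    # Brackets from the text (B = 599, 0.95 level); 1-indexed order stats.
    if m < 40:
        L, U = 7, 593
    elif m < 80:
        L, U = 8, 592
    elif m < 180:
        L, U = 11, 588
    elif m < 250:
        L, U = 14, 585
    else:
        L, U = 15, 584
    return d[L - 1], d[U - 1]
```

Note that with unequal sample sizes this deliberately discards some of the information in the larger group, which is the limitation discussed in the concluding remarks.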
3. Simulation results
This section reports simulation results when using the confidence interval given by (2). First consider the equal sample size case. The first set of simulations was performed by generating observations from various generalized lambda distributions (Ramberg, Tadikamalla, Dudewicz, & Mykytka, 1979). The inverse of this distribution is
\[ F^{-1}(p) = \lambda_1 + \frac{p^{\lambda_3} - (1 - p)^{\lambda_4}}{\lambda_2}, \]

where the λ-values determine the first four moments. Let μ_k = E(X − μ)^k and α_k = μ_k/σ^k. So α₃ and α₄ are standard measures of skewness and kurtosis. Here four types of distributions are considered and are summarized in Table 1. The case α₃ = 0 and α₄ = 3 corresponds to a distribution having the same first four moments as the standard normal distribution. For all four distributions in Table 1, the variance is one and all possible pairs of distributions were considered for the two groups being compared. Power was studied by multiplying the observations in the second group by δ, where δ = √2 or 2.
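Observations from a generalized lambda distribution are generated by inverse-transform sampling; a sketch (the function name is mine), using the normal-like λ-values from the first row of Table 1:

```python
import numpy as np

def gld_sample(n, lam1, lam2, lam3, lam4, rng=None):
    """Inverse-transform sampling from a generalized lambda
    distribution: F^{-1}(p) = lam1 + (p**lam3 - (1 - p)**lam4)/lam2,
    with p drawn uniformly on (0, 1)."""
    rng = np.random.default_rng(rng)
    p = rng.uniform(size=n)
    return lam1 + (p ** lam3 - (1.0 - p) ** lam4) / lam2

# Normal-like case (alpha_3 = 0, alpha_4 = 3).  Multiplying a second
# group's draws by delta = sqrt(2) or 2 induces a variance ratio of 2 or 4.
x = gld_sample(10_000, 0.0, 0.1974, 0.1349, 0.1349, rng=0)
```

With these λ-values the sample mean should be near 0 and the sample variance near 1, matching the first two moments of the standard normal.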

Table 1. The λ-values used in the simulations

α₃      α₄      λ₂         λ₃         λ₄
0.00    3.0     0.1974     0.1349     0.1349
0.00    9.0     −0.3203    −0.1359    −0.1359
1.00    9.0     −0.2356    −0.0844    −0.1249
2.00    15.6    −0.2532    −0.0646    −0.1472
Table 2 shows a portion of the simulation results, where α₃j and α₄j are the values of α₃ and α₄ for the jth group. (The results for method SD are taken from Wilcox, 1992, and are repeated here to add perspective.) The estimates are based on 1000 replications, so if α = 0.05, the estimate of α has a standard error of 0.0069 (cf. Robey & Barcikowski, 1992). The highest estimated Type I error probability (δ = 1) among all of the simulations based on the generalized lambda distribution, not just those shown in Table 2, was 0.043 with n = 20, and the lowest estimate was 0.021. As can be seen, method MPB nearly dominates method SD in terms of power for the situations covered in Table 2. There are exceptions, but this occurs where the actual level of method SD is about twice that of method MPB.
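The quoted standard error is simply the binomial standard error of a Monte Carlo proportion:

```python
import math

# Standard error of an estimated Type I error probability based on
# 1000 replications when the true level is alpha = 0.05.
se = math.sqrt(0.05 * 0.95 / 1000)
print(round(se, 4))  # 0.0069
```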
Table 2. Estimated Type I error probability and power, α = 0.05

                                       δ = 1          δ = √2         δ = 2
α₃₁    α₄₁    α₃₂    α₄₂     n      SD     MPB     SD     MPB     SD     MPB
0      3      0      3       20     0.021  0.032   0.154  0.344   0.420  0.835
0      3      0      3       40     0.026  0.021   0.312  0.551   0.775  0.988
0      3      0      9       20     0.075  0.042   0.187  0.163   0.435  0.596
0      3      0      9       40     0.048  0.034   0.343  0.432   0.757  0.784
0      3      1      9       20     0.053  0.043   0.089  0.175   0.243  0.615
0      3      1      9       40     0.051  0.031   0.173  0.311   0.511  0.890
0      3      2      15.6    20     0.071  0.041   0.206  0.144   0.450  0.646
0      3      2      15.6    40     0.068  0.038   0.366  0.264   0.735  0.846
Table 3 reports a portion of the simulation results when the sample sizes differ. Generally the size of the test is reasonably close to the nominal level, and the estimated Type I error probabilities are fairly stable among the situations considered.
Table 3. Estimated Type I error probability with unequal sample sizes, α = 0.05

α₃₁    α₄₁    α₃₂    α₄₂     n₁    n₂    α̂
0      3      0      3       20    40    0.043
0      3      0      9       20    40    0.029
0      3      0      9       40    20    0.032
0      3      1      9       20    40    0.033
0      3      1      9       40    20    0.031
0      3      2      15.6    20    40    0.030
0      3      2      15.6    40    20    0.041

4. Concluding remarks

It is well known that the percentile bootstrap method is relatively unstable when making inferences about means when the sample sizes are small (e.g., Efron & Tibshirani, 1993). With the adjustments described here, it becomes reasonably stable across distributions for the problem at hand, but inferences about means remain unsatisfactory. The main point is that, compared to other methods aimed specifically
at testing the hypothesis of equal variances, method MPB provides good control over the probability of a Type I error for a broader range of non-normal distributions, and it has much better power properties than method SD. It would seem, however, that for unequal sample sizes there might be room for improvement. If the sample sizes differ substantially, the bootstrap method used here does not take full advantage of all the observations in the larger group. The extent to which this causes practical problems, and whether it can be corrected, remains to be determined.
References
Booth, J. G., & Sarkar, S. (1998). Monte Carlo approximation of bootstrap variances. American Statistician, 52, 354–357.
Box, G. E. P. (1953). Non-normality and tests on variances. Biometrika, 40, 318–335.
Brown, M. B., & Forsythe, A. B. (1974). Robust tests for equality of variances. Journal of the American Statistical Association, 69, 364–367.
Conover, W. J., Johnson, M. E., & Johnson, M. M. (1981). A comparative study of tests for homogeneity of variances, with applications to the outer continental shelf bidding data.
Technometrics, 23, 351–361.
Cressie, N. A. C., & Whitford, H. J. (1986). How to use the two sample t-test. Biometrical Journal, 28, 131–148.
David, H. A. (1968). Gini’s mean difference rediscovered. Biometrika, 55, 573–574.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Hall, P. (1986). On the number of bootstrap simulations required to construct a confidence interval. Annals of Statistics, 14, 1431–1452.
Hall, P., & Padmanabhan, A. R. (1997). Adaptive inference for the two-sample scale problem.
Technometrics, 39, 412–422.
Morgan, W. A. (1939). A test for the significance of the difference between the two variances in a sample from a normal bivariate population. Biometrika, 31, 13–19.
O’Brien, R. G. (1978). Robust techniques for testing heterogeneity of variance effects in factorial designs. Psychometrika, 43, 327–342.
O’Brien, R. G. (1979). A general ANOVA method for robust tests of additive models of variances.
Journal of the American Statistical Association, 74, 877–880.
Piepho, H.-P. (1997). Tests for equality of dispersion in bivariate samples: Review and empirical comparison. Journal of Statistical Computation and Simulation, 56, 353–372.
Pitman, E. J. (1939). A note on normal correlation. Biometrika, 31, 9–12.
Ramberg, J. S., Tadikamalla, P. R., Dudewicz, E. J., & Mykytka, E. F. (1979). A probability distribution and its use in fitting data. Technometrics, 21, 201–214.
Robey, R. R., & Barcikowski, R. S. (1992). Type I error and the number of iterations in Monte Carlo studies of robustness. British Journal of Mathematical and Statistical Psychology, 45, 283–288.
Sarkar, S., Kim, C., & Basu, A. (1999). Test for homogeneity of variances using robust weighted likelihood estimates. Biometrical Journal, 7, 857–871.
Scheffé, H. (1959). The analysis of variance. New York: Wiley.
Srivastava, M. S., & Chan, Y. M. (1989). A comparison of bootstrap method and Edgeworth expansion in approximating the distribution of sample variance: One sample and two sample cases. Communications in Statistics: Simulation and Computation, 18, 339–361.
Tiku, M. L., & Balakrishnan, N. (1984). A robust test for testing the correlation coefficient.
Communications in Statistics: Simulation and Computation, 15, 945–971.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350–362.
Wilcox, R. R. (1990). Comparing variances and means when distributions have non-identical shapes. Communications in Statistics: Simulation and Computation, 19, 155–173.
Wilcox, R. R. (1992). An improved method for comparing variances when distributions have non-identical shapes. Computational Statistics & Data Analysis, 13, 163–172.
Received 11 October 2000; revised version received 14 May 2001