
The log-of-factorial-permutations score has been used clinically [15] for comparing different partitions of groups in the preparation of prognostic “staging” systems. For example, suppose a group of 200 people can be distributed as {19, 77, 82, 22} or as {20, 84, 72, 24}. The result of the log permutation score is 100.9 for the first partition and 102.4 for the second. If all other factors were equal, the second partition would be regarded as having a better (i.e., more diverse) distribution.
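These scores can be verified with a few lines of Python. The sketch below assumes the score is the common logarithm of the multinomial coefficient N!/(n1! n2! ... nk!), an assumption that reproduces both quoted values; lgamma is used so that the 200! term never has to be formed explicitly:

    from math import lgamma, log

    def log_permutation_score(counts):
        # log10 of N!/(n1! n2! ... nk!), where N = sum of the counts;
        # lgamma(x + 1) returns ln(x!), avoiding enormous factorials
        n_total = sum(counts)
        ln_score = lgamma(n_total + 1) - sum(lgamma(n + 1) for n in counts)
        return ln_score / log(10)

    print(round(log_permutation_score([19, 77, 82, 22]), 1))   # 100.9
    print(round(log_permutation_score([20, 84, 72, 24]), 1))   # 102.4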

5.10.2 Shannon’s H

An entity called Shannon’s H [16] is derived from concepts of “entropy” in information theory. The idea refers to “uncertainty” in a distribution. If all cases are in the same category, there is no uncertainty. If the cases are equally divided across categories, the uncertainty is at a maximum.

The formula for Shannon’s index of diversity is

H = −Σ p_i log p_i

where p_i is the proportion of data in each category. For patients in the three categorical stages in the example of Section 5.10.1, the respective proportions are .190, .286, and .524. Shannon’s index of diversity is

H = –[.190 log .190 + .286 log .286 + .524 log .524] = –[–.4396] = .4396
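As a check on the arithmetic, a short Python sketch (using common logarithms, as in the worked example) computes H directly from the proportions:

    from math import log10

    def shannon_h(proportions):
        # H = -sum(p * log10 p); categories with p = 0 contribute nothing
        return -sum(p * log10(p) for p in proportions if p > 0)

    print(round(shannon_h([.190, .286, .524]), 4))   # 0.4396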

The last index cited here expresses qualitative variation from a heterogeneity score calculated as the sum of the products of all pairs of frequency counts. For three categories, this sum will be f_1 f_2 + f_1 f_3 + f_2 f_3. In the example cited here, the score will be (4 × 6) + (4 × 11) + (6 × 11) = 134. The maximum heterogeneity score — with equal distribution in all categories — would occur as (7 × 7) + (7 × 7) + (7 × 7) = 147. The index of qualitative variation is the ratio of the observed and maximum scores. In this instance, it is 134/147 = .91.

Indexes of diversity are probably most useful for comparing distributions among groups, rather than for giving an absolute rating. Because the scores depend on the number of categories, the comparisons are best restricted to groups divided into the same number of categories. For the two groups of 200 people cited at the end of Section 5.10.1, the observed heterogeneity scores would respectively be 13251 and 13392. The higher result for the second group would indicate that it is more heterogeneous than the first, as noted previously with the permutation score.
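Both the heterogeneity score and the index of qualitative variation are easy to compute. The following sketch reproduces the values cited above; the maximum score is taken at an exactly equal split, which need not be a whole number of persons per category:

    from itertools import combinations

    def heterogeneity_score(counts):
        # sum of the products of all pairs of category frequencies
        return sum(a * b for a, b in combinations(counts, 2))

    def iqv(counts):
        # index of qualitative variation: observed/maximum heterogeneity
        k, n = len(counts), sum(counts)
        return heterogeneity_score(counts) / heterogeneity_score([n / k] * k)

    print(heterogeneity_score([4, 6, 11]))         # 134
    print(round(iqv([4, 6, 11]), 2))               # 0.91
    print(heterogeneity_score([19, 77, 82, 22]))   # 13251
    print(heterogeneity_score([20, 84, 72, 24]))   # 13392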

References

1. Tukey, 1977.
2. Cleveland, 1994.
3. Wald, 1993.
4. Reynolds, 1993.
5. Wallis, 1956.
6. Snedecor, 1980.
7. Miller, 1993.
8. Spear, 1952.
9. McGill, 1978.
10. SAS Procedures Guide, 1990.
11. SPSSx User’s Guide, 1986.
12. Geary, 1936.
13. Royston, 1993a.
14. Royston, 1993b.
15. Feinstein, 1972.
16. Shannon, 1948.

Exercises

5.1. In one mechanism of constructing a “range of normal,” the values of the selected variable are obtained for a group of apparently healthy people. The inner 95% of the data is then called the “range of normal.” Do you agree with this tactic? If not, why not — and what alternative would you offer?

5.2. At many medical laboratories, the “range of normal” is determined from results obtained in a consecutive series of routine specimens, which will include tests on sick as well as healthy people. What would you propose as a pragmatically feasible method for determining the “range of normal” for this type of data?


5.3. A “range of normal” can be demarcated with a Gaussian inner zone around the mean, with an inner percentile zone, or with a zone eccentrically placed at one end of the data. Each of these methods will be good for certain variables, and not good for others. Name two medical variables for which each of these three methods would be the best statistical strategy of establishing a “range of normal.” Offer a brief justification for your choices. (Total of six variables to be named. Please avoid any already cited as illustrations.)

5.4. The quartile deviation is almost always smaller than the corresponding standard deviation. Why do you think this happens?

5.5. For the data in Table 3.1, the mean is 22.732 and the standard deviation (calculated with n − 1) is 7.693.

5.5.1. What would have been the standard deviation if calculated with n?

5.5.2. What is the coefficient of variation for the data?

5.5.3. What is the quartile coefficient of variation?

5.6. What corresponding values do you get when an inner 95% zone is determined by percentiles and by a Gaussian demarcation in the data of Table 3.1? Why do the results differ?

5.7. Using whatever medical journal you would like, find three sets of data (each describing one variable in one group) that have been summarized with an index of central tendency and an index of spread. From what you know or have been shown about the data, are you content with these summary values? Mention what they are, and why you are content or discontent. Be sure to find at least one data set for which the published summary indexes have evoked your displeasure, and indicate why you are unhappy. As you read more of the published report where those indexes appeared, do you think they have led to any serious distortions in the authors’ conclusions? If so, outline the problem.

5.8. A colleague who wants to analyze categories rather than dimensional values has asked you to divide the data of Table 3.1 into five categories. What partition would you choose? How would you demonstrate its effectiveness?


6

Probabilities and Standardized Indexes

CONTENTS

6.1 Stability, Inference, and Probability
6.1.1 “Perturbation” and Inference
6.1.2 Probability and Sampling
6.2 Why Learn about Probability?
6.3 Concepts of Probability
6.3.1 Subjective Probabilities
6.3.2 Frequentist Probabilities
6.3.3 Other Approaches
6.4 Empirical Redistributions
6.4.1 Principles of Resampling
6.4.2 Standard Error, Coefficient of Stability, and Zones of Probability
6.4.3 Bootstrap Procedure
6.4.4 Jackknife Procedure
6.4.5 “Computer-Intensive” Statistics
6.5 Demarcating Zones of Probability
6.5.1 Choosing Zones
6.5.2 Setting Markers
6.5.3 Internal and External Zones
6.5.4 Two-Zone and Three-Zone Boundaries
6.5.5 P-Values and Tails of Probability
6.5.6 Confidence Intervals and Internal Zones
6.5.7 To-and-from Conversions for Observations and Probabilities
6.6 Gaussian Z-Score Demarcations
6.6.1 One-Tailed Probabilities
6.6.2 Two-Tailed Probabilities
6.6.3 Distinctions in Cumulative Probabilities for Z-Scores
6.6.4 Designated and Observed Zones of Probability
6.6.5 Problems and Inconsistencies in Symbols
6.6.6 Determining Gaussian Zones
6.7 Disparities Between Theory and Reality
6.8 Verbal Estimates of Magnitudes
6.8.1 Poetic License
6.8.2 Estimates of Amounts
6.8.3 Estimates of Frequencies and Probabilities
6.8.4 Virtues of Imprecision
References
Exercises

We now have methods to summarize a group of data by citing indexes of central location and spread. We can also examine box plots and relative spreads to indicate whether the data have a symmetric distribution, and whether the central index does a satisfactory job of representing the group. For decisions that one group is taller, heavier, or wealthier than another, the central indexes can be compared directly if they are satisfactory. If not, additional mechanisms will be needed for the comparison.


The foregoing approach seems reasonable and straightforward, but it does not take care of a major problem that has not yet been considered: the stability of the indexes. When we compared 50% vs. 33% back in Chapter 1, each group was suitably represented by the proportions of 50% and 33%, but we were uneasy about the comparison if the actual numbers were 1/2 vs. 1/3, and relatively comfortable if they were 150/300 vs. 100/300.

6.1 Stability, Inference, and Probability

The comparison of 1/2 vs. 1/3 left us uneasy because the summary indexes were unstable. If the first group had one more member, its result could be 1/3, producing the same 33% as in the second group. If the second group had one more member, its result might be 2/4, producing the same 50% as in the first group. On the other hand, the original comparison of 50% vs. 33% would hardly be changed if one more person were added to (or removed from) groups with results of 150/300 vs. 100/300.

Even if the groups are large, stability of the indexes is not always assured. For example, in two groups that each contain 1000 people, a ratio indicating that one group has twice the mortality rate of the other might seem impressive if the rates are 200/1000 vs. 100/1000, but not if they are 2/1000 vs. 1/1000.

For these reasons, statistical decisions require attention to yet another component — stability of the central index — that is not denoted merely by the index itself or by the spread of the data. The evaluation of stability involves concepts of probability and inference that will be discussed in this chapter before we turn to the mechanisms of evaluation in Chapters 7 and 8.

6.1.1 “Perturbation” and Inference

Stability of a central index is tested with various methods of “perturbing” the data. The methods, which are discussed in Chapters 7 and 8, include such tactics as “jackknife” removal of individual items, “bootstrap” resampling from an empirical group, and “parametric” sampling from a theoretical population to determine a “confidence interval.”

The perturbations are used to answer the question, “What would happen if…?” Each set of results allows us to estimate both a range of possible values for the central index, and also a probability that the values will be attained. All of these estimates represent inference, rather than evidence, because they are derived from the observed data but are not themselves directly observed.

6.1.2 Probability and Sampling

If a set of data has already been collected, it can be “perturbed” in various ways to make inferences about the probability of what might happen. A different and particularly important role of probability, however, occurs when we do not yet have any data, but plan to get some by a sampling process. For example, suppose we want to conduct a political poll to determine the preferences of a group of voters. How large a sample would make us feel confident about the results? The latter question cannot be answered with the tactic of perturbing an observed group of data, because we do not yet have any data available. The decision about the size of the sample can be made, however, with principles of probability that will be cited later.

Because probability has many other roles in statistics, this chapter offers an introductory discussion of its concepts and application. The idea of probability usually appears in statistics not for descriptive summaries of observed data, but in reference to inferential decisions. Nevertheless, probability has already been mentioned several times (in Sections 4.5 and 4.8) in connection with such descriptive indexes as percentiles and Z-scores. In addition, all of the inner zones discussed in Chapter 5 describe proportions of the observed data; and each proportion can be regarded as the probability that a randomly chosen item of data would come from the designated zone.

A particularly valuable attribute of zones of probability is their role as the main intellectual “bridge” between the descriptive and inferential strategies of statistical reasoning. An internal zone of probability is used, as noted later in Chapter 7, to construct boundaries for a confidence interval that shows possible variations in the location of a central index. An external zone of probability is used, as discussed here and in many subsequent chapters, to establish P values for the possibility that important distinctions in the observed results arose by chance alone.

6.2 Why Learn about Probability?

Mathematical discussions of probability are usually the place at which most non-statisticians lose interest in statistics. The reader (or student) usually wants to learn about quantitative distinctions: how to examine the magnitudes of observed data in a group; how to contrast the results in two (or more) groups; how to check the associations among two (or more) variables. Seeking enlightenment about these pragmatic activities in the real world of statistical description, the reader wonders why the discussion has entered the theoretical world of probabilities and statistical inference. When the discussion begins to include long mathematical details about the “laws of probability,” principles of sampling, and other theoretical background, the reader is usually bored, repelled, or confused beyond redemption.

The purpose of these remarks is to indicate why probability is important, and to assure you that the forthcoming discussion will try to avoid many of the unattractive mathematical details.

The prime reason for being concerned about probability is that medical research often involves small groups and data that create uncertainty about unstable numerical results, such as those discussed at the beginning of this chapter. The instabilities can lead to potentially troublesome events. A political candidate who discovers she is favored by 67% of the voters may reduce her campaigning efforts because she feels sure of winning the election. If the value of 67% reflects the preference of only 4 of 6 people, however, the decision might be disastrous. If we conclude that treatment A is definitely better than treatment B because the respective success rates are 50% vs. 33%, the conclusion might be dramatically wrong if the actual observed numbers were only 4/8 vs. 2/6.

In considering the consequences of numerical instability, we use probabilities to determine the chance of occurrence for potentially distressing events. If the probability is small enough, we might stop worrying. Thus, if the political poll in a properly conducted survey shows that the candidate is favored by 40 of 60 voters, the probability is about one in a hundred (as is demonstrated later) that the observed result of .67 represents a voting population whose preference for that candidate might actually be as low as .51. With this small a chance of losing, the candidate — particularly if running out of funds — may decide to reduce campaign efforts.
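The one-in-a-hundred figure is demonstrated later in the text, but a brief binomial sketch can preview it. The framing below (the chance that 40 or more of 60 polled voters favor the candidate when the true population preference is only .51) is this sketch’s interpretation of the claim:

    from math import comb

    n, p = 60, 0.51
    # chance that 40 or more of 60 voters favor the candidate
    # if the true population preference is only .51
    tail = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(40, n + 1))
    print(round(tail, 2))   # about 0.01, i.e., roughly one chance in a hundred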

Not all statistical decisions involve appraisals of probability. When the first heart was successfully transplanted, or the first traumatically severed limb was successfully reattached, neither statistics nor probabilities were needed to prove that the surgery could work. No control groups or statistical tabulations were necessary to demonstrate success on the initial occasions when diabetic acidosis was promptly reversed by insulin, when bacterial endocarditis was cured by penicillin, or when a fertilized embryo was produced outside the uterus. No calculations of probability are used when a baseball team wins the pennant with a success proportion of 99/162 = .611, although this result seems but trivially larger than the 98/162 = .605 result for the team that finishes in second place.

Many eminent scientists have said, “If you have to prove it with statistics, your experiment isn’t very good,” and the statement is true for many types of research. If the experiment always produces the same “deterministic” result, no statistics are needed. The entity under study always happens or it does not; and the probability of occurrence is either 100% or 0%. For example, if we mix equal amounts of the colors blue and yellow, the result should always be green. If the color that emerges is purple, statistics are not needed to prove that something must have gone wrong.

If the anticipated results are not “deterministic,” however, probabilities will always be produced by the uncertainties. If told there is a 38% chance of rain today, we have to decide whether to take an umbrella. If told that a particular treatment usually has 80% failure, we might wonder about believing a clinician who claims three consecutive successes.


The probabilities that can often be valuable for these decisions are also often abused. A surgeon is said to have told a patient, “The usual mortality rate for this operation is 90%; and my last nine patients all died. You are very lucky because you are the tenth.”

Many (if not most) decisions in the world of public health, clinical practice, and medical research involve statistical information that has an inevitable numerical uncertainty. The groups were relatively small, or the data (even if the groups were large) had variations that were nondeterministic. In these circumstances, wrong decisions can be made if we fail to anticipate the effects of numerical instability.

The world of probabilistic inference joins the world of descriptive statistics, therefore, because decisions are regularly made amid numerical instability, and because citations of probability are invaluable components of the decisions. The discussion of probability in this chapter has three main goals:

1. To acquaint you with the basic principles and the way they are used;

2. To emphasize what is actually useful, avoiding details of the mathematical “ethos” that underlies the principles; and

3. To make you aware of what you need to know, and wary of unjustified tactics or erroneous claims that sometimes appear in medical literature.

Finally, if you start to wonder whether all this attention to probability is warranted for a single group of data, you can be reassured that all decisions about probability are based on a single group of data. When two or more groups are compared, or when two or more variables are associated, the probabilistic decisions are all referred to the distribution of a single group of data. That single group can be constructed in an empirical or theoretical manner, but its distribution becomes the basis for evaluating the role of probability in comparing the observations made in several groups or several variables. Thus, if you understand the appropriate characteristics of a single group of data, you will be ready for all the other probabilistic decisions that come later.

6.3 Concepts of Probability

In many fields of science, the most fundamental ideas are often difficult to define. What should be intellectual bedrock may emerge on closer examination to resemble a swamp. This type of problem can be noted in medicine if you try to give unassailable, non-controversial definitions for such ideas as health, disease, and normal. In statistics, the same problem arises for probability. It can represent many different ideas, definitions, and applications.

6.3.1 Subjective Probabilities

In one concept, probability represents a subjective impression that depends on feelings rather than specific data. This type of subjective guess determines the odds given for bets on a sporting event. The odds may be affected by previous data about each team’s past performance, but the current estimate is always a hunch. Similarly, if you take a guess about the likelihood that you will enjoy learning statistics, you might express the chance as a probability, but the expression is subjective, without any direct, quantitative background data as its basis.

6.3.2 Frequentist Probabilities

In most statistical situations, probability represents an estimate based on the known frequency characteristics of a particular distribution of events or possibilities. These estimates, which are called frequentist probabilities, are the customary concept when the idea of “probability” appears in medical literature. The results depend on frequency counts in a distribution of data.

If the data are binary, and the results show 81 successes and 22 failures for a particular treatment, the success proportion of 81/103 = .79 would lead to 79% as the probability estimate of success. If the data are dimensional, as in the chloride information of Table 4.1 in Chapter 4, the distribution of results can lead to diverse probabilities based on relative and cumulative relative frequencies. If we randomly selected one chloride value from the series of items in that table, the chances (or probabilities) are .0780 that the item will be exactly 101, .0598 that it will lie below 93, .8598 (= .9196 − .0598) that it will lie between 93 and 107, and .0804 that it will be above 107.

Frequentist probabilities need not come from an actual distribution of data, however. They can also be obtained theoretically from the structure of a particular device, mechanism, or model. Knowing the structure of a fair coin, we would expect the chance to be 1/2 for getting a head when the coin is tossed. Similarly, a standard deck of 52 playing cards has a structure that would make us estimate 13/52 or 1/4 for the chance of getting a spade if one card is chosen randomly from the deck. The chance of getting a “picture card” (Jack, Queen, or King) is (4 + 4 + 4)/52 = 12/52 = .23.

Frequentist probabilities can also be constructed, as noted later, from modern new tactics that create empirical redistributions or resamplings from an observed set of data.

6.3.3 Other Approaches

Just as human illness can be approached with the concepts of “orthodox” medicine, chiropractic, naturopathy, witchcraft, and other schools of “healing,” statistical approaches to probability have different schools of thought. The conventional, customary approach — which employs frequentist probability — will be used throughout this text. Another school of thought, however, has attracted the creative imagination of capable statisticians. The alternative approach, called Bayesian inference, often appears in statistical literature and is sometimes used in sophisticated discussions of statistical issues in medical research. Bayesian inference shares an eponym with Bayes theorem, which is medically familiar from its often proposed application in the analysis of diagnostic tests. Bayes theorem, however, is a relatively simple algebraic truism (see Chapter 21), whereas Bayesian inference is a highly complex system of reasoning, involving subjective as well as frequentist probabilities.

Yet another complex system for making estimates and expressing probabilities is the likelihood method, which relies on various arrangements for multiplying or dividing “conditional” probabilities. Its simplest applications are illustrated later in the text as the likelihood ratios that sometimes express results of diagnostic tests, and as the maximum likelihood strategy used to find regression coefficients when the conventional “least squares” method is not employed in statistical associations.

6.4 Empirical Redistributions

The frequentist probabilities discussed in Section 6.3.2 were obtained in at least three ways. One of them is empirical, from a distribution of observed data, such as the success proportion of .79 (=81/103), or the different values and frequencies in the chloride data of Table 4.1. The other two methods rely on theoretical expectations from either a device, such as a coin or a deck of cards, or from the “model” structure of a mathematical distribution, such as a Gaussian curve. Mathematical models, which will be discussed later, have become a traditional basis for the probabilities used in statistical inference.

The rest of this section is devoted to a fourth (and new) method, made possible by modern computation, that determines probabilities and uncertainties from redistributions or “resamplings” of the observed, empirical data.

6.4.1 Principles of Resampling

The great virtue of a theoretical distribution is that it saves work. If we did not know whether a coin was “fair,” i.e., suitably balanced, we would have to toss it over and over repeatedly to tabulate the numbers of heads and tails before concluding that the probability of getting a head was 1/2. If we were asked to estimate the chance of drawing a spade from 100 well-shuffled decks of cards, we could promptly guess 1/4. Without the theoretical “model,” however, we might have to count all 5200 cards or check results in repeated smaller samplings.

The Gaussian distribution discussed in Section 4.8.3 has a similar virtue. If we know that a particular distribution of data is Gaussian and if we are given a particular member’s Z-score, we can promptly use the Gaussian attributes of Z-scores to determine the corresponding percentile and probability values. If we could not assume that the data were Gaussian, however, we would have to determine the desired probabilities either by examining the entire group of data or by taking repeated samplings.

The repeated-sampling process, in fact, was the way in which certain mathematical model distributions were discovered. For example, the t-distribution (discussed later) was found when W. S. Gosset, who published his results under the pseudonym of “Student,” took repeated small samples from a large collection of measurements describing finger lengths and other attributes of criminals [1].

With the modern availability of easy electronic computation, new methods have been developed to create distributions empirically as resampled rearrangements of an observed set of data. The method of forming these ad hoc empirical redistributions, without recourse to any theoretical mathematical models, uses bootstrap and jackknife strategies that will be briefly described here and further discussed in Chapter 7.

6.4.1.1 Illustrative Example — To illustrate the resampling process, suppose we want to assess the stability of the mean in the three-member data set {1, 6, 9}, for which X̄ = 16/3 = 5.33. We know that the three items of data have come from somewhere, and that the source included items whose values were 1, 6, and 9. As discussed later, we can try to construct a hypothetical population that might be the source, and we can contemplate theoretical samples that might be drawn from that population. Alternatively, however, we can work with the real data that have actually been observed. We can regard the values of 1, 6, and 9 as the entire “parent population.” We can then take repeated samples of 3 members from this “population,” being careful to replace each member after it is drawn. We determine the mean of each sample and then note the distribution of those means. The variations in the array of means can then help answer the question about stability of the original mean.

To construct each sample, we randomly draw one member from the set of data {1, 6, 9}. After replacing it, we randomly draw a second member. We replace it and then randomly draw a third member. We then calculate the mean for those three values. The process is repeated for the next sample. As the sampling process continues, it will yield a series of means for each of the three-member sets of data. To denote variation in the distribution, we can determine the standard deviation and other features of that series of means.

6.4.1.2 Mechanism of Redistribution — The sampling process itself can be avoided if we work out the specific details of what might happen. Since three possibilities exist for each choice of a member from the 3-item data set, there will be 3³ = 27 possible samples of three members. Those samples, and their means, are shown in Table 6.1.

Table 6.2 shows the frequency distribution of the 27 means in Table 6.1. The distribution has a mode of 5.33, which is the same as the original mean of 5.33. The mean of the 27 means is also 5.33, calculated as [(1 × 1.00) + (3 × 2.67) + (3 × 3.67) + … + (3 × 8.00) + (1 × 9.00)]/27 = 143.98/27 = 5.33. The group variance, using the formula ΣXᵢ² − [(ΣXᵢ)²/n], is [(1 × 1.00²) + (3 × 2.67²) + (3 × 3.67²) + … + (3 × 8.00²) + (1 × 9.00²)] − [(143.98)²/27] = 865.700 − 767.787 = 97.913. For this complete set of 27 samples, the group variance is divided by n to calculate the variance, which becomes 97.913/27 = 3.626. The square root of the variance is the standard deviation, 1.904.
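The complete redistribution is small enough to enumerate by machine. A brief Python sketch generates all 27 samples, confirms the mean of the means and its standard deviation, and compares the result with the theoretical formula discussed in the next section (the text’s 1.904 reflects rounding of the individual means to two decimals):

    from itertools import product
    from math import sqrt

    data = [1, 6, 9]
    # all 3**3 = 27 possible samples of three members, drawn with replacement
    means = [sum(sample) / 3 for sample in product(data, repeat=3)]

    grand_mean = sum(means) / len(means)
    # complete set of samples, so the group variance is divided by n, not n - 1
    variance = sum(m ** 2 for m in means) / 27 - grand_mean ** 2
    print(round(grand_mean, 2), round(sqrt(variance), 3))   # 5.33 1.905

    # theoretical check: sigma computed from the parent "population" {1, 6, 9}
    sigma = sqrt(sum((x - 16 / 3) ** 2 for x in data) / 3)  # 3.30
    print(round(sigma / sqrt(3), 3))                        # 1.905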

6.4.2 Standard Error, Coefficient of Stability, and Zones of Probability

The term standard error of the mean refers to the standard deviation found in the means of an array of repeated samples. In resamplings of the data set {1, 6, 9}, the mean value of 5.33 will have a standard deviation, i.e., standard error, of 1.904.

The more customary method of determining standard error is with theoretical principles discussed later. According to those principles, the standard error in a sample containing n members is σ/√n, where σ is the standard deviation of the parent population.


TABLE 6.1
All Possible Samples and Corresponding Means in a 3-Member Complete Resampling of the Set of Data {1, 6, 9}

Sample : Mean      Sample : Mean      Sample : Mean
{1, 1, 1} : 1.00   {6, 1, 1} : 2.67   {9, 1, 1} : 3.67
{1, 1, 6} : 2.67   {6, 1, 6} : 4.33   {9, 1, 6} : 5.33
{1, 1, 9} : 3.67   {6, 1, 9} : 5.33   {9, 1, 9} : 6.33
{1, 6, 1} : 2.67   {6, 6, 1} : 4.33   {9, 6, 1} : 5.33
{1, 6, 6} : 4.33   {6, 6, 6} : 6.00   {9, 6, 6} : 7.00
{1, 6, 9} : 5.33   {6, 6, 9} : 7.00   {9, 6, 9} : 8.00
{1, 9, 1} : 3.67   {6, 9, 1} : 5.33   {9, 9, 1} : 6.33
{1, 9, 6} : 5.33   {6, 9, 6} : 7.00   {9, 9, 6} : 8.00
{1, 9, 9} : 6.33   {6, 9, 9} : 8.00   {9, 9, 9} : 9.00

In this instance, the group of data {1, 6, 9} is a complete population, rather than a sample, and its standard deviation is calculated with n rather than n − 1. Thus σ² = [(1² + 6² + 9²) − (16²/3)]/3 = 10.89 and σ = 3.30. The value of σ/√n will be 3.30/√3 = 1.905, which is essentially identical to the result obtained with resampling.

To evaluate stability, we can examine the ratio of standard error/mean. This ratio, which might be called the coefficient of stability, is a counterpart of the coefficient of variation, calculated as c.v. = standard deviation/mean. In this instance, the coefficient of stability (c.s.) will be 1.905/5.33 = .36. The interpretation of this index will be discussed later. Intuitively, however, we can immediately regard the value of .36 as indicating an unstable mean. We would expect a “stable” coefficient of stability for the mean to be substantially smaller than an adequate coefficient of variation for the data. Yet the value of .36 for c.s. is much larger than the values of .10 or .15 proposed in Section 5.5.1 as a standard of adequacy for the c.v.

TABLE 6.2
Frequency Distribution of Means in Table 6.1

Value    Frequency
1.00     1
2.67     3
3.67     3
4.33     3
5.33     6
6.00     1
6.33     3
7.00     3
8.00     3
9.00     1
Total    27

Another approach to the array of resampled means in Table 6.2 is to note that the value of the mean will range from 2.67 to 8.00 in 25 (92.6%) of the 27 possible resamplings. Excluding the four most extreme values at each end of Table 6.2, we could state that the mean will range from 3.67 to 7.00 in 19/27 (= 70.4%) of the resamplings. Expressed in probabilities, the chances are .704 that the mean would be in the range 3.67 to 7.00, and .926 for the range of 2.67 to 8.00. There is a chance of 1/27 = .04 that the mean would be either as high as 9.00 or as low as 1.00.
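The coefficient of stability and the two inner zones can be reproduced directly from the enumerated means. In the sketch below, each mean is rounded to two decimals, as in Table 6.2, so that the zone boundaries match the tabulated values:

    from itertools import product

    data = [1, 6, 9]
    means = [round(sum(s) / 3, 2) for s in product(data, repeat=3)]

    grand_mean = sum(means) / 27
    se = (sum((m - grand_mean) ** 2 for m in means) / 27) ** 0.5
    print(round(se / grand_mean, 2))   # 0.36  (coefficient of stability)

    # proportions of resampled means falling in each inner zone
    print(round(sum(2.67 <= m <= 8.00 for m in means) / 27, 3))   # 0.926
    print(round(sum(3.67 <= m <= 7.00 for m in means) / 27, 3))   # 0.704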

This resampling activity may seem like a lot of needless work to get the same standard error of 1.904 that could have been found so easily with the theoretical σ/√n formula. The resampling process, however, has several major advantages that will be discussed later. Two virtues that are already apparent are (1) it helps confirm the accuracy of the σ/√n formula, which we would otherwise have to accept without proof; and (2) it offers an accurate account of the inner zone of possible variations for the resampled means. The inner zone from 2.67 to 8.00, which contains 92.6% of the possible values, is something we shall meet later, called a confidence interval for the mean. The range of this interval can also be determined with the theoretical tactics discussed later (Chapter 7), but the resampling process offers an exact demonstration.

6.4.3 Bootstrap Procedure

The strategy of examining samples constructed directly from members of an observed distribution is called the bootstrap procedure. If the original group contains m members, each one is eligible to be a member of the resampled group. If each resampled group has n members, there will be mⁿ possible samples. Thus, there were 3³ = 27 arrangements of resamples containing 3 members from the three-member group. If each of the resampled groups contained only two members, there would be a total of 3² = 9 groups.

In a complete resampling, as in Section 6.4.1.2, all possible arrangements are considered — so that mᵐ samples are produced. Thus, a complete resampling from the 20 items of data in Exercise 3.3 would produce 20²⁰ = 1.05 × 10²⁶ arrangements. A properly programmed computer could slog through all of them, but a great deal of time would be required. Consequently, the bootstrap approach seldom uses a complete resampling. Instead, the computer constructs a large set (somewhere between 500 and 10,000) of random samples, each taken from the original group of n members, with each member suitably replaced each time. The name Monte Carlo is sometimes used for the process of taking multiple random samples from an observed collection of data. The Monte Carlo resamples are then checked for the distributional patterns of means (or any other desired attributes) in the collection of samples.
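A minimal Monte Carlo bootstrap can be sketched in a few lines of Python; the resample count of 2,000 and the fixed seed are arbitrary illustrative choices:

    import random

    def bootstrap_means(data, n_resamples=2000, seed=0):
        # draw len(data) members with replacement, n_resamples times,
        # and collect the mean of each resampled group
        rng = random.Random(seed)
        n = len(data)
        return [sum(rng.choices(data, k=n)) / n for _ in range(n_resamples)]

    means = bootstrap_means([1, 6, 9])
    center = sum(means) / len(means)
    se = (sum((m - center) ** 2 for m in means) / len(means)) ** 0.5
    print(round(center, 2), round(se, 2))   # close to 5.33 and 1.90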

With dimensional data, the bootstrap procedure is often applied for a limited rather than complete resampling of the original data, but the complete procedure is relatively easy to do if the data have binary values of 0/1. For binary data, mathematical formulas (discussed later in Chapter 8) can be used to anticipate the complete bootstrapped results.

6.4.4 Jackknife Procedure

An even simpler method of determining the stability of the mean for the set of data {1, 6, 9} is to use a sequential removal (and restoration) strategy. For the three-item data set, the removal of the first member produces {6, 9}, with a mean of 7.5. After the first member is restored, removal of the second member produces {1, 9}, with a mean of 5.0. Removal of the third member leads to {1, 6}, with a mean of 3.5. Thus, removal of a single member can make the original mean of 5.33 change to values that range from 3.5 to 7.5. [Note that the mean of the rearrangements is (7.5 + 5.0 + 3.5)/3 = 16/3 = 5.33, which is again the value of the original mean.]

The strategy of forming rearrangements by successively removing (and then later restoring) each member of the data set is called the jackknife procedure. In a data set containing n members, the procedure produces n rearranged groups, each containing n – 1 members.
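Because each leave-one-out mean is simply (total − removed value)/(n − 1), the jackknife rearrangements take only a few lines of Python:

    def jackknife_means(data):
        # remove each member in turn and average the remaining n - 1 values
        total, n = sum(data), len(data)
        return [(total - x) / (n - 1) for x in data]

    print(jackknife_means([1, 6, 9]))   # [7.5, 5.0, 3.5]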

The jackknife procedure, which will later be used here to evaluate the “fragility” of a comparison of two groups of data, is particularly valuable in certain multivariable analyses. At the moment, however, the main value of the jackknife is its use as a “screening” procedure for the stability of a mean. In the data set of {1, 6, 9} we might want to avoid calculating standard errors with either a theoretical formula or a bootstrap resampling. If the jackknife possibilities show that a mean of 5.33 can drop to 3.5 or rise to 7.5 with removal of only 1 member, the mean is obviously unstable. This type of “screening” examination, like various other tactics discussed later, offers a simple “commonsense” approach for making statistical decisions without getting immersed in mathematical formulas or doctrines.

6.4.5 “Computer-Intensive” Statistics

The bootstrap and jackknife procedures are examples of a “revolution,” made possible by computers, that is now occurring in the world of statistics. An analogous revolution, which has caused diagnostic radiology to change its name to imaging, has already been facilitated by computers in the world of medicine.

The jackknife procedure was first proposed by Quenouille [2] and later advocated by Tukey [3]. An analog of resampling was developed many years ago by R. A. Fisher [4] to construct a special permutation test discussed in Chapter 12, but the modern approaches to resampling methods were developed by Julian Simon [5] and independently by Bradley Efron [6]. Efron introduced the term bootstrap and later gave the name computer-intensive [7] to both types of new procedures, because they could not be used regularly or easily until computers were available to do the necessary calculations.

In the “revolution” produced by the new procedures, statistical inference with probabilities can be done in an entirely new manner, without relying on any of the theoretical “parametric” mathematical models that have become the traditional custom during the past century. The revolution has not yet
