
After reading Alfred Tennyson’s poem “The Vision of Sin,” Babbage is said to have written the following letter to the poet:

In your otherwise beautiful poem, there is a verse which reads:

“Every moment dies a man Every moment one is born.”

It must be manifest that, were this true, the population of the world would be at a standstill. In truth the rate of birth is slightly in excess of that of death. I would suggest that in the next edition of your poem you have it read:

“Every moment dies a man
Every moment 1 1/16 is born.”

Strictly speaking this is not correct. The actual figure is a decimal so long that I cannot get it in the line, but I believe 1 1/16 will be sufficiently accurate for poetry. I am etc.12

6.8.2 Estimates of Amounts

After hearing reports that a child “throws up a lot” or “throws up the whole feed,” Pillo-Blocka et al.13 asked 58 mothers of children less than 2 years old to estimate “two volumes of standard formula — 5 ml. and 10 ml. — which had been spilled uniformly over a baby’s wash cloth, with the mother out of sight.” The median estimates were 30 ml. for the smaller spill and 50 ml. for the larger. Only 1 of the 58 mothers underestimated the spill, and only two mothers were close to correct. All the other mothers “grossly” overestimated the emesis volumes, and the degree of accuracy had no correlation with the mothers’ educational status or age.

6.8.3 Estimates of Frequencies and Probabilities

In the foregoing example, the observers responded to a direct visual challenge. In the other investigations cited here, the research subjects were asked conceptually to quantify the meaning of words for frequency and/or probability. Different subjects and research structures were used for the projects. The number of verbal terms to be quantified ranged from a few (i.e., 5) to many (i.e., 41). The quantitative expressions could be chosen from the following citations: dimensional selections on an unmarked continuum from 0 to 1; choices in increments of .05 or .1 extending from 0 to 1; “percentages” for each term; and four formats of quantitative expressions that contained special zones for “high probability” and “low probability,” as well as a “uniform” ordinal scale, and a “free choice” between 0 and 100. The research subjects included different groups of medical personnel (nurses, house officers, students, laboratory technologists, hospital-based physicians in various specialties) as well as members of a hospital Board of Trustees. The nonmedical participants included graduate students at a school of business administration, highly skilled or professional workers (secondary school teachers, engineers), an otherwise unidentified group of “nonphysicians,” and a set of male employees of the “System Development Corporation.”

With these diverse designs and approaches, the investigators then proceeded to analyze and report the inconsistencies.

6.8.3.1 Frequencies — Figure 6.7 shows Robertson’s summary14 of four studies that included ratings for five categories ranging from always to rarely. Although the mean values were reasonably consistent (except for the Board of Trustees’ mean estimate of 23% for rarely), the most noteworthy result was the wide range of individual responses. In the conclusion of the review, physicians were urged to use precise percentages instead of “shooting from the mouth.”

© 2002 by Chapman & Hall/CRC

                 Children's Orthopedic     N Engl J Med       University of       Board of
                 Hospital, Seattle         Study Sample, %    Washington          Trustees, %
Term             Physicians, % (n=32)      (n=53)             MBAs, % (n=80)      (n=40)
Always
  x̄ ± SD        99 ± 2                    98 ± 6             98 ± 3              100 ± 0
  Range          90–100                    60–100             80–100              100
Often
  x̄ ± SD        61 ± 13                   59 ± 17            61 ± 16             57 ± 13
  Range          30–80                     20–90              20–90               20–90
Sometimes
  x̄ ± SD        33 ± 17                   34 ± 16            38 ± 12             37 ± 21
  Range          10–60                     0–90               10–60               10–70
On occasion
  x̄ ± SD        12 ± 7                    20 ± 16            20 ± 10             18 ± 21
  Range          0–20                      0–70               5–50                10–60
Rarely
  x̄ ± SD        5 ± 4                     15 ± 34            12 ± 20             23 ± 36
  Range          0–10                      0–95               0–90                0–60

FIGURE 6.7
Summary of results in four studies of quantification for five terms referring to frequency. [Figure adapted from Chapter Reference 14.]

Analogous results were obtained when 94 physicians and 94 non-physicians responded to a similar challenge that contained 22 verbal modifiers.15 The investigators concluded that physicians had a better consensus on the meaning of the verbal expressions than laymen, but that physicians should take “no consolation” from the apparent superiority. According to the authors, the results “highlight the folly of assuming that any two randomly chosen physicians are likely to have similar percentages in mind” when these terms are used, and that laymen have an “even greater likelihood of misunderstanding” when the terms are communicated. The authors’ main recommendation was the subtitle of their paper: “Verbal specifications of frequency have no place in medicine.”

This argument was reinforced in a study16 showing that endoscopists had major differences, unrelated to experience or site of training, in the number of cm. intended when the size of a gastric ulcer was described as small, medium, or large. With this variation, an ulcer called large in a first examination and small in a second may actually have increased in size.

6.8.3.2 Probabilities — In other investigations of medical personnel, diverse ranks and ranges of values were obtained for 30 “probability estimates” in a study17 of 16 physicians; the magnitude of “rare” seemed to change with the pharmaceutical agent for which side effects were being estimated;18 different medical professionals were found19 to have relatively good agreement in quantitative beliefs about 12 expressions ranging from certain to never. In the last study, the numerical estimates were reasonably consistent within groups studied in different eras, and clinical context influenced the values but not the order of the quantitative ratings. Although recommending development of a “codification based on common usage,” Kong et al.19 wanted to do additional research (apparently not yet reported) “before proposing an overall scheme.” The summary results were regarded as possibly “adequate for codifying usage for physicians speaking to physicians,” but the authors believed that a formal code could not be enforced, because physicians and patients may want “to use qualitative expressions and some people are more comfortable with them than with exact values. Sometimes we want to be vague.”


6.8.4 Virtues of Imprecision

The virtues of preserving certain forms of “vagueness” have been noted by several commentators, including Mosteller and Youtz,20 who presented a meta-analysis of 20 studies of quantification for verbal expressions of probability. According to Clark,21 quantitative studies often fail to distinguish between “word meaning” and “word use.” Cliff22 argued that “words are inherently fuzzy and communicating degree of fuzziness is a significant aspect of communication ... (I)f boundaries were not vague, there would be endless debates about where such boundaries should lie.” Wallsten and Budescu23 stated that “probability phrases should not be made to appear more precise than they really are” and that sometimes an “imprecise judgment is useful information in and of itself.”

The imprecise qualitative terms have survived in human communication because they are necessary and desirable. Such terms are often used when an exact number is either unknown or not needed, or when the precise quantity, after being determined with a great deal of effort, might be misleading. For example, the word often was used in the preceding sentence because I would not want to identify or count all the possible occasions on which an exact quantity is best replaced with an imprecise term; and I doubt that citing proportions such as 8/21 or 37/112 would convey the idea better than often. The term a great deal was also used in that sentence because I have no data — and have no intention of getting the data — on how much effort was spent when the investigators did the cited research projects. At best, the amount of time might be measured or estimated, but no suitable rating scale exists for this type of effort. I feel sure, i.e., 100% certain, that the total research took more than a few days, but the term a great deal suitably conveys the idea.

When doctors communicate with one another or with patients, the exchanged words must be clearly and mutually understood. This understanding can be thwarted by both quantitative and nonquantitative features of the communication: an inadequate scale may be used to express risk;24 the doctor may be too busy or imperceptive; the patient may be too frightened or intimidated; the patient may have answered a self-administered questionnaire containing ambiguous phrases and concepts. In clinical practice, the direct personal exchange called history taking can help prevent or remedy the misunderstandings if both doctors and patients are aware of the possibilities and if both groups are constantly encouraged to ask, “What do you mean by that?”

Some of the desired quantifications may be statistically precise but clinically wrong. For example, prognosis in patients with cancer is commonly estimated from statistical results for morphologic categories of the TNM staging system. A patient in TNM Stage III, with an “inoperable” metastasized cancer, might be told, “Your chance of 6-month survival is 57%.” Nevertheless, with additional clinical classifications based on functional severity of illness,25 the TNM stage III can be divided into distinct clinical subgroups with prognoses for 6-month survival ranging from 5% to 85%. A physician who uses the 57% estimate may be given credit for being a good quantifier, but the patient is poorly served by statistically average results that may not pertain to individual clinical distinctions.

Efforts to quantify the unquantifiable have produced currently fashionable mathematical models, such as decision analysis and quality-adjusted life-years (QALYs), that have all the virtues of precision, quantitative analysis, economic comparisons, successful grant requests, and multiple academic publications, but none of the virtues of a realistic portrait of clinical decisions and human aspirations.26

Having made many efforts to improve quantification in clinical medicine, including the construction of specific rating scales for clinimetric phenomena27 that are now identified imprecisely or not at all, and being now engaged in further quantitative advocacy by writing a book on statistics, I trust that the foregoing comments will not be misunderstood. Workers in the field of health need quantitative information for its many invaluable and irreplaceable contributions to communication, reasoning, and human life. The rest of this text is devoted to the ways of analyzing that information. We also, however, need qualitative terms and quantitatively imprecise words for all the human phenomena and ideas that cannot or should not be subjugated into quantitative rigidity. Like workers in fireworks factories, we should remember that it is sometimes better to curse the darkness than to light the wrong candle.

I shall not demand that a great chef quantify a “pinch” of salt, that a superb violinist indicate the Hertzian frequency of a splendid vibrato, or that an excellent artist specify wave length or area for a “dab” of color. Trying to do a good job of writing, I hope you will let me end this chapter here. If so, I shall thank you very much, unless you insist on .863 units of gratitude.


References

1. Student, 1908; 2. Quenouille, 1949; 3. Tukey, 1958; 4. Fisher, 1925; 5. Simon, 1969; 6. Efron, 1979; 7. Efron, 1983; 8. De Leeuw, 1993; 9. Micceri, 1989; 10. Teigen, 1983; 11. Boffey, 1976; 12. Babbage, 1956; 13. Pillo-Blocka, 1991; 14. Robertson, 1983; 15. Nakao, 1983; 16. Moorman, 1995; 17. Bryant, 1980; 18. Mapes, 1979; 19. Kong, 1986; 20. Mosteller, 1990; 21. Clark, 1990; 22. Cliff, 1990; 23. Wallsten, 1990; 24. Mazur, 1994; 25. Feinstein, 1990d; 26. Feinstein, 1994; 27. Feinstein, 1987a.

Exercises

6.1. Here is a set of questions that will help prepare you for your next visit to the statistical citadels of Atlantic City, Las Vegas, or perhaps Ledyard, Connecticut:

6.1.1. A relatively famous error in the history of statistics was committed in the 18th century by d’Alembert. He stated that the toss of two coins could produce three possible outcomes: two heads, two tails, or a head and a tail. Therefore, the probability of occurrence was 1/3 for each outcome. What do you think was wrong with his reasoning?

6.1.2. The probability of tossing a 7 with two dice is 1/6. Show how this probability is determined.

6.1.3. The probability of getting a 7 on two consecutive tosses of dice is (1/6)(1/6) = 1/36 = .03 — a value smaller than the .05 level often set for “uncommon” or “unusual” events. If this event occurred while you were at the dice table of a casino, would you conclude that the dice were “loaded” or otherwise distorted? If not, why not?

6.1.4. The shooter at the dice table has just tossed a 6. What is the probability that his next toss will also be a 6?

6.1.5. At regular intervals (such as every 30 minutes) at many casinos, each of the identical-looking roulette wheels is physically lifted from its current location and is exchanged with a roulette wheel from some other location. The process is done to thwart the “system” of betting used by certain gamblers. What do you think that system is?

6.2. Using the chloride data of Table 4.1, please answer or do the following:

6.2.1. What is the probability that a particular person has a chloride level that lies at or within the boundaries of 94 and 104?

6.2.2. What interval of chloride values contains approximately the central 90% of the data?

6.2.3. Assume that the data have a Gaussian pattern with mean = 101.572 and standard deviation = 5.199. Using Table 6.4 and the mathematical formula for Gaussian Z-scores, answer the same question asked in 6.2.1.

6.2.4. Using the same Gaussian assumption and basic methods of 6.2.3, answer the same question asked in 6.2.2.

6.3. We know that a particular set of data has a Gaussian distribution with a standard deviation of 12.3. A single randomly chosen item of data has the value of 40. What would you do if asked to estimate the location of the mean for this data set?

6.4. An investigator has drawn a random sample of 25 items of dimensional data from a data bank. They have a mean of 20 with a standard deviation of 5. What mental screening process can you use (without any overt calculations) to determine whether the data are Gaussian and whether the mean is stable?

6.5. A political candidate feels confident of victory because he is favored by 67% (= 6/9) of potential voters in a political poll.

6.5.1. What simple calculations would you do — with pencil and paper, without using a calculator — to find the standard deviation, standard error, and coefficient of stability for this proportion? Do you regard it as stable?

6.5.2. Suppose the same proportion is found in a poll of 900 voters. Would you now regard the result as stable? What nonstatistical question would you ask before drawing any conclusions?

6.6. “Statistical significance” is usually declared at α = .05, i.e., for 2P values at or below .05. A group of investigators has done a randomized clinical trial to test whether a new active drug is better than placebo. When appropriately calculated, by methods discussed later in Chapter 13, the results show a Gaussian Z-score of 1.72 for a difference in favor of the active drug. Because the corresponding one-tailed P value is <.05, the investigators claim that the difference is “statistically significant.” The statistical reviewers, however, reject this claim. They say that the result is not significant, because the P value should be two-tailed. For Z = 1.72, the two-tailed P is >.05. What arguments can be offered on behalf of both sides of this dispute?

6.7. Find a published report — not one already cited in the text — in which an important variable was cited in a verbal scale of magnitude, i.e., an ordinal scale. Comment on what the investigators did to ensure or check that the scale would be used with reproducible precision. Do you approve of what they did? If not, what should they have done?

6.8. Here is a chance to demonstrate your own “quantitative” concepts of “uncertainty,” and for the class convener to tabulate observer variability in the ratings.

Please rank the following verbal expressions from 1 to 12, with 1 indicating the highest degree of assurance for definite occurrence and 12 the lowest.

Probable

Uncertain

Expected

Possible

Credible

Impossible

Likely

Certain

Doubtful

Hoped

Unlikely

Conceivable


7

Confidence Intervals and Stability:

Means and Medians

CONTENTS

7.1 Traditional Strategy
7.1.1 Goals of Sampling
7.1.2 Random Sampling
7.1.3 Reasons for Learning about Parametric Sampling
7.2 Empirical Reconstructions
7.2.1 Jackknifed Reconstruction
7.2.2 Bootstrap Resampling
7.3 Theoretical Principles of Parametric Sampling
7.3.1 Repetitive Sampling
7.3.2 Calculation of Anticipated Parameters
7.3.3 Distribution of Central Indexes
7.3.4 Standard Error of the Central Indexes
7.3.5 Calculation of Test Statistic
7.3.6 Distribution of the Test Statistics
7.3.7 Estimation of Confidence Interval
7.3.8 Application of Statistics and Confidence Intervals
7.3.9 Alternative Approaches
7.4 Criteria for Interpreting Stability
7.4.1 Intrinsic Stability: Coefficient of Stability
7.4.2 Extrinsic Boundary of “Tolerance”
7.5 Adjustment for Small Groups
7.5.1 The t Distribution
7.5.2 Explanation of Degrees of Freedom
7.5.3 Using the t Distribution
7.5.4 Distinctions of t vs. Z
7.6 Finite Population Correction for Unreplaced Samples
7.7 Confidence Intervals for Medians
7.7.1 Parametric Method
7.7.2 Bootstrap Method
7.7.3 Jackknife Method
7.8 Inferential Evaluation of a Single Group
7.8.1 One-Group t or Z Test
7.8.2 Examples of Calculations
Appendixes: Documentation and Proofs for Parametric Sampling Theory
A.7.1 Determining the Variance of a Sum or Difference of Two Variables
A.7.2 Illustration of Variance for a Difference of Two Variables
A.7.3 Variance of a Parametric Distribution
A.7.4 Variance of a Sample Mean
A.7.5 Estimation of σ

References

Exercises


The concepts of probability discussed in Chapter 6 can now be used for their critical role in evaluating the stability of a central index. For the evaluation, we want to determine not only the magnitude of possible variations when the index is “perturbed,” but also the probability of occurrence for each variation.

The perturbations that produce the magnitudes and probabilities can be done with modern methods of empirical reconstruction, such as the jackknife and bootstrap, or with the traditional parametric-sampling strategy.

7.1 Traditional Strategy

The traditional statistical strategy — which constantly appears in modern literature — is complicated, making its discussion particularly difficult to read and understand. The reason for the difficulty is that the customary statistical approach involves hypothetical entities such as standard errors and confidence intervals, and uses an array of mathematical theories that do not correspond to reality or to anything that might actually be done by a pragmatic medical investigator.

The theories were originally developed to deal with a real-world challenge called parametric sampling; and they have worked well for that challenge. The mathematical strategy behind the theories was then applied, however, for a quite different challenge in evaluating stability. The same parametric theories thus became used in two very different situations. For parametric sampling, the theories involve the hypothetical possibility of repeatedly taking an infinite number of samples from an unknown population that is seldom actually observed. For evaluating stability, however, we examine what was found in a single group of data.

7.1.1 Goals of Sampling

In a sampling process, we determine the mean, standard deviation, or other attributes, called parameters, of a large entity, called the parent population, that would be too difficult or costly to examine completely. For example, in industrial work, such as mining gold or making beer, the chemical testing of quality takes time and destroys the material being analyzed. We therefore use samples (often called aliquots) to save time and to avoid losing too much gold or beer. In marketing activities, the makers of a new product will examine samples of potential purchasers to appraise its possible public reception, because a test of the total population would be unfeasible and too expensive. The sampling activity called polltaking is constantly used by political or social scientists to determine beliefs or opinions in a much larger public than the people included in the sample. On election night, a sampling process is used when the media try to forecast the winner before all the precincts have been reported.

The sampling activities are not always successful; and the failures have become legendary examples of errors to be avoided or corrected. In the United States, many older persons still remember the fallacious marketing research that led to the Edsel automobile and to the (temporary) removal of classic Coca Cola from the market; and almost everyone can give an example of blunders in political polls or in election-night forecasts. In general, i.e., “on average,” however, the pragmatic results have confirmed (or “validated”) what was found with the theoretical sampling strategy.

7.1.2 Random Sampling

An important basis for the theoretical strategy is that the sampling be done randomly. The term random is not easy to define, but it requires that every member of the target or parent population have an equal chance of being selected for inclusion in the sample. This requirement is not fulfilled if the sample is chosen haphazardly (e.g., without a specific plan) or conveniently (e.g., whatever happens to be available in the persons who came to the clinic today or who mailed back a questionnaire). To get a truly random sample from a parent population requires special maneuvers, such as making selections by tossing a coin or using either a random-number generator or table of random numbers.
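The special maneuvers can be sketched with a standard pseudorandom-number generator. In this illustration, the parent population of 10,000 numbered members and the sample size of 25 are my own assumptions, chosen only to show the mechanics of an equal-chance selection:

```python
import random

# Hypothetical parent population: 10,000 numbered members
# (an illustrative assumption, not a data set from the text).
parent_population = list(range(10_000))

random.seed(42)  # fixed seed so the selection is reproducible

# random.sample draws without replacement, giving every member
# of the parent population an equal chance of being selected.
sample = random.sample(parent_population, k=25)

print(len(sample))        # 25
print(len(set(sample)))   # 25 (no member is chosen twice)
```

A table of random numbers serves the same purpose when done by hand; the generator merely automates the equal-chance requirement.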


Many sampling activities fail not because the mathematical theory was flawed, but because the sample was not a representative random choice. The material may not have been well mixed before the tested aliquots were removed. The group of people who mailed back the questionnaire may have been a relatively extreme sector of the target population. (Another nonmathematical source of error is the way the questions were asked. The inadequate wording or superficial probing of the questions is now regarded as the source of the marketing disasters associated with the Edsel and with “New Coke.”)

The necessity for random sampling is another problem that makes the parametric theory seem strange when used in medical research. Random sampling almost never occurs in clinical investigation, where the investigators study the groups who happen to be available. In a randomized trial, the treatments are assigned with a random process, but the patients who enter the trial were almost never obtained as a random sample of all pertinent patients. A random sampling may be attempted for the household surveys of public-health research or in the “random-digit dialing” sometimes used to obtain “control” groups in case-control studies. Even these activities are not perfectly random, however, because the results usually depend on whoever may be home when the doorbell or telephone rings and on the willingness of that person to participate in the study. Consequently, medical investigators, knowing that they work with potentially biased groups, not with random samples, are often uncomfortable with statistical theories that usually ignore sources of bias and that assume the groups were randomly chosen.

7.1.3 Reasons for Learning about Parametric Sampling

Despite these scientific problems and discomforts, the theory of parametric sampling is a dominant factor in statistical analysis today. The theory has been the basic intellectual “ethos” of statistical reasoning for more than a century. If you want to communicate with your statistical consultant, this theory is where the consultant usually “comes from.” The parametric-sampling theory is also important because it has been extended and applied in many other commonly used statistical procedures that involve neither sampling nor estimation of parameters. One of these roles, discussed in this chapter, is evaluating the stability of a central index. Other roles, discussed in later chapters, produce the Z, t, and chi-square tests frequently used for statistical comparisons of two groups. As noted earlier and later, however, the parametric theory may be replaced in the future by “computer-intensive” statistical methods for evaluating stability and comparing groups.

Even if this replacement occurs, however, the parametric theory is still an effective (and generally unthreatened) method for the challenge of determining an adequate sample size before a research project begins. Furthermore, despite the complicated mathematical background, the parametric methods lead to relatively simple statistical formulas and calculations. This computational ease has probably been responsible for the popularity of parametric procedures during most of the 20th century, and will continue to make them useful in circumstances where computers or the appropriate computer programs are not available.

For all these reasons, the parametric theory is worth learning. Before it arrives, however, we shall first examine two relatively simple methods — the jackknife and bootstrap procedures introduced in Chapter 6 — that can yield appraisals of stability without the need for complex mathematical reasoning. The intellectual ardors of parametric theory will begin in Section 7.3, and will eventually be followed by a pragmatic reward: the application of your new parametric knowledge to construct a common statistical procedure called a one-group t test.

7.2 Empirical Reconstructions

A parametric sampling, such as a political poll, must be planned “prospectively” before any data are available for analysis. Consequently, the strategy must be based on theoretical principles. If data have already been obtained, however, the stability of the central index can be evaluated “retrospectively” with an empirical reconstruction of the observed results.

© 2002 by Chapman & Hall/CRC

The various empirical rearrangements are based on what was actually observed and do not require theoretical reasoning about a larger, parametric population. Like the parametric approach, the empirical strategy depends on distributions, but each distribution is generated in an ad hoc manner from the observed set of data. Furthermore, the empirical strategies — applied to data obtained without a sampling process — are used only for drawing conclusions about the observed group. The conclusions are not expected or intended to refer to a larger parametric population.

The jackknife reconstruction technique, as shown in Section 7.2.1, determines what would happen to the “unit fragility” of the central index (or some other selected value) if a member is successively removed from (and then returned to) the data set. The bootstrap reconstruction technique, as shown earlier in Section 6.4.3 and here in Section 7.2.2, is a resampling process that determines what would happen in samples prepared as rearrangements of all or some of the observed members of the data.
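The resampling idea behind the bootstrap can be sketched in a few lines. This is only an illustrative sketch: the choice of 1000 resamples and of a 95% inner zone are my assumptions, and the data are the blood sugar values of Exercise 3.3, not a procedure prescribed by the text:

```python
import random
import statistics

# Blood sugar values from Exercise 3.3 (the data later shown in Table 7.1)
observed = [62, 78, 79, 80, 82, 82, 83, 85, 87, 91,
            96, 97, 97, 97, 101, 120, 135, 180, 270, 400]

random.seed(1)  # fixed seed so the resampling is reproducible
boot_means = []
for _ in range(1000):
    # Each bootstrap sample draws n members WITH replacement
    # from the observed group itself, not from a parent population.
    resample = random.choices(observed, k=len(observed))
    boot_means.append(statistics.mean(resample))

boot_means.sort()
# Empirical 95% inner zone for the mean:
# the central 950 of the 1000 resampled means.
lo, hi = boot_means[25], boot_means[974]
print(round(lo, 1), round(hi, 1))
```

Because each distribution is generated ad hoc from the observed data, the conclusions refer only to the observed group, as noted above.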

7.2.1 Jackknifed Reconstruction

The jackknife method, which is seldom formally used for checking the stability of a central index, is presented here mainly to give you an idea of how it works and to indicate the kinds of things that can be done when investigators become “liberated” from parametric doctrines. You need not remember any of the details, but one aspect of the process, discussed in Sections 7.2.1.2 and 7.2.1.3, offers a particularly simple “screening test” for stability of a central index.

7.2.1.1 Set of Reduced Means — The jackknife process works by examining the variations that occur when one member is successively removed (and then restored) in the data set. If the data set has n members, each symbolized as Xi, the mean value is X̄, calculated as ΣXi/n. If one member of the set, Xi, is removed, the reduced set of data will have n − 1 members, and the “reduced” mean in the smaller data set will be X̄i = (ΣXj − Xi)/(n − 1) = (nX̄ − Xi)/(n − 1). As each Xi is successively removed, a total of n reduced groups and n reduced means will be created.

For example, consider the group of blood sugar values that appeared in Exercise 3.3. Table 7.1 shows what would happen successively as one member of the group is removed. Because the reduced means range from values of 105.37 to 123.16, we can be completely (i.e., 100%) confident that any of the reduced means will lie in this zone. Because the reduced set of means has 20 values, we can exclude the two extreme values to obtain a 90% inner zone of 18 values whose ipr90 extends from 112.21 to 122.32. If we want a narrower zone of values, but less confidence about the results, we can prepare an ipr80 by removing the two highest and two lowest values. With this demarcation, we could be 80% confident that the reduced mean lies between 116.95 and 122.26.

7.2.1.2 Decisions about Stability — The jackknife procedure offers a particularly useful “screening” tactic for stability of a central index. The “screening” occurs when we consider both the “distress” that would be evoked by some of the extreme values that appear in the array of reduced jackknife means, and also the possibility that such extreme values might occur.

Thus, in the data of Table 7.1, the most extreme of the 20 values is 400. Its removal, which has a random chance of 1 in 20, would lower the reduced mean to 105.37. This drop is a reduction of (120.1 − 105.37)/120.1 = 12.3% from the original value. How distressed would we be if 105.37 were the true mean of the data set, thereby making 120.1 a misrepresentation? Conversely, if the true mean is 120.1, how seriously would the data be misrepresented if the last member (with a value of 400) had been omitted, so that 105.37 were used as the observed mean?

In this data set, which has 20 members, the (one-tailed) probability is only 1/20 = .05 that 105.37 would become the reduced mean. The ipr90 mentioned earlier for the reduced mean denotes a probability of only .10 that the reduced mean would drop to 112.21 (a proportionate change of about 7%) or lower. Because the highest value (123.16) for the reduced mean is not too disparate from the original value, we would be concerned mainly about the removal of extreme individual values such as 270 or 400. In highly skewed data sets, such as the blood sugar values in Table 7.1, the relative frequency of possible jackknife changes may therefore be appraised in a “one-tailed” rather than “two-tailed” direction.
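The screening arithmetic above (the reduced means, the ipr90 zone, and the 12.3% proportionate drop) is easy to reproduce. Here is a minimal sketch, using the blood sugar values listed for Table 7.1; the code itself is my illustration, not part of the text:

```python
# Jackknife screening for the blood sugar values of Exercise 3.3 (Table 7.1)
data = [62, 78, 79, 80, 82, 82, 83, 85, 87, 91,
        96, 97, 97, 97, 101, 120, 135, 180, 270, 400]
n = len(data)
mean = sum(data) / n                  # 2402/20 = 120.10

# Reduced mean after removing member X_i: (n*mean - X_i)/(n - 1)
reduced = sorted((n * mean - x) / (n - 1) for x in data)

print(round(min(reduced), 2))         # 105.37 (after removing 400)
print(round(max(reduced), 2))         # 123.16 (after removing 62)

# ipr90 zone: exclude the single highest and lowest of the 20 reduced means
print(round(reduced[1], 2), round(reduced[-2], 2))    # 112.21 122.32

# Proportionate drop if the extreme value 400 is removed
print(round(100 * (mean - min(reduced)) / mean, 1))   # 12.3 (percent)
```

Note that each reduced mean comes directly from the formula (nX̄ − Xi)/(n − 1); no resampling or theoretical distribution is needed.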


TABLE 7.1
Formation of Reduced Means and Reduced Medians for Data Set in Exercise 3.3

Values of original group: 62, 78, 79, 80, 82, 82, 83, 85, 87, 91, 96, 97, 97, 97, 101, 120, 135, 180, 270, 400. ΣXi = 2402; X̄ = 120.10.

Value       Sum of Values in      Reduced      Reduced
Removed     Reduced Group         Mean         Median
  62        2340                  123.16       96
  78        2324                  122.32       96
  79        2323                  122.26       96
  80        2322                  122.21       96
  82        2320                  122.11       96
  82        2320                  122.11       96
  83        2319                  122.05       96
  85        2317                  121.95       96
  87        2315                  121.84       96
  91        2311                  121.63       96
  96        2306                  121.37       91
  97        2305                  121.32       91
  97        2305                  121.32       91
  97        2305                  121.32       91
 101        2301                  121.11       91
 120        2282                  120.11       91
 135        2267                  119.32       91
 180        2222                  116.95       91
 270        2132                  112.21       91
 400        2002                  105.37       91

X̄ of reduced means = 120.10; s = 4.128

Thus, with no special calculations except “screening” by the jackknife perturbation, there is a .05 chance that the mean might become proportionately 12% lower than its original value, and a .1 chance that it will be changed by 7% or more. If troubled by these possibilities, you will reject the observed mean as unstable. If untroubled, you will accept it. In the absence of standards and criteria, the decision is “dealer’s choice.”

The main virtue of this type of jackknife “screening” is that you can determine exactly what would happen if one person were removed at each extreme of the data set. You need not rely on an array of theoretical parametric samplings or empirical bootstrap resamplings to determine and then interpret a standard error and confidence interval.

7.2.1.3 Screening with the Jackknife — To evaluate stability of a central index, the jackknife tactic is valuable mainly for mental screening, aided by a simple calculation. For the procedure, remove one item from an extreme end of the dimensional data set and determine the reduced mean. If it is proportionately too different from the original mean, or if it takes on a value that extends beyond an acceptable external boundary, and if the data set is small enough for the event to be a reasonably frequent possibility, the original mean is too fragile to be regarded as stable. The internal and external criteria for stability will be further discussed in Section 7.4.

7.2.2 Bootstrap Resampling

Of the two empirical methods for checking the stability of a mean, the complete bootstrap tactic is easy to discuss but hard to demonstrate. Unless the bootstrapped samples are relatively small, a computer is required to do all the work. Furthermore, even with a computer, the complete bootstrap tactic becomes formidable as the basic group size enlarges. For example, as noted earlier in Section 6.4.3, the complete
