
11.8.2 Origin of Two-Tailed Demands

During the early growth of statistical inference in the 20th century, the mathematical hypotheses were almost always bidirectional. The bidirectional approach probably became established in the days when substantial statistical thought, particularly by R. A. Fisher and his disciples, was devoted to agricultural experiments in which two active treatments, A and B, would be compared to determine which was more effective. Since placebos were not used in the agricultural work, it was reasonable to expect that either A or B might be superior; and a bidirectional scientific hypothesis was entirely appropriate.

As statistical methods became increasingly used in medical and psychosocial research, the reviewers and editors began to demand that claims of “significance” be accompanied by P values; and a level of α = .05 was set for the boundary of “significance.” For the small groups of animals or people who were often studied in the research, the investigators soon discovered that a “nonsignificant” two-tailed P value could sometimes become “significant” if given a one-tailed interpretation. (An example of this event occurred in Exercise 7.4.)

The conversion to one-tailed interpretations could thus salvage results that might otherwise be dismissed as “nonsignificant.” For example, an investigator studying the cultural ambiance of different cities, having found a two-tailed P value of .09 for the superiority of Bridgeport over New Haven, could transform the result to a one-tailed P < .05. Although the superiority of Bridgeport had not been anticipated when the research began, the investigator might nevertheless claim that the research had produced “significant” stochastic support for the concept.

To avoid this type of retrospective manipulation for research hypotheses that had not been clearly stated in advance and that were developed to fit the observed data, the demand was made that all values of P or α be interpreted in a two-tailed manner. This policy has now become well established at prominent medical journals and at various agencies where research results are evaluated.

11.8.3 Controversy about Directional Choices

In modern medicine, however, many investigations are deliberately aimed in a single direction. In particular, whenever efficacy is compared for an active agent vs. placebo, the goal is almost always to show superiority for the active treatment. If Excellitol is being tested as a new analgesic agent, we want to show that it relieves pain better than placebo. If the observed results show that placebo is better than Excellitol, the next step would not be to seek stochastic support for the results. Instead, we would look for a different, more effective active agent.

A reasonable argument might then be offered that Excellitol, although better than placebo in relieving pain, might be worse in producing more adverse side effects such as nausea. If so, however, the data about nausea would be recorded in a variable other than pain. The results for the pain and nausea variables would be tested separately, with each test having its own null hypothesis. Therefore, when we stochastically test the results in variables for pain relief and for occurrence of nausea, each test could be done with the one-tailed hypothesis that results for Excellitol will be larger than those for placebo.

If you accept the latter approach, it leads to a subtle problem. Suppose the patient gives a separate single “global” rating for the overall effect of treatment. This rating will incorporate both the good things such as pain relief, and the bad things such as nausea. Should this overall rating be interpreted with a one-tailed or two-tailed hypothesis? Because we may not know in advance whether the nauseous effects of Excellitol are severe enough to overwhelm its anticipated analgesic benefits, we might argue that placebo could turn out to be a better overall agent. Therefore, the hypothesis should be two-tailed. On the other hand, returning to the investigator's original goal, the aim of the research is to show superiority for Excellitol. It will receive no further investigative or public attention if it does not produce a better overall rating. Therefore, the hypothesis should be one-tailed.

The decisions about unidirectional vs. bidirectional hypotheses can become much more complex, particularly when the research is concerned with effects of more than one treatment, with stopping (rather than starting) a treatment in a clinical trial, and with diverse other situations. Because the decision in each situation may involve subtle substantive issues, an easy way to avoid them is to insist that all hypotheses be examined in a two-tailed direction. The policy has the advantage of being clear and easy to enforce, but also has some distinct disadvantages. According to the opponents, the policy is too rigid: it substitutes arbitrary mathematical dogma for careful scientific reasoning; and it needlessly raises the expenses (and risks) of research, because much larger group sizes are needed to obtain stochastic significance for the same quantitative results.

For example, suppose we anticipate finding that the mean for Treatment A is 4 units better than placebo in a study where the standard deviation of the pooled data is expected to be 15. The formula for sample-size calculations will be discussed later in greater detail, but a simple example can be instructive here. To calculate sample size for two equal-sized groups that would produce a two-tailed P value of .05 for the anticipated distinction, we would choose Z.05 = 1.96 and solve the equation:

1.96 ≤ 4/(15/√n)

where n is the size of one group. The result would be √n ≥ (1.96)(15)/4, and so n ≥ [(1.96)(15)/4]² = 54. If we want only a one-tailed P value of .05, however, Z.1 would be 1.645. The required sample size would drop to n ≥ [(1.645)(15)/4]² = 38. The number of patients (and cost) for conducting the trial would be proportionately reduced by 30% [= (54 − 38)/54].
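
To make the arithmetic concrete, here is a minimal Python sketch of the simplified formula above (the function name and layout are illustrative, not from the text):

```python
def n_per_group(delta, sd, z):
    """Lower bound on group size n satisfying z <= delta / (sd / sqrt(n)),
    i.e., n >= (z * sd / delta) ** 2, as in the simplified formula above."""
    return (z * sd / delta) ** 2

print(n_per_group(4, 15, 1.96))   # 54.02..., about 54 per group (two-tailed .05)
print(n_per_group(4, 15, 1.645))  # 38.05..., about 38 per group (one-tailed .05)
```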

11.8.4 Compromise Guidelines

Different authorities have taken opposing positions on this issue, and the controversy is currently unresolved. In one policy, tests of hypotheses should always be two-tailed; in the other policy, one-tailed tests are allowed in appropriate circumstances.

The following “compromise” guidelines would probably be acceptable to all but the most rigid adherents of the two-tailed policy:

1. Always use a two-tailed procedure if an advance unidirectional hypothesis was not stated before the data were examined.

2. A one-tailed test is permissible, however, if the appropriate hypothesis was stated before the data were examined and if the direction of the hypothesis can be suitably justified.

3. If the “significance” reported with a one-tailed test would not persist with a two-tailed test, the distinction should be made clear in the text of the report.

4. Because your statistical colleagues will generally be happier with two-tailed tests, use them whenever possible. Thus, if the “significance” achieved with a previously stated one-tailed hypothesis remains when the test is two-tailed, report the two-tailed results.

Regardless of whether you decide to work at levels of α, 2α, or α/2, however, another basic and controversial question (although currently less prominent) is whether any rigid boundary should be set for the stochastic decisions. This question is discussed in Section 11.12.

To avoid all of the foregoing reasoning and arguments, investigators can report the exact value of the two-tailed P, and then let the reader decide. This approach, which was not possible with the older tabulated “look-up” values that would allow only statements such as P > .05 or P < .05, can now be used when modern computer programs report exact values such as .087 or .043 for the two-tailed P. Receiving this result, the reader is then left to interpret it with whatever authoritative guidelines seem most persuasive. The only problem with the “computerized-P” approach is that (as discussed later) the one-tailed P is not always simply half of the two-tailed P.

11.9 Alternative Hypotheses

As the opposite of the null hypothesis, the counter-hypothesis merely states a direction. It does not state a magnitude. Suppose we have observed that X̄A = X̄B + 5. The null hypothesis for testing this result is µA − µB = 0. The subsequently determined P value or confidence interval reflects the probability of observing a difference of 5 if the null hypothesis is correct, but the counter-hypothesis does not specify the magnitude of the difference. The counter-hypothesis is merely |µA − µB| > 0 if two-tailed, and µA − µB > 0 if one-tailed.

If we are concerned about how large the difference might really be, it becomes specified with an alternative hypothesis. For example, if we want to show that X̄A − X̄B is not really as large as 12, the alternative hypothesis would be µA − µB ≥ 12.

An alternative hypothesis is often stipulated when the observed result appears to be “negative,” i.e., when the customary null hypothesis is not rejected. For example, if P > .3 for the stochastic test when X̄A − X̄B = 5, we can conclude that a “significant” stochastic difference was not demonstrated. We might wonder, however, whether an important quantitative difference was missed. If 12 is chosen as the magnitude of this difference, we could reappraise the data under the alternative hypothesis that µA − µB ≥ 12. If the latter hypothesis is rejected, we could then conclude that the observed result of X̄A − X̄B = 5 is not likely to reflect a difference that is at least as large as 12.

For showing that an observed difference is stochastically significant, the level of α indicates the likelihood of a false positive conclusion if the ordinary null hypothesis (µA − µB = 0) is rejected when it is correct. If an alternative hypothesis is correct, however, we run the risk of a false negative conclusion when the null hypothesis of 0 is conceded. The ideas will be discussed in detail later in Chapter 23, but two main points can be noted now. The first point is that the extremes of the confidence interval are often explored for questions about the possibility of false negative conclusions. For example, an observed increment of .08 in two proportions might be regarded as “nonsignificant” both quantitatively (because it is too small) and stochastically, because P > .05 and because the 95% confidence interval, from −.06 to +.22, includes 0. The possibility of a quantitatively significant difference (e.g., .15) cannot be dismissed, however, because the confidence interval extends to .22.
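
As a minimal sketch of this first point, the interval above can be reproduced and “screened” in a few lines; the standard error of .0714 is an assumption chosen so that the computed interval matches the one quoted in the text:

```python
d, se, z = 0.08, 0.0714, 1.96      # observed increment; assumed SE; two-tailed .05
lo, hi = d - z * se, d + z * se    # 95% confidence interval for the increment
print(round(lo, 2), round(hi, 2))  # -0.06 0.22 -> includes 0, so "nonsignificant"
print(hi >= 0.15)                  # True -> a difference of .15 cannot be dismissed
```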

The second point is that although examining the extremes of confidence intervals is an excellent way to “screen” for alternative possibilities, an alternative hypothesis (such as πA − πB ≥ .15) is often formally tested for showing that an observed difference is “insignificant.” An additional stochastic boundary, called β, is usually established for this decision. The level of β indicates the likelihood of a false negative conclusion if the alternative hypothesis is rejected when it is correct.

In statistical jargon, the false-positive conclusion is often called a Type I error, and the false-negative conclusion, a Type II error. The value of 1 – β is often called the statistical power of the study, i.e., its ability to avoid a Type II error. This additional aspect of “statistical significance” introduces a new set of ideas and reasoning that will be further discussed in Chapter 23.

11.10 Multiple Hypotheses

Another important question is what to do about α when a series of comparisons involves multiple hypotheses about the same set of data or multiple tests of the same basic hypothesis. The value established for α indicates only the chance of getting a false positive result in a single comparison where the null hypothesis is true. With multiple comparisons, however, the level of α may be misleading. For example, suppose we do 20 randomized trials of a treatment that is really no better than placebo. If we use a two-tailed level of α = .05 for each trial, we might expect by chance that one of those trials will produce a “significant” result. If we use a one-tailed level of α = .1, two of the trials might produce such results.

11.10.1 Previous Illustration

This same problem occurred earlier in considering whether the dice were “loaded” if two consecutive 7’s were tossed. In the multiple events that occur when the two dice are tossed on many occasions, a pair of consecutive 7’s can readily appear by chance if the dice are perfectly fair. Because the probability of tossing a 7 is 6/36 = 1/6, the probability of getting two consecutive 7’s is (1/6)(1/6) = 1/36. At a dice table where the action is fast, two consecutive 7’s could readily appear in the short time interval consumed by 36 tosses.


To determine the chance of getting at least one 7 in several consecutive tosses, we can note that the chance of getting it in one toss is 1/6 and the chance of not getting it is 5/6. Therefore, the chance of not getting a 7 is (5/6)(5/6) = .69 for two tosses, (5/6)⁵ = .40 for five tosses, and (5/6)²⁰ = .03 for 20 tosses. Thus, the chances that a 7 will appear at least once are .17 (= 1/6) in one toss, .31 (= 1 − .69) in two tosses, .60 in five tosses, and .97 in twenty tosses. If the null hypothesis is that the tossed dice are “loaded” to avoid a value of 7, the hypothesis will regularly be rejected by chance alone if tested on multiple occasions.
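
The arithmetic above is the generic “at least once” computation, 1 − (1 − p)^k, which can be sketched as:

```python
p7 = 6 / 36                                 # chance of tossing a 7 with two fair dice
for k in (1, 2, 5, 20):                     # number of tosses
    print(k, round(1 - (1 - p7) ** k, 2))   # -> .17, .31, .6, .97
```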

When you carry your knowledge of statistics into the pragmatic world of gambling casinos, this phenomenon may help you decide when to bet against a shooter who is trying to make a non-7 “point” before tossing a 7.

11.10.2 Mechanism of Adjustment

In the world of medical research, the chance results that can emerge from multiple testing may require an adjustment or precaution, which usually involves a change of the α level for individual decisions. As seen in the previous illustration of tossing dice, the false positive boundary of α for a single stochastic test becomes a true negative probability of 1 − α. For two tests, the true negative probability is (1 − α)(1 − α), so that the false positive boundary becomes 1 − (1 − α)². For k tests, the false positive boundary becomes 1 − (1 − α)^k. Consequently, with k tests, the α level for a false positive decision is really raised to 1 − (1 − α)^k. This formula indicates why, if 1/6 was the chance of a 7 appearing once, its chance of appearing at least once in 20 tosses was 1 − [1 − (1/6)]²⁰ = 1 − (5/6)²⁰ = 1 − .03 = .97.

Many proposals have been made for strategies that lower α to an individual value of α′ that allows the final overall level of 1 − (1 − α′)^k to be ≤ α. The diverse strategies have many eponymic titles (Duncan, Dunnett, Newman–Keuls, Scheffé, Tukey) based on various arrangements of data from an “analysis of variance.” The simplest, easiest, and most commonly used approach, however, is named after Bonferroni. In the Bonferroni correction for k comparisons, α′ = α/k. Thus, for four comparisons, an α of .05 would be lowered to α′ = .05/4 = .0125 as the level of P required for stochastic significance of each individual comparison. With α′ = .0125, the “final” level of α = 1 − (.9875)⁴ = 1 − .95 = .05.
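
A minimal sketch of the Bonferroni arithmetic just described (the function name is illustrative):

```python
def bonferroni(alpha, k):
    """Per-comparison level alpha' = alpha / k for k comparisons."""
    return alpha / k

a_prime = bonferroni(0.05, 4)      # 0.0125
family = 1 - (1 - a_prime) ** 4    # 1 - (.9875)**4 = 0.049...
print(a_prime, round(family, 4))   # the "final" level stays at or below .05
```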

11.10.3 Controversy about Guidelines

Although the Bonferroni correction is often used as a mechanism for how to do the adjustment, no agreement exists about when to do it.

Suppose an investigator, believing active treatment A is better than active treatment B, does a randomized trial that also includes a placebo group (for showing that both treatments are actually efficacious). Should the α level be lowered to α/3 for the three comparisons of A vs. B, A vs. placebo, and B vs. placebo? As another issue, suppose an investigator regularly checks the results of an ongoing trial to see if it should be stopped because a “significant” result has been obtained. If k such checks are done, should each check require an α′ level of α/k? Yet another question is what to do about α in “data dredging” or “fishing expeditions,” where all kinds of things might be checked in more than 500 comparisons searching for something “significant.” If α is lowered from .05 to the draconian level of .0001 for each comparison, even a splendid “fish” may be rejected as “nonsignificant” when caught.

Because the answers to these questions involve more than mathematical principles alone, no firm guidelines (or agreements) have yet been developed for managing the challenges. The issues are further discussed in Chapter 25.

11.11 Rejection of Hypothesis Testing

Regardless of whether the decisions are one-tailed or two-tailed, the most difficult task in all of the stochastic reasoning is to choose a boundary for rejecting null hypotheses or drawing conclusions from confidence intervals.


11.11.1 Complaints about Basic Concepts

Some authors avoid this choice entirely by rejecting the basic ideas of testing stochastic hypotheses and drawing conclusions about “significance.” For example, in the encyclopedic four volumes called The Advanced Theory of Statistics, Maurice Kendall and Alan Stuart4 refuse to use the terms null hypothesis and significance because they “can be misleading.” Reservations about stochastic testing were also stated by two prominent leaders in the American statistical “establishment,” William Cochran and Gertrude Cox:5 “The hard fact is that any statistical inference made from an analysis of the data will apply only to the population (if one exists) of which the experiments are a random sample. If this population is vague and unreal, the analysis is likely to be a waste of time.”

In a remarkable book called The Significance Test Controversy,6 published over 30 years ago, investigators mainly from the psychosocial sciences lamented the harm done by “indiscriminate use of significance tests.” The editors of the book concluded “that the significance test as typically employed … is bad statistical inference, and that even good statistical inference … is typically only a convenient way of sidestepping rather than solving the problem of scientific inference.”

Amid the many attacks (and defenses) in the 31 chapters of the book, none of the writers mentioned the idea of evaluating stability of the numbers. The main complaints about “significance testing” were that the research could not be statistically generalized because it usually contained “convenience” groups rather than random samples, that errors in the reliability of the basic data were usually ignored by the tests, that significance should denote substantive meaning rather than a probabilistic magnitude, and that infatuation with stochastic significance had overwhelmed the priority of attention needed for substantive significance.

In one of the chapters of the book, Joseph Berkson, the leading medical biostatistician of his era, objected to the null hypothesis because “experimentalists [are not] typically engaged in disproving things. They are looking for appropriate evidence for affirmative conclusions. … The rule of inference on which [tests of significance] are supposed to rest has been misconceived, and this has led to certain fallacious uses.”

11.11.2 Bayesian Approaches

The proponents of Bayesian inference are willing to calculate stochastic probabilities, but complain that the classical frequentist approaches are unsatisfactory. Some of the complaints7,8 are as follows:

1. It is counter-intuitive and may be scientifically improper to draw conclusions about “more extreme values” that were not observed in the actual data.

2. Two groups will seldom if ever be exactly equivalent, as demanded by the null hypothesis.

3. Confused by the contradictory method of forming hypotheses, many readers mistakenly believe that P values represent the probability that the null hypothesis is true.

4. Conventional confidence intervals do not solve the problem, since they merely indicate the potential frequentist results if the same study had unlimited repetitions.

5. Frequentist approaches are “rigidly” dependent on the design of the research, whereas Bayesian methods “flexibly” allow the application of subjective probabilities, derived from all available information.

In the frequentist approach, the stochastic conclusion indicates the probability of the data, given the hypothesis. In the Bayesian approach, the stochastic conclusion is a “posterior” determination of the probability of the hypothesis, given the data. This determination requires that a subjective appraisal (i.e., an enlightened guess) of a value for the prior probability of the hypothesis be multiplied by the value of a likelihood function, which is essentially the probability of the data given the hypothesis. In an oversimplified summary of the distinctions, the Bayesian’s conclusive P value is produced when the frequentist’s conclusive P value is modified by a subjective prior probability, and denotes the chance that the prior hypothesis is correct.


The controversy recently received an excellent and often comprehensible discussion in a special issue of Statistics in Medicine,9 where the respective “cases” were advocated and later attacked on behalf of either frequentism10 or Bayesianism10 in clinical trials.

11.12 Boundaries for Rejection Decisions

Despite the cited reservations and denunciations, stochastic tests of significance have survived, prospered, and prevailed. They are now routinely demanded by editors, reviewers, granting agencies, and regulatory authorities. As an investigator, you usually cannot escape the tests if you want to get your research funded or its publication accepted; and as a reader, you will find results of the tests appearing in most of the papers published in respectable journals. The Bayesian alternatives may have many merits, but they are seldom used.

Regardless of whether the customary tests are good or bad, worthwhile or harmful, they are there; and they will continue to be there for the foreseeable future. Accordingly, like it or not, we have to consider the daunting task of setting boundaries for decisions to reject a stochastic null hypothesis.

In stochastic hypotheses the observed result always has a chance, however tiny, of having occurred under the null hypothesis. The decision to reject the hypothesis will therefore be wrong if it was true on that occasion. The value set for α indicates the frequency with which we are willing to be wrong. Thus, if α is set at .05, we accept the chance of being wrong in one of every twenty times that the null hypothesis is rejected. (The wry comment has been made that “Statisticians are the only members of society who reserve the right to be wrong in 5% of their conclusions.”)

Relatively few people would calmly accept the idea that their decisions in daily life would have so high a frequency of error. Physicians practicing medicine (with or without the additional scrutiny of lawyers) might be unable to maintain an unperturbed aequanimitas if one clinical decision in 20 was flagrantly wrong. Nevertheless, if statistical inference requires hypothesis testing, a boundary must be set for the rejection zone.

11.12.1 P Values and α

Since probability values are measured in a continuum that extends from 0 to 1 (or from 0% to 100%), choosing a boundary to demarcate a small enough level of α is particularly invidious. The decision is somewhat like answering the question, “How large is big?” or “How much is enough?” If asked one of those two questions, you would probably answer, “It all depends.”

Nevertheless, a specific boundary was suggested many years ago by R. A. Fisher,11 when he introduced the name tests of significance for the rearrangement procedures. As noted earlier, Fisher set α at .05 because about two standard deviations around the mean would span the 95% inner zone of data in a Gaussian distribution. He wrote, “It is convenient to take this point (i.e., α = .05) as a limit in judging whether a deviation is to be considered significant or not. Deviations exceeding twice the standard deviation are thus formally regarded as significant.”

In a posthumous biography, Fisher’s daughter12 stated that he later came “to deplore how often his own methods were applied thoughtlessly, as cookbook solutions, when they were inappropriate or, at least, less informative than other methods.” In subsequent writings after his initial proposal of .05, Fisher himself did not maintain the 5% boundary. He repeatedly referred to using a “1% level or higher” for stochastic decisions; and eventually he began to discourage the use of any fixed boundary. “The calculation is absurdly academic,” he wrote in 1959, “for in fact no scientific worker has a fixed level of significance at which from year to year, in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in light of his evidence and his ideas.”13

Although many other leaders in the world of statistics have also deplored the fixed boundary of .05, it has become entrenched in medical research. Reviewers and editors may adamantly refuse to publish results in which the P value reached the disastrous heights of .051, while happily accepting other studies that had unequivocal “significance,” with P = .049. Like many past medical doctrines that were maintained long after they had been discredited, the “P < .05” boundary will probably continue until it is replaced by something better (or equally doctrinaire).

11.12.2 Confidence Intervals

The problem of choosing a boundary for α is not eliminated by current arguments that P values be replaced by confidence intervals for stochastic decisions. The main contentions in the dispute are whether more information is provided by a P value or a confidence interval; but the choice of a boundary is not discussed. Because of the dominant role of α and Zα (or tα) in both calculations, a confidence interval is a type of “reciprocal” P value, as shown later in Chapter 13. A boundary of α must be chosen regardless of whether we calculate P to see if P ≤ α, or choose Zα (or tα) and then inspect the contents of the calculated confidence interval. The decision about rejecting the null hypothesis is based on exactly the same data and reasoning, whether the α boundary is examined directly with P or indirectly with the lower (or upper) limit of a confidence interval calculated with Zα or tα.

Consequently, the main advantage of a confidence interval is not the avoidance of an arbitrary α boundary for stochastic decisions. Instead, the confidence interval shows an “other side” that is not displayed when the null hypothesis is conceded because of a too large P value. For example, the null hypothesis would have to be conceded if an observed difference in means was 175, with 102 as the standard error of the difference. The Z value would be 175/102 = 1.72, which is below the Zα of 1.96 required for 2P < .05; and the 95% confidence interval, calculated as 175 ± (1.96)(102), would extend from −25 to 375, thus including 0. The same stochastic decision of “nonsignificant” would be reached with either approach as long as α = .05. The main merit of the confidence interval, however, would be its upper limit of 375, indicating that the “nonsignificant” difference of 175 might be as large as 375.

On the other hand, if α were given a one-tailed interpretation, Z.1 = 1.645. The Z value of 1.72 exceeds this boundary, so that P would be < .05; and the confidence interval calculated as 175 − (1.645)(102) would have a lower limit of 7.2, thus excluding 0. With the one-tailed boundary for α, both methods (the P value and confidence interval) would lead to rejection of the null hypothesis and a stochastic proclamation of “significance.”
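
The two interpretations of this example can be reproduced with a short sketch (variable names are illustrative):

```python
d, se = 175, 102                   # observed difference in means; its standard error
print(round(d / se, 2))            # Z = 1.72
for z, label in ((1.96, "two-tailed .05"), (1.645, "one-tailed .05")):
    lo, hi = d - z * se, d + z * se
    print(label, round(lo, 1), round(hi, 1),
          "includes 0" if lo <= 0 else "excludes 0")
# two-tailed:           -24.9 to 374.9, includes 0 -> "nonsignificant"
# one-tailed boundary:    7.2 to 342.8, excludes 0 -> "significant"
```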

11.12.3 Descriptive Boundaries

Perhaps the only way to avoid a boundary for stochastic decisions about “significance” is to convert the focus of the decision. Instead of examining probabilities for events that might happen when the data are rearranged, we might directly inspect the descriptive possibilities.

For example, suppose we set 300 as the lowest value of δ for a quantitatively significant increment between two groups. With this boundary, an observed value below δ would not be dismissed if a reasonable rearrangement would bring the results above δ. With this approach, the observed value of 175 in the foregoing example would be regarded as “nonsignificant” if the stochastic α was set at the two-tailed value of .05. On the other hand, with a reasonable rearrangement of the data (in this instance, using a 95% confidence interval), the value of δ = 300 would be included in the zone of −25 to 375. The observed increment, despite its failure to pass the stochastic hurdle of “significance,” could not be dismissed as insignificant.

If the stochastic α had the one-tailed value of .1, however, the increment of 175 would be regarded as “significant.” Nevertheless, if we set 30 as the highest descriptive boundary of ζ for a quantitatively insignificant difference, the one-tailed 95% confidence interval that extends from 7.2 to 343 would include the 30 value for ζ (as well as the 300 value for δ). This interval would be too descriptively unstable for a decision in either direction.
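
As a sketch, the descriptive-boundary decision rule amounts to a simple membership check on the interval, using the δ and ζ values set above:

```python
delta, zeta = 300, 30         # quantitative significance / insignificance boundaries
lo, hi = 7.2, 342.8           # the one-tailed interval from the example above
print(lo <= delta <= hi)      # True: a big difference cannot be ruled out
print(lo <= zeta <= hi)       # True: a trivial difference cannot be ruled out either
# both boundaries fall inside the interval -> too unstable for either decision
```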

To replace stochastic boundaries by descriptive boundaries would make things much more complex than their current state. We would have to choose not only descriptive boundaries for both δ and ζ, but also a method for rearranging the data. The “fragility” technique of unitary removals and relocations offers a method that can avoid hypothesis testing, P values, and choices of α, while producing a result analogous to a confidence interval.


Although an approach based on descriptive boundaries seems worthwhile and scientifically desirable, it departs from the established paradigms and involves many acts of judgment about which consensus would not be easily attained. Consequently, the entrenched boundaries of α — which are entrenched only because they got there first — are likely to be retained for many years in the future. You might as well get accustomed to their current hegemony, because you will have to live with it until scientific investigators decide to confront the basic descriptive challenges and create a new paradigm for the evaluations.

References

1. Feinstein, 1996.
2. Feinstein, 1990.
3. Walter, 1991.
4. Kendall, 1951, pg. 171.
5. Cochran, 1950.
6. Morrison, 1970.
7. Berry, 1993.
8. Brophy, 1995.
9. Ashby, 1993.
10. Whitehead, 1993.
11. Fisher, 1925.
12. Box, J.F., 1978.
13. Fisher, 1959.

Exercises

11.1. In a properly designed laboratory experiment, an investigator finds the following results in appropriately measured units:

Group A: 1, 12, 14, 16, 17, 17

Group B: 19, 29, 31, 33, 34, 125

The difference in mean values, X̄A = 12.8 vs. X̄B = 45.2, seems highly impressive, but the investigator is chagrined that the t test (discussed in Chapter 13) is not stochastically significant, presumably because of the particularly high variability in Group B. What relatively simple procedure (i.e., no calculations) might the investigator do to get evidence that the distinction in the two groups is stable enough to be persuasive?

11.2. In Section 1.1.3, the quantitative contrast between 8/16 and 6/18 seemed quantitatively impressive because the increment of .500 − .333 = .167 seemed reasonably large. How would you interpret the results of a unit fragility test for this comparison?

11.3. In the examples cited in this chapter, when the observed result turned out to be “nonsignificant,” the upper end of the confidence interval was examined as a possibly “significant” value. In what “nonsignificant” circumstance would you want to examine the lower end of the confidence interval as a possibly significant value?

11.4. Although you may not yet have had much pragmatic experience in testing stochastic hypotheses, you have probably had some “gut reactions” to the controversy about using one-tailed or two-tailed criteria for P values. What are those reactions and what policy would you establish if you were appointed supreme czar of stochastic testing?

11.5. These questions refer to the choice of α = .05 as the boundary for customary decisions about stochastic significance.

11.5.1. Are you content with this boundary? If so, why? If not, why not, and what replacement would you offer?

11.5.2. In what kind of circumstance would you want to change the value of α to a more “lenient” boundary, such as .1 or perhaps .2?

11.5.3. In what kind of circumstance would you want a more strict boundary, such as .01 or .001?

11.5.4. What would be the main pragmatic consequences of either raising or lowering the customary value of α? How would it affect the sample sizes and costs of research? How would it affect the credibility of the results?

11.6. About two decades ago, the editor of a prominent journal of psychology stated that he wanted to improve the scientific quality of the published research. He therefore changed the journal’s policy from using α = .05 for accepting statistical claims, and said that henceforth no research would be published unless the P values were < .01. How effective do you think this policy would be in achieving the stated goal? What kind of research do you think would be most affected by the new policy?

11.7. Here is an interesting optional exercise if you have time. In Section 11.10.1, we noted that the chance of getting two consecutive 7’s in two tosses of dice was .03, and that the chance of not getting a 7 in two tosses was .69. These two apparently opposite probabilities do not add up to 1. Why not?


12 Permutation Rearrangements: Fisher Exact and Pitman–Welch Tests

CONTENTS

12.1 Illustrative Example
12.2 Formation of Null Hypothesis
12.3 Rearrangements of Observed Population
12.4 Probability Values
12.5 Simplifying Calculations
12.5.1 Formulas for Permutations and Combinations
12.5.2 Application to the Fisher Exact Test
12.6 General Formula for Fisher Exact Test
12.6.1 Another Example of Calculations
12.6.2 Additional Considerations
12.7 Application of Fisher Test
12.7.1 Small Numbers
12.7.2 One-Tailed Tests
12.7.3 Controversy about “Fixed” or “Random” Marginals
12.8 “Confidence Intervals” from Fisher Test
12.9 Pitman–Welch Permutation Test
12.9.1 Example of a Contrast of Two Means
12.9.2 Summary of Distribution
12.9.3 Simplified Arrangement
12.9.4 “Confidence Intervals”
12.9.5 Additional Application

References

Exercises

Of the main methods that rearrange two groups of data for stochastic tests, permutation procedures are particularly easy to understand. Unlike parametric methods, no theoretical assumptions are required about hypothetical populations, parameters, pooled variances, or mathematical model distributions; and unlike jackknife or relocation methods, the process involves no arbitrary removals or displacements. Everything that happens emerges directly from the empirically observed data. Furthermore, for contrasts of two groups, a permutation procedure is not merely statistically “respectable”; it is often the “gold standard” for checking results of other stochastic procedures.

The permutation strategy relies on the idea that when two groups of observed data are compared under the null hypothesis, the data can be pooled into a single larger group. The entire larger group is then permuted into all possible rearrangements that assign the data into batches of two groups of appropriate size. The results in the rearranged groups form a distribution that can be evaluated for P values and, if desired, for a counterpart of confidence intervals.
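
As an illustration of this strategy for a contrast of two means, here is a minimal brute-force sketch (the data and the function name are illustrative, not from the text; the Pitman–Welch procedure of Section 12.9 formalizes this approach):

```python
from itertools import combinations

def permutation_p(a, b):
    """Pool the two groups, re-split the pool into every possible pair of
    groups of the original sizes, and count how often the rearranged
    difference in means is at least as large as the observed difference."""
    pooled = a + b
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        ga = [pooled[i] for i in chosen]
        gb = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diff = abs(sum(ga) / len(ga) - sum(gb) / len(gb))
        total += 1
        if diff >= observed - 1e-9:   # tolerance for floating-point ties
            hits += 1
    return hits / total               # two-tailed P value

print(permutation_p([11, 12, 14], [19, 22, 25]))  # 0.1 from C(6,3) = 20 splits
```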

The best known permutation arrangement procedure, discussed in this chapter, is called the Fisher exact probability test. It was first proposed by R. A. Fisher1 in 1934 and amplified by J. O. Irwin2 in 1935. The eponym often cited is the Fisher–Irwin test, Fisher exact test, or Fisher test, although the last term may sometimes be ambiguous because Fisher also devised so many other procedures.
