
and interval estimation.) The inferential strategies were originally developed for use with random sampling only, but were later extended for their now common application to evaluate stability in single groups of data that were not obtained as random samples.

This chapter is devoted to another type of inference, particularly common in medical research and literature today, that is also an extension of the original methods for making estimates from a single random sample. The additional inferential activity, which is called hypothesis testing, uses the same basic strategy as before, but the “parameter” being estimated is the “value” of a mathematical hypothesis. The new process involves three main steps: (1) making a particular mathematical assumption, called a null hypothesis, about a parameter for the observed results; (2) appraising what happens when the observed results are rearranged under that hypothesis; and (3) deciding whether to reject or concede the hypothesis.

The process was illustrated (without being so identified) for the one-group t tests near the end of Chapter 7. In Section 7.8.2.2, we began with the null-hypothesis assumption that the observed data came from a parent population with mean µ = 0. With that assumption, the “rearrangements” were done with theoretical repetitive sampling. Among the possible samples, the observed mean difference of 23.4 had a t-score of 1.598. The corresponding two-tailed P value was between .1 and .2 for the probability that the observed difference, or an even larger one in either direction, would emerge by chance from a parent population of sampled increments whose parametric mean was 0. After this P value was noted, however, the inferential process stopped, without a further decision. We are now ready to discuss how those decisions are made.
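As a concrete sketch, the two-tailed P value for that example can be reproduced from the summary statistics alone. The group size is not restated in this chapter, so the degrees of freedom below are a hypothetical illustration:

```python
# Reproducing the two-tailed P value for the Section 7.8.2.2 example from
# its summary statistics. The degrees of freedom (df = 19) are a
# hypothetical assumption, since the group size is not restated here.
from scipy import stats

t_score = 1.598                 # t-score for the observed mean difference of 23.4
df = 19                         # hypothetical degrees of freedom (n - 1)

p_two_tailed = 2 * stats.t.sf(t_score, df)
print(round(p_two_tailed, 2))   # ~0.13, i.e., between .1 and .2
```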

11.1 Principles of Statistical Hypotheses

In elementary geometry you probably engaged in the three-step process of forming, exploring, and deciding about hypotheses. To prove that two triangles were congruent, you formed the initial hypothesis that they were not. As the “proof” proceeded thereafter, the deductions eventually led to something impossible. Because a hypothesis that leads to an impossibility cannot be maintained, you rejected it and concluded that the triangles were congruent.

When the same type of reasoning is used in statistics, the basic strategy is similar. The initial hypothesis is stated as the opposite of what we want to prove. The “proof” occurs when an impossible consequence makes us reject the hypothesis. Unlike events in geometry, however, the things that might happen under a statistical hypothesis are never wholly impossible. There is always a chance, however tiny, that an extraordinary event might actually occur. Therefore, to reject a statistical hypothesis as incorrect or unacceptable, a boundary must be set for the level at which the chance possibility is too small to be taken seriously. The use of this rejection boundary is the main difference in the basic reasoning for evaluating mathematical hypotheses in geometry and in statistics.

The mathematical nomenclature, however, is used for a thought process that is drastically different from the often complex details and concepts of a scientific hypothesis. In science, the hypothesis is usually a specific substantive idea, such as “DNA is structured as a double helix” or “Vigorous control of elevated blood sugar will prevent vascular complications.” In statistical inference, however, the hypotheses are strictly mathematical, and the conclusions refer not to anything substantive, but to the role of random-chance probability in the numerical results. The mathematical reasoning is the same, regardless of where the data come from and regardless of what they represent. The scientific hypothesis may be brilliant or foolish; the data may be accurate or wildly wrong; the comparison may be fair or grossly biased; but the statistical hypothesis does not know or care about these distinctions as it does its purely mathematical job.

The word stochastic, introduced in Section 10.1, is also a useful name for examining the possible events that might arise by chance under a mathematical hypothesis. Stochastic hypotheses are always stated in concise, simple mathematical symbols, such as H0: µA = µB. In this set of symbols, H0 denotes the null hypothesis, and µA and µB are the hypothesized parametric means for groups A and B.


Stochastic hypotheses are commonly tested to evaluate stability of a numerical contrast for two (or more) groups. When stability was examined for only a single group in Chapters 7 and 8, the observed data could be rearranged in only a limited manner. When more than one group of data is available, however, diverse rearrangements can be constructed. For those constructions, the stochastic hypotheses are used for assumptions that can be applied both in making the rearrangements and in drawing conclusions afterward.

This chapter is concerned with the strategy of forming hypotheses and making the subsequent decisions. Chapters 12 through 15 describe the specific “tests” that produce the rearrangements and results used for the decisions.

11.2 Basic Strategies in Rearranging Data

The popular statistical procedures used to rearrange data for two groups have such well-known names as t-test, Z-test, chi-square test, and Fisher exact probability test. Other procedures, such as the Pitman-Welch test and Wilcoxon test, are less well known.

Regardless of their fame, the procedures all use the same basic strategy: forming a hypothesis, contemplating a distribution, and reaching a decision. The procedures differ in the method used to form the rearranged distributions; and the tactic chosen for the rearrangement will determine the way in which the hypothesis is stated and evaluated. Regardless of how the procedure is done, however, rejection of the hypothesis leads to the conclusion that the comparison is “statistically significant” and that the contrasted results are stable.

11.2.1 Parametric Sampling

Parametric sampling is the traditional basis of the “rearrangements” used for testing statistical hypotheses and making inferential conclusions. The sequence of events in the parametric method for contrasting two groups has many similarities to what was done in Chapters 7 and 8 for evaluating a single group.

1. Using features of the observed data, parameters are estimated for a parent population.

2. Repetitive samples are theoretically drawn from the parent population, but each “sampling” consists of two groups rather than one.

3. The anticipated results in the array of two theoretical samples are converted to the value of a single group’s “test statistic,” such as t or Z.

4. The pattern of results for the selected test statistic will have a specific mathematical distribution, from which a P value can be determined.

5. Instead of a P value, a confidence interval can be demarcated, using appropriate theoretical principles, for the location of the parameter.

6. From the P value or confidence interval, a decision is made to reject or not reject the selected hypothesis.

All of these steps occur in parametric tests of inference, whether aimed at a single central index or at a contrast of two central indexes. When two central indexes are compared, however, the procedure has a different goal; the population parameters are estimated in a different way; and a different type of conclusion is drawn. The goal is to determine whether the contrast in the two central indexes is distinctive enough to depart from the hypothesized parameter. The parameters for the theoretical parent populations are usually estimated under a “null hypothesis” that they are the same; and if the null hypothesis is rejected, we conclude that the observed indexes are stochastically different.

The mathematical reasoning is as follows: Suppose XA and XB are observed as the mean values for two groups with sizes nA and nB. We assume that each group is a random sample from corresponding parent populations having the parametric means µA and µB. Using the same respective group sizes, nA and nB, we now repeatedly take theoretical random samples from these two populations. Each pair of samples has the mean values XAj and XBj and the increment XAj − XBj. As the sampling process continues, the results form a series of increments in two sample means, {XAj − XBj}. This series of increments can be regarded as coming from the single group of a third parent population, which consists solely of increments formed by each of the items XAj − XBj. Applying the null hypothesis, we now assume that µA = µB, i.e., that the two groups come from parent populations having the same parametric mean. With the null-hypothesis assumption that µA − µB = 0, the parametric mean of the third population will be 0. The observed value of XA − XB is then regarded as a sampling variation among increments of means in samples taken from the third population, whose parametric mean is µ = 0.

For example, suppose the observed results are XA = 7 and XB = 12, with XA − XB = −5. If we drew repeated sets of two samples from a parent population and calculated the difference in means for each set, we might get a series of values such as 2, 1, 7, 4, 8, …. These values would form the sampling distribution for a difference in two means, taken from a theoretical population whose mean difference is assumed to be 0. The theoretical sampling distribution and the result of −5 that was actually observed in the two groups are then appraised for the decision about P values and confidence intervals.
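A small simulation can make this “third population” reasoning tangible. The parent parameters and group sizes below are hypothetical choices, not values from the text; the point is only that, under the null hypothesis, the increments scatter around a parametric mean of 0:

```python
# Small simulation of the "third population" of incremental means.
# All parameters here (common mean, SD, group sizes) are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 10.0, 6.0      # common parent parameters assumed under H0
n_a, n_b = 12, 12          # sizes of each pair of theoretical samples

increments = np.array([
    rng.normal(mu, sigma, n_a).mean() - rng.normal(mu, sigma, n_b).mean()
    for _ in range(10_000)
])

print(round(increments.mean(), 3))        # close to 0, the parametric mean
print(np.mean(np.abs(increments) >= 5))   # estimated two-tailed P for an
                                          # observed increment of -5
```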

The Z test and t test, which will be discussed in Chapter 13, are the two most common parametric procedures used for this purpose. The popular chi-square test, applied to binary data, will be discussed in Chapter 14.

11.2.2 Empirical Procedures

Although theoretical parametric sampling is the traditional statistical method of forming rearrangements, modern electronic computation has allowed two additional strategies to evaluate stability of a contrast. The new strategies are called empirical, because they depend only on the observed data, without invoking any theoretical populations or anticipated parameters. The two types of empirical methods are permutation (or randomization) procedures, which are discussed in the next section, and bootstrap procedures, discussed in Section 11.2.2.2.

11.2.2.1 Permutation Tests — For a permutation test of two groups, the data are first combined into a single larger group, which is pooled under the null hypothesis that the distinguishing factor (such as treatment) has the same effect in both groups. The pooled data are then permuted into all possible arrangements of pairs of samples having the same size as the original two groups. The index of contrast, such as an incremental mean, is determined for each of these paired samples. The distribution of the indexes of contrast is then examined to determine P values under the null hypothesis.

For example, consider the two groups of data {1, 2} and {3, 4}, having the respective means 1.5 and 3.5. If the two groups are pooled into a single “population,” {1, 2, 3, 4}, Table 11.1 shows the six possible permuted arrangements that divide the data into two groups, each with two members. The table also shows the mean and increment in means for each pair of samples. If one pair of samples were randomly selected from these six possibilities, the chances would be 2/6 (= .33) for getting an incremental value of 0, 1/6 for a value of +1.0, 1/6 for −1.0, and so on.

TABLE 11.1
Distribution of Incremental Means in Permutation Procedure for Two Groups of Data, {1, 2} and {3, 4}

Sample A    XA     Sample B    XB      XB − XA
1, 2        1.5    3, 4        3.5      2.0
1, 3        2.0    2, 4        3.0      1.0
1, 4        2.5    2, 3        2.5      0
2, 3        2.5    1, 4        2.5      0
2, 4        3.0    1, 3        2.0     −1.0
3, 4        3.5    1, 2        1.5     −2.0

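The entire tabulation in Table 11.1 can be generated by enumerating the splits of the pooled data; a minimal sketch:

```python
# Enumeration behind Table 11.1: all splits of the pooled data {1, 2, 3, 4}
# into two groups of two members, with the incremental means.
from itertools import combinations

pooled = [1, 2, 3, 4]
mean = lambda g: sum(g) / len(g)

for a in combinations(pooled, 2):
    b = tuple(x for x in pooled if x not in a)   # the complementary sample
    print(a, mean(a), b, mean(b), mean(b) - mean(a))
```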

Permutation tests are also called randomization tests, for two reasons. First, the subsequent P values denote probabilities for random occurrence of any of the permuted arrangements. Second, large group sizes can produce too many possible permutations for display and examination of the complete tabulation, as illustrated in Table 11.1. For example, about 1.38 × 10^11 permuted arrangements can be formed from two groups that each contain 20 members. The appraisal of the full display can be dramatically eased, however, with some of the condensation tactics discussed in Chapter 12.

If those condensations cannot be applied, a different “shortcut” approach is to form permutations as a resampling exercise, by assigning members of the pooled array of data randomly (without replacement) to each of the two compared groups. As each new pair of samples is generated, the corresponding index of contrast is noted and added to the distribution of those indexes. The distribution can then be examined for P values (or other attributes) after a suitable number of pairs of samples has been randomly generated. The total number of pairs of samples can be quite large, ranging from 1000 to 10,000 or more. Nevertheless, for two groups containing 20 members each, the distribution of indexes of contrast will be easier to obtain from 10,000 samplings than from the total of 1.38 × 10^11 possible arrangements.

The term Monte Carlo sampling can be applied for this approach, which checks a large but truncated series of samples, rather than all possibilities. Monte Carlo sampling relies on applying random choices to an underlying model. In this instance, the underlying model forms a permuted arrangement of the available data, rather than a sampling-with-replacement procedure, which is used for bootstrap tests.
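A minimal sketch of the Monte Carlo permutation approach, using hypothetical data for the two groups (only the procedure itself comes from the text):

```python
# Monte Carlo version of a permutation test: instead of enumerating all
# arrangements, members of the pooled array are repeatedly re-assigned
# at random, without replacement. The two groups here are hypothetical.
import math
import numpy as np

# math.comb(40, 20) = 137,846,528,820, the ~1.38 x 10^11 arrangements
# cited above for two groups of 20 members each.
print(math.comb(40, 20))

rng = np.random.default_rng(0)
group_a = np.array([68, 72, 75, 79, 81])   # hypothetical data
group_b = np.array([62, 64, 70, 71, 73])
observed = group_a.mean() - group_b.mean()

pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)

extreme = 0
n_resamples = 10_000
for _ in range(n_resamples):
    rng.shuffle(pooled)                    # random re-assignment
    diff = pooled[:n_a].mean() - pooled[n_a:].mean()
    if abs(diff) >= abs(observed):
        extreme += 1

print(extreme / n_resamples)               # Monte Carlo two-tailed P value
```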

The empirical permutation methods for contrasting two groups have existed for many years, long before modern computation became available; and the methods have many names. They are sometimes called non-parametric, because no parameters are estimated, or distribution-free, because the resampled results generate their own ad hoc distribution, without involving a theoretical mathematical model. The best known empirical procedure is a permutation technique, discussed in Chapter 12, that is eponymically called the Fisher exact probability test or Fisher exact test. Other reasonably well-known empirical procedures, more commonly called non-parametric rank tests, are the Wilcoxon signed-ranks test and the Mann-Whitney U test; they are discussed in Chapter 15.

11.2.2.2 Bootstrap Tests — In a permutation “resampling,” the existing members of data are rearranged to form distinctive combinations. In a bootstrap resampling, the observed data are used as a “parent population,” from which random sampling is done, with each individual member being replaced after its random selection. The permutation procedure requires a pooling of two (or more) groups of data, which can then be rearranged appropriately. The bootstrap resampling, however, can be done with a single group of data, as shown earlier in Section 6.4.1.1. Thus, the group of data {1, 6, 9} could form 27 possible resampled groups ranging from {1, 1, 1} to {9, 9, 9}, as shown in Table 6.1.

Bootstrap resampling is seldom used to compare two groups of data, but can be employed in two different ways to construct confidence intervals or P values for the index of contrast.

11.2.2.2.1 Confidence Interval. For a confidence interval, each group is maintained separately; and a resampling is done, with replacement, within the group. The results of such a resampling for the previous two groups {1, 2} and {3, 4} are shown in Table 11.2.

Each group can produce four bootstrapped samples with corresponding means, shown in the upper part of the table. Each of the four samples for one group can be matched by one of four samples from the second group, and the 16 possible increments in means are shown in the lower half of the table. The distribution of the 16 increments shows values of 1.0 and 3.0 each occurring once, 1.5 and 2.5 each occurring four times, and 2.0 occurring six times. The range spanned from 1.5 to 2.5 would include 14 or 87.5% of the 16 possible values. Thus, the observed increment of 2 would be surrounded by an 87.5% confidence interval that extends from 1.5 to 2.5.



TABLE 11.2
Bootstrap Procedure to Form Confidence Interval for Incremental Means of Two Groups, {1, 2} vs. {3, 4}

Bootstrapped Samples

    Group {1, 2}             Group {3, 4}
Contents      Mean       Contents      Mean
1, 1          1.0        3, 3          3.0
1, 2          1.5        3, 4          3.5
2, 1          1.5        4, 3          3.5
2, 2          2.0        4, 4          4.0

Incremental Means in 16 Possible Bootstrapped Samples

Mean in             Mean in Sample {3, 4}
Sample {1, 2}     3.0     3.5     3.5     4.0
1.0               2.0     2.5     2.5     3.0
1.5               1.5     2.0     2.0     2.5
1.5               1.5     2.0     2.0     2.5
2.0               1.0     1.5     1.5     2.0

11.2.2.2.2 P Value. For a P value, the two groups are pooled, and pairs of samples, containing two members each, are formed, with replacement. As 4 possible choices can be made each time, a total of 16 (= 4 × 4) samples can be obtained for each group, and the increment of means in the two samples can be formed in 16 × 16 = 256 ways. The distribution of increments can extend from 0, when the two samples have the same means, to a peak value of 3, when the compared samples are {1, 1} vs. {4, 4}.

Neither of the two bootstrap methods is regularly used for contrasts of two groups, although the methods can often be applied for other stochastic challenges.
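The within-group bootstrap enumeration behind Table 11.2, including the 87.5% interval, can be verified with a short sketch:

```python
# Enumeration behind Table 11.2: all within-group bootstrap samples of
# {1, 2} and {3, 4}, the 16 incremental means, and the 87.5% interval.
from itertools import product

def bootstrap_means(group):
    # Means of every sample drawn with replacement, same size as the group.
    return [sum(s) / len(s) for s in product(group, repeat=len(group))]

means_a = bootstrap_means([1, 2])   # 1.0, 1.5, 1.5, 2.0
means_b = bootstrap_means([3, 4])   # 3.0, 3.5, 3.5, 4.0

increments = sorted(mb - ma for ma, mb in product(means_a, means_b))
print(increments)                   # 1.0 and 3.0 once each; 2.0 six times

inside = [d for d in increments if 1.5 <= d <= 2.5]
print(len(inside) / len(increments))   # 0.875: the 87.5% interval, 1.5 to 2.5
```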

11.2.3 Relocation Procedures

A different new strategy, which depends on “relocations” rather than resamplings, is analogous to the jackknife procedure. The jackknife itself is seldom used in “elementary” statistics but is often applied in multivariable statistics, as discussed elsewhere,1 for getting or checking the values of the estimated parameters. The tactic about to be discussed now is a type of simultaneous jackknife maneuver for two groups. With only one group available in Chapter 7, the jackknife tactic could do no more than remove members from the group. With two groups available, members can be exchanged or otherwise relocated.

11.2.3.1 Unit Fragility Test — The relocations create a series of altered groups, analogous to the altered series produced by the jackknife removals in Section 7.7.3. For these altered groups, the most interesting stochastic approach is to compare the central indexes descriptively. Instead of examining distributions to determine P values and confidence intervals, we check to see whether the differences (or other distinctions) in the compared indexes exceed the boundaries selected for quantitative zones of significance or insignificance.

For example, suppose the proportions of success are being compared as pA − pB for two treatments, where A is expected to be better than B. Suppose the value of δ ≥ .15 is set as the boundary for an increment that is quantitatively significant, and that ζ ≤ .04 is demarcated as the boundary for quantitatively insignificant increments. With these boundaries, the compared result will be deemed quantitatively significant if pA − pB is ≥ .15, insignificant if pA − pB is ≤ .04, and inconclusive in the intermediate zone where .04 < (pA − pB) < .15.

To illustrate the relocation process, suppose the results of a clinical trial show pA = 10/20 = .500 and pB = 6/18 = .333. The observed increment of .500 − .333 = .167 would be regarded as quantitatively significant because it exceeds δ = .15. If one member of the numerator group in the larger pA were moved from A to B, however, the result would become pA = 9/20 = .450 and pB = 7/18 = .389. The increment of .450 − .389 = .061 would no longer be quantitatively significant. We might therefore decide that the originally observed “significant” result is not statistically stable, because the boundary of quantitative significance would no longer be exceeded if one person were relocated.

The relocation strategy, called the unit fragility procedure, has been proposed2 for evaluating the change that might occur in two proportions, pA and pB, if a single unit is moved from one numerator to the other. Thus, if pA = rA/nA and pB = rB/nB, the new proportions might be pA′ = (rA + 1)/nA and pB′ = (rB − 1)/nB. If the move went in the other direction, the new proportions would be pA′ = (rA − 1)/nA and pB′ = (rB + 1)/nB. With the first change, the comparison of 10/20 = .500 vs. 6/18 = .333 would become 11/20 = .550 vs. 5/18 = .278, an increment of .272. With the second change, the results would become 9/20 = .450 vs. 7/18 = .389, an increment of .061.

The changes could be evaluated either intrinsically or against an extrinsic standard, such as δ = .15 (or ζ = .04). For the intrinsic evaluation, the increment between the two proportions in one instance would rise by .105 [= .272 − .167], and in the other instance, the increment would fall by .106 [= .167 − .061]. The absolute amount of change in the increment is the same (except for rounding) in either direction, whether the shifted unit makes the value of rA become rA + 1 or rA − 1.

The amount of change, called the unit fragility, can be expressed with the formula f = N/(n1n2), where N = n1 + n2. This result for 10/20 vs. 6/18 is 38/(20 × 18) = .106. Thus, with the unit fragility procedure, the observed increment of .167 would rise or fall by a value of .106. In reference to the intrinsic value of the original increment of .167, the index of proportionate fragility is relatively high, at .106/.167 = .63. In reference to an extrinsic boundary, the reduction of .106 would make the original increment become .061, which is no longer quantitatively significant. With either type of boundary, we could conclude that the observed contrast of 10/20 vs. 6/18 is not stable.
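A minimal sketch of the unit fragility calculation for the example above:

```python
# Unit fragility check for pA = 10/20 vs. pB = 6/18.
def unit_fragility(r_a, n_a, r_b, n_b):
    """Return the observed increment pA - pB and the fragility f = N/(n1*n2)."""
    observed = r_a / n_a - r_b / n_b
    f = (n_a + n_b) / (n_a * n_b)
    return observed, f

observed, f = unit_fragility(10, 20, 6, 18)
print(round(observed, 3))        # 0.167, the observed increment
print(round(f, 3))               # 0.106 = 38/(20 x 18), the unit fragility
print(round(f / observed, 2))    # 0.63, the index of proportionate fragility
print(round(observed - f, 3))    # 0.061: below delta = .15, so not stable
```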

Perhaps the most striking feature of the unit fragility procedure is the appraisal of “statistical significance” in a purely descriptive manner, without recourse to probabilities. When quantitative boundaries are established for “significant” and “insignificant” differences, the fragility (an inverse of “stability”) for the observed result can be determined by whether it crosses those boundaries after a potential unitary change.

11.2.3.2 Application in Mental Screening — The idea of checking stability without resorting to probability is a striking departure from a century of the now traditional paradigms of “statistical inference.” Although the new approach has received serious discussion,2,3 many years will probably elapse before its value becomes recognized by investigators or accepted by statisticians. Regardless of the ultimate fate of the unit fragility tactic, it offers an excellent method for doing a prompt “in-the-head-without-a-calculator” appraisal of the observed results. With this type of “mental screening,” the analyst evaluates the data from the counterpart of a simple “physical examination,” before doing any calculations as a “laboratory work-up.”

This type of screening was done, without being called “unit fragility,” when decisions were made earlier in Section 1.1.3 that the comparison of .500 vs. .333 was unstable as 1/2 vs. 1/3 and stable as 150/300 vs. 100/300. In the first instance, a unit fragility shift could reverse the direction of the increment from .500 − .333 = +.167 to (0/2) − (2/3) = −.667. In the second instance, a one-unit shift would make the altered increment become (149/300) − (101/300) = +.160, which hardly changes the original incremental result of .167. For small numbers, such as 1/2 vs. 1/3, the comparison is easily done with mental rearrangements that do not require a calculator.

Exercise 11.1 offers an opportunity to try this type of “mental screening” for a set of dimensional data.

11.3 Formation of Stochastic Hypotheses

The fragility procedure creates a new type of “inference-free” statistics that may not become popular for many years; and the standard, customary inferential procedures will be individually discussed in Chapters 12 through 15. The rest of this chapter is therefore devoted to the traditional inferential reasoning that occurs when statistical hypotheses lead to conventional decisions about stochastic significance. The basic principles used for a contrast of two groups are also applicable to most other types of stochastic contrast.

In the traditional reasoning, hypotheses are established for tests that answer the question, “What if?”; and the answers always depend on stochastic probabilities found in theoretical or empirical random sampling from a distribution. The hypotheses will differ according to the type of question being asked in each test, the type of answer that is desired, and the procedure selected to explore the questions and answers. Nevertheless, certain basic concepts are fundamental to all the procedures, regardless of how they are done.

11.4 Statement of Hypothesis

As the opposite of what we want to prove, the null hypothesis is set up for the goal of being rejected, i.e., declared nullified. With the usual aim of confirming that the observed difference is big, important, or “significant,” the opposite null hypothesis is customarily set at the value of 0. In parametric procedures, the statement for means would be H0: µA − µB = 0 (which is H0: µA = µB), and for proportions, H0: πA = πB.

In other instances, to be discussed in Section 11.8 and Chapters 23 and 24, the investigator wants to confirm that an observed difference is small, unimportant, or “insignificant.” For this purpose, the stochastic hypothesis is set at a nonzero large value, such as δ, with a parametric statement such as H0: µA − µB ≥ δ. In the rest of this chapter (and in Chapters 12 through 15) the null hypotheses are all set essentially at 0, but the distinction in nomenclature should be kept in mind to avoid confusion later. The null hypothesis is called null because it is being evaluated for rejection, not because its value is 0.

11.5 Direction of the Counter-Hypothesis

The counter-hypothesis represents the contention to be supported or the conclusion to be drawn when the null hypothesis is rejected. In the usual logic of the mathematical arrangement, the statistical counter-hypothesis represents the investigator’s goal in doing the research. Thus, if the aim of a clinical trial is to show that Treatment A is better than Treatment B, the investigator’s goal is A > B. When the statistical null hypothesis is stated as A = B, the original goal of A > B becomes the counter-hypothesis.

A prime source of the one-tail vs. two-tail dispute in interpreting probabilities is the direction of the counter-hypothesis. Suppose the null hypothesis is stated as A − B = C. If the hypothesis is rejected, the conclusion, which is A − B ≠ C, states an inequality, but not a direction. It does not indicate whether A − B is > C or < C.

For example, if the research shows that XA − XB = 5, do we want to support the idea that XA is at least 5 units larger than XB? If so, the counter-hypothesis is XA − XB ≥ 5. Do we also, however, want to support the possibility that XB might have been at least 5 units larger than XA? For this bidirectional decision, the counter-hypothesis is |XA − XB| ≥ 5.

The choice of a uni- or bidirectional counter-hypothesis is a fundamental scientific issue in planning the research and interpreting the results. The issue was briefly discussed in Chapter 6, and will be reconsidered in Section 11.8.

11.6 Focus of Stochastic Decision

The focus of the stochastic decision can be a P value, a confidence interval, or both.


11.6.1 P Values

The customary null hypothesis makes an assumption about “equivalence” for the two compared groups. If searching for a P value, we determine an external probability for the possibility that the observed difference (or an even larger one) would occur by stochastic chance if the hypothesis about equivalence is correct.

11.6.1.1 Parametric Tests — In parametric testing, the concept of equivalence refers to parameters. If the two compared groups have mean values XA and XB, we assume that the groups are random samples from parent populations having the identical parameters µA = µB.

11.6.1.2 Empirical Tests — In empirical procedures, if no parameters are involved or estimated, the hypothesis of equivalence refers to the treatments, risk factors, or whatever distinguishing features are being compared in the groups. To contrast success rates for rearranged groups receiving either Treatment A or Treatment B, we can assume that the two treatments are actually equivalent. The scientific symbols for this stochastic hypothesis are TA = TB.

11.6.2 Confidence Intervals

With either parametric or bootstrap procedures for confidence intervals, we examine the array of possible results that would occur with rearrangements of the observed data. The methods of forming these rearrangements will depend on whether the stochastic hypothesis is set at the null value (of equivalence for the two groups) or at an alternative value, discussed in Section 11.9, which assumes that the two groups are different.

When the rearranged results are examined, the decision about an acceptable boundary can be made according to intrinsic or extrinsic criteria. In one approach, using intrinsic criteria, the main issue is whether the hypothesized parameter is contained internally within the confidence interval. Thus, in the conventional parametric approach, with the null hypothesis that ∆ = 0, the hypothesis is rejected if the value of 0 is excluded from the estimated confidence interval. For example, if the observed difference is XB − XA = 7, we would check stochastically to see whether the hypothesized parametric value of 0 is contained in the confidence interval constructed around 7.

The second approach depends on an extrinsic descriptive boundary that indicates how large (or small) the observed difference might really be. For this type of decision, regardless of whether 0 is included in the confidence interval, we might want to check that the interval excludes a value as large as 20.

11.6.2.1 Parametric Tests — With parametric testing, a confidence interval is constructed around the observed distinction, which is usually an increment such as XA − XB. If the two groups are stochastically different, the null-hypothesis parametric value of µA − µB = ∆ = 0 will be excluded from this interval. The ultimate decision may therefore rest on four constituents: (1) the selected level of confidence, (2) the magnitude of the zone formed by the confidence interval, (3) inclusion (or exclusion) of the “true” parametric value of 0 in that zone, and (4) inclusion (or exclusion) of any “undesirable” or inappropriate values.

For example, suppose two means have an incremental difference of 175 units, and suppose we find that the 95% confidence interval for this difference is constructed as 175 ± 174, thus extending from 1 to 349 units. Since the parametric value of 0 is not included in this zone, we might reject the null hypothesis and conclude that the two means are stochastically different. On the other hand, because their true difference might be anywhere from 1 to 349 units, we might not feel secure that the observed difference of 175 is a precise or stable value, despite the “95% confidence.”
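The four constituents can be checked mechanically. In the sketch below, the standard error is back-calculated (as a hypothetical value) so that the 95% interval reproduces 175 ± 174:

```python
# Sketch of the four-constituent appraisal for the example above.
# The standard error (~88.8) is a hypothetical back-calculation,
# chosen so that the 95% interval reproduces 175 +/- 174.
z_alpha = 1.96                     # selected level of confidence: 95%
diff = 175
se = 174 / z_alpha                 # hypothetical standard error

lower = diff - z_alpha * se        # 1.0
upper = diff + z_alpha * se        # 349.0
print(round(lower, 1), round(upper, 1))

print(not (lower <= 0 <= upper))   # True: 0 is excluded, so H0 is rejected
# But the zone is wide: a true difference as small as 1 unit, or as
# large as 349 units, remains compatible with the observed 175.
print(round(upper - lower, 1))     # 348.0: the magnitude of the zone
```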

11.6.2.2 Empirical Procedures — With empirical procedures, we inspect results for the indexes of contrast in the series of resamples, as discussed earlier (and later in Chapter 12). A series of permuted samples can be prepared under the “null hypothesis,” but a specific confidence interval is not constructed around the observed difference. Consequently, the main goal in inspecting the permuted results is to note the range of possible values. This range can be expressed in the counterpart of a zone of percentiles. Thus, if two groups have XB − XA = 7 as the difference in means, a permutation procedure might show that 95% of the possible differences extend from −8 to +36. With a bootstrapping procedure, however, the confidence intervals show the array of results that can occur around the observed difference.

11.6.3 Relocation Procedures

Relocation procedures do not use the customary forms of statistical inference. The strategy is stochastic because it answers a question about what might happen, but no mathematical hypotheses are established, and no probability values are determined. The decision depends on potential changes in the observed results. For the “unit fragility” procedure discussed in Section 11.2.3.1, these changes were evaluated with a form of reasoning analogous to confidence intervals.

11.7 Focus of Rejection

Because the main goal of stochastic hypothesis testing is to determine whether the hypothesis should be rejected, a focus must be established for the rejection.

11.7.1 P Values

For P values, an α level is chosen in advance as the critical boundary, and the hypothesis is rejected if the test procedure produces P ≤ α. (As noted later in Section 11.12, the value of α is often set at .05.) If P > α, the hypothesis is conceded but not actually accepted.

The reason for the latter distinction is that rejecting the null hypothesis of equivalence allows the stochastic conclusion that the parameters or treatments are different, but much stronger evidence is needed to accept the null hypothesis and conclude that they are essentially equivalent. The absence of proof of a difference is not the same as proof that a difference is absent. For example, if two treatments have success rates of .25 vs. .40, coming from 1/4 vs. 2/5, we cannot reject the stochastic null hypothesis. Nevertheless, we could not be confident that the two treatments are actually equivalent.
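For readers who want to verify the .25 vs. .40 example, a Fisher exact test (discussed in Chapter 12) can be applied to the underlying fourfold table; this calculation is an illustrative check, not part of the original text:

```python
# Illustrative check of the .25 vs. .40 example (1/4 vs. 2/5 successes),
# using a Fisher exact test on the fourfold table.
from scipy.stats import fisher_exact

table = [[1, 3],   # Treatment A: 1 success, 3 failures
         [2, 3]]   # Treatment B: 2 successes, 3 failures
oddsratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)     # ~1.0: the null hypothesis cannot be rejected, but this
                   # is not proof that the treatments are equivalent
```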

The reasoning for a stochastic hypothesis thus has three possible conclusions: rejected, conceded, and accepted. These three categories are analogous to the verdicts available to Scottish juries: guilty, not proven, and not guilty. Accordingly, we can reject or concede a null hypothesis, but it is not accepted without further testing. The stochastic procedures needed to confirm “no difference” will be discussed later in Chapters 23 and 24.

11.7.2 Confidence Intervals

A confidence interval is usually calculated with a selected test statistic, such as the Zα or tν,α discussed in Section 7.5.4, that establishes a 1 − α zone for the boundaries of the interval. This zone can be evaluated for several foci. The first is whether the anticipated (null-hypothesis) parameter lies inside or outside the zone. If the parametric value of µ is not contained in the zone of the interval, we can conclude that P ≤ α and can reject the null hypothesis. The result of this tactic is thus an exact counterpart of the reasoning used for α and P values.

Two more foci of evaluation are the upper and lower boundaries of the interval itself. Do these boundaries include or exclude any critical descriptive characteristics of the data? For example, suppose the increment of 175 units in two means has a confidence interval of 175 ± 200, extending from −25 to 375. Because 0 is included in this interval, we cannot reject the hypothesis that the two groups are parametrically similar. On the other hand, because of the high upper boundary of the confidence interval, we could also not reject an alternative hypothesis that the two groups really differ by as much as 350 units. Conversely, as discussed earlier, if the confidence interval is 175 ± 174 and goes from 1 to 349, it excludes 0. The null hypothesis could be rejected with the conclusion that the two groups are “significantly” different. Nevertheless, the true parametric difference might actually be as little as 1 unit.

This double role of confidence intervals—offering an inferential estimate for both a parameter and descriptive boundaries—has elicited enthusiastic recommendations in recent years that the P value strategy be replaced by confidence intervals. Some of the arguments for and against the abandonment of P values will be discussed later in Section 11.12 and again in Chapter 13.

11.7.3 Relocation Procedures

For P values and confidence intervals, the rejection of the stochastic hypothesis will depend on the magnitudes selected either for α or for the corresponding Zα or tν,α. Relocation decisions, however, depend on the descriptive boundaries set quantitatively for the large “significant” δ or the small “insignificant” ζ. The quantitative boundaries, which have received almost no attention during all the stochastic emphases, are crucial for decisions with relocation procedures, but are also needed both to evaluate the extreme ends of confidence intervals and, as discussed later, to calculate sample sizes or to establish alternative stochastic hypotheses.

11.8 Effect of One- or Two-Tailed Directions

The choice of a one-tailed or two-tailed direction for the counter-hypothesis determines how to interpret an observed P value, or how to choose the level of α used in forming a confidence interval.

11.8.1 Construction of One-Tailed Confidence Intervals

Just as a two-tailed P value of .08 becomes .04 in a one-tailed interpretation, the chosen level of α for a 1 − α confidence interval really becomes α/2 if we examine only one side of the interval. To illustrate this point, suppose Zα is set at Z.05 for a 95% confidence interval calculated as X ± Z.05(s/√n). The upper half of the interval includes .475 of the distribution of means that are potentially larger than X, and the lower half of the interval includes the other .475 of the distribution, formed by means that are potentially smaller. If we are interested only in the upper boundary and ignore the lower one, however, the lower half of the interval really includes .50 of the distribution, i.e., all of the potentially smaller values. The confidence interval would be larger than the stated level of .95, because it would really cover .975 = 1 − .025 = 1 − (α/2) of the potential values.

In the first example of Section 11.7.2, suppose we wanted the 175 unit increment in two means to be definitely compatible with a parametric value of 350. The lower half of a two-tailed 95% confidence interval would include only a .475 proportion of the values that are potentially smaller than 175. If we are not interested in any of the smaller values, however, and want to know only about the larger ones, we would dismiss all .50, not just .475, of the potential values that are smaller than 175.

Accordingly, if we want a decision level of α for a strictly one-tailed confidence interval, examining only one boundary or the other but not both, the originally chosen α should be 2α, which will become α when halved. Therefore, for a strictly one-tailed confidence interval at the .05 level of α, the calculation would be done with Z.1 = 1.645, rather than Z.05 = 1.96.

The ± sign in the customary calculation can be confusing for the idea of a one-tailed confidence interval. The lower and upper values produced by the construction of central index ± [Zα (standard error)] will regularly seem strange for a decision that allegedly goes in only one direction. In proper usage, however, a one-tailed confidence interval should be constructed with a + or − sign, but not both. Thus, if we want to reject the null hypothesis for a positive result, such as d, the one-tailed confidence interval is calculated as d − [Zα (standard error)] and then checked to see if it excludes 0. If we want to consider the alternative possibility that d is really much larger, the one-tailed interval would be d + [Zα (standard error)], which would be checked to see if it excludes the larger value.
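A brief sketch of the one- vs. two-tailed construction, using a hypothetical increment and standard error:

```python
# One- vs. two-tailed interval construction at alpha = .05. The observed
# increment d and its standard error are hypothetical illustrations.
from scipy.stats import norm

d, se = 10.0, 4.0                 # hypothetical increment and standard error
alpha = 0.05

z_two = norm.ppf(1 - alpha / 2)   # 1.960: for the two-tailed interval
z_one = norm.ppf(1 - alpha)       # 1.645: for the strictly one-tailed interval

print(d - z_two * se, d + z_two * se)   # two-tailed 95% interval: 2.16 to 17.84
print(d - z_one * se)                   # one-tailed lower boundary: 3.42
# If the one-tailed boundary d - 1.645(SE) excludes 0, the null hypothesis
# is rejected at a one-tailed alpha of .05.
```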
