TABLE 15.6
Initial and Subsequent Levels of Serum Cholesterol

Patient | (B) Initial | (A) Subsequent | (B – A)   |
Number  | Level       | Level          | Increment | Rank
--------|-------------|----------------|-----------|-----
1       | 209         | 183            | 26        | 4
2       | 276         | 242            | 34        | 5.5
3       | 223         | 235            | −12       | 3
4       | 219         | 185            | 34        | 5.5
5       | 241         | 235            | 6         | 1
6       | 236         | 247            | −11       | 2
i.e., regardless of negative or positive signs, we take the sum of ranks for all the positive increments and compare it stochastically with the sum of ranks for all the negative increments. Under the null hypothesis, we would expect these two sums to be about the same. If their observed difference is an uncommon enough event stochastically, we would reject the null hypothesis.
15.6.2 Finding the Ranks of Increments
The first step is to examine the increments, B – A, in the observed paired values for each patient. (We expect most of these increments to be positive if the treatment is effective in lowering cholesterol.)
The next step is to rank the differences in order of magnitude, without regard to sign, from smallest to largest. In Table 15.6, the smallest increment, 6, is ranked as 1. The next largest increment, −11, is ranked as 2 (even though −11 is arithmetically less than 6). The next increment, −12, is ranked 3; and 26 is ranked 4.
A problem arises when two (or more) increments are tied for the same rank. This problem is solved, as discussed in Section 15.5.2, by giving each increment the rank that is the average value of the ranks for the tied set. Thus, in the foregoing table, the incremental values of 34 and 34 are tied for the 5th rank. Since these two numbers would occupy the 5th and 6th ranks, they are each ranked as 5.5. [If three increments were tied for the 5th rank, they would each be ranked as 6, since the average rank of the three would be (5 + 7)/2 = 6. Note that the next increment after those three ties would be ranked as 8, not as 6 or 7. If you forget this distinction and rank the 8th increment as 6 or 7, the results will be wrong.]
15.6.3 Finding the Sum of Signed Ranks
The next step is to find the sum of all the ranks associated with negative and positive increments. In Table 15.6, the two negative increments (−11 and −12) have a rank sum of 2 + 3 = 5. The four positive increments have a rank sum of 1 + 4 + 5.5 + 5.5 = 16.
We could have determined the value of 16 either by adding the ranks for the four latter increments, or by recalling, from the formula (n)(n + 1)/2, that the total sum of ranks should be (6)(7)/2 = 21. Thus, if the negative increments have a rank sum of 5, the positive increments must be 21 − 5 = 16. The smaller value for the sum of the two sets of ranks is a stochastic index designated as T. In this instance, T = 5.
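The ranking-and-summing procedure can be sketched in a few lines of Python. The data are the six pairs from Table 15.6; the tie-averaging loop is an illustrative implementation, not code from the text:

```python
# Paired (initial B, subsequent A) cholesterol values from Table 15.6
pairs = [(209, 183), (276, 242), (223, 235), (219, 185), (241, 235), (236, 247)]
increments = [b - a for b, a in pairs]            # [26, 34, -12, 34, 6, -11]

# Rank the absolute increments, giving tied values the average of their ranks
order = sorted(range(len(increments)), key=lambda i: abs(increments[i]))
ranks = [0.0] * len(increments)
pos = 0
while pos < len(order):
    tied = [j for j in order if abs(increments[j]) == abs(increments[order[pos]])]
    avg = (2 * pos + len(tied) + 1) / 2           # mean of positions pos+1 .. pos+len(tied)
    for j in tied:
        ranks[j] = avg
    pos += len(tied)

neg_sum = sum(r for r, d in zip(ranks, increments) if d < 0)   # 2 + 3 = 5
pos_sum = sum(r for r, d in zip(ranks, increments) if d > 0)   # 1 + 4 + 5.5 + 5.5 = 16
T = min(neg_sum, pos_sum)
print(T)    # 5.0
```

The two rank sums add to 21, the total (n)(n + 1)/2 for n = 6, so either sum determines the other.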
15.6.4Sampling Distribution of T
The values of T have a sampling distribution for different values of n. Under the null hypothesis, anticipating that about half the signed ranks will be positive and half will be negative, we would expect T to be half of (n)(n + 1)/2, which in this instance, would be 21/2 = 10.5. The smaller the value of T, the greater will be its departure from this expectation, and the less likely is its distinction to arise by chance.
© 2002 by Chapman & Hall/CRC
15.6.4.1 Total Number of Possible Values for T — As with any other complete resampling distribution, we first need to know the number of possible values that can occur for T. This result will be the denominator for all subsequent determinations of relative frequencies.
In a contrast of two groups, having n1 and n2 members, the number of possible arrangements was N!/(n1!)(n2!), where N = n1 + n2. This approach will not work here because the arrangements involve one group, not two, and also because the permuted ranks can be assembled in various groupings. The six “signed” ranks (in the example here) may or may not all have positive values; and the positive values can occur in none, 1, 2, 3, 4, 5, or all 6 of the signed ranks.
When each possibility is considered, there is only one way in which all 6 ranks can be negative and one way in which all 6 are positive. There are 6 ways in which only one of the ranks is positive, and 6 ways in which five of the ranks are negative (i.e., one is positive). There are 15 ways (= 6 × 5/2) in which two ranks are positive, and 15 in which two are negative (i.e., four are positive). Finally, there are 20 ways in which three ranks will be positive, with the other three being negative. (The 20 is calculated as 6!/[(3!)(3!)]). Thus, there are 1 + 1 + 6 + 6 + 15 + 15 + 20 = 64 possible values for the number of positive rank combinations when n = 6. A general formula for this process is Σ (from r = 0 to n) n!/[(n − r)!(r!)] = 2ⁿ. For n = 6,
64 = 6!/(6!0!) + 6!/(5!1!) + 6!/(4!2!) + 6!/(3!3!) + 6!/(2!4!) + 6!/(1!5!) + 6!/(0!6!)
   = 1 + 6 + 15 + 20 + 15 + 6 + 1
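As a check on this arithmetic, the sum of binomial coefficients can be computed directly; a minimal sketch:

```python
from math import comb

# Sum of binomial coefficients C(n, r) over r = 0..n equals 2**n,
# the number of possible positive/negative sign assignments for n ranks
n = 6
total = sum(comb(n, r) for r in range(n + 1))
print(total)    # 64
```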
15.6.4.2 Frequency of Individual Values of T — Under the null hypothesis, for n = 6, we expect T to have a value of 21/2 = 10.5. Therefore, if T is the smaller sum of signed ranks, we need to determine only the ways of getting values of T ≤ 10. Each of the possibilities in Table 15.7 shows the arrangement that will yield the cited value of T for a sum of six ranks. The number of occurrences is then divided by 64 (the total number of possibilities) to show the relative and cumulative frequency for each occurrence.
TABLE 15.7
Values of T for Possible Arrangements of Positive Values of Six Signed Ranks

Value of T | Identity of Positive Ranks That Are         | Number of   | Relative    | Cumulative Relative
           | Components for This Value of T              | Occurrences | Frequency   | Frequency
-----------|---------------------------------------------|-------------|-------------|--------------------
0          | None                                        | 1           | 1/64 = .016 | .016
1          | {1}                                         | 1           | .016        | .031
2          | {2}                                         | 1           | .016        | .047
3          | {3}, {1,2}                                  | 2           | .031        | .078
4          | {4}, {1,3}                                  | 2           | .031        | .109
5          | {5}, {1,4}, {2,3}                           | 3           | .047        | .156
6          | {6}, {1,5}, {2,4}, {1,2,3}                  | 4           | .062        | .219
7          | {1,6}, {2,5}, {3,4}, {1,2,4}                | 4           | .062        | .281
8          | {2,6}, {3,5}, {1,2,5}, {1,3,4}              | 4           | .062        | .344
9          | {3,6}, {4,5}, {1,2,6}, {1,3,5}, {2,3,4}     | 5           | .078        | .422
10         | {4,6}, {1,3,6}, {1,4,5}, {2,3,5}, {1,2,3,4} | 5           | .078        | .500
11         | {5,6}, {1,4,6}, {2,3,6}, {2,4,5}, {1,2,3,5} | 5           | .078        | .578
Table 15.7 need not ordinarily be extended beyond T = 10, because if T becomes 11, we have begun to look at the other, i.e., negative, side of the summed ranks. When the absolute values are summed for the negative ranks, their result will be T = 10. For example, the 5 groupings that produce T = 11 in Table 15.7 each contain the complement ranks omitted from the counterpart groupings when T = 10. (The “complement” of {4, 6} is {1, 2, 3, 5}. The “complement” of T = 0 is {1, 2, 3, 4, 5, 6}, for which T = 21.)
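The entire sampling distribution in Table 15.7 can be generated by brute-force enumeration of all 64 subsets of positive ranks; a short sketch:

```python
from itertools import combinations

# Enumerate the 2**6 = 64 possible sets of positive ranks among {1,...,6}
# and tally how often each positive-rank sum T occurs (as in Table 15.7)
n = 6
counts = {t: 0 for t in range(n * (n + 1) // 2 + 1)}
for size in range(n + 1):
    for subset in combinations(range(1, n + 1), size):
        counts[sum(subset)] += 1

total = sum(counts.values())                       # 64
cumulative = sum(counts[t] for t in range(11)) / total
print([counts[t] for t in range(11)])   # [1, 1, 1, 2, 2, 3, 4, 4, 4, 5, 5]
print(cumulative)                       # 0.5
```

The cumulative relative frequency reaches exactly .500 at T = 10, confirming that the values above 10 belong to the other (negative) side of the distribution.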
The cumulative relative frequencies in Table 15.7 show the one-tailed P values for any observed positive value of T. These are one-tailed probabilities because a symmetrically similar distribution could
be constructed for the negative signed ranks. To interpret the results, if T = 4, P = .109, and if T = 3, P = .078. If we want the one-tailed P to be below .05, T must be ≤ 2. For the two-tailed P to be below .05, T must be 0.
Arrangements such as Table 15.7 can be constructed for any value of n. A summary of the key values found in such constructions is presented here in Table 15.8, which can be used to find the P values associated with T. For example, Table 15.8 shows what we have just found for the circumstance when n = 6: a T of 2 or less is required for two-tailed P to be ≤ .1, and T must be zero for a two-tailed P of ≤ .05.
TABLE 15.8
Critical Values of T for Signed-Ranks Test

     |        P: two-tailed test
n    | .1    | .05   | .02   | .01
-----|-------|-------|-------|------
6    | 2     | 0     |       |
7    | 3     | 2     | 0     |
8    | 5     | 3     | 1     | 0
9    | 8     | 5     | 3     | 1
10   | 10    | 8     | 5     | 3
11   | 13    | 10    | 7     | 5
12   | 17    | 13    | 9     | 7
13   | 21    | 17    | 12    | 9
14   | 25    | 21    | 15    | 12
15   | 30    | 25    | 19    | 15
16   | 35    | 29    | 23    | 19
17   | 41    | 34    | 27    | 23
18   | 47    | 40    | 32    | 27
19   | 53    | 46    | 37    | 32
20   | 60    | 52    | 43    | 37
21   | 67    | 58    | 49    | 42
22   | 75    | 65    | 55    | 48
23   | 83    | 73    | 62    | 54
24   | 91    | 81    | 69    | 61
25   | 100   | 89    | 76    | 68

This table is derived from Smart, J.V. Elements of Medical Statistics.
Springfield, IL: Charles C Thomas, 1965.
15.6.5 Use of T/P Table
To use Table 15.8, a suitable value must be chosen for n. If any of the observed increments in the data are 0, they cannot be counted as either positive or negative when the sums of ranks are formed. Therefore, zero increments should not be ranked, and the value of n should be the original number of pairs minus the number of pairs that had zero increments.
The 6 patients in Table 15.6 have no zero-increment pairs, and so n = 6. (If there were two zero-increment pairs, we would have had to use n − 2 = 4 for entering Table 15.8.) For all values of n in the table, the value of P gets smaller as T gets smaller. Because our calculated T is 5, we can say that P > .05, regardless of whether the result receives a two-tailed interpretation, or the one tail that might be used because treatment was expected to lower cholesterol. We can therefore conclude that a stochastically distinctive (P < .05, one-tailed) lowering of cholesterol has not been demonstrated.
With only six patients in the group, Table 15.8 shows that T would have to be 0 for a two-tailed P < .05, or 2 for 2P ≤ .1. If the third patient had an after-treatment value of 211, so that the B – A increment was 223 − 211 = 12, rather than the observed −12, the value of T would have been 2, and the result would have a one-tailed P value of .047 (from Table 15.7).
15.7 Challenges of Descriptive Interpretation
All of the foregoing discussion of statistical “machinery” describes the operating procedures of nonparametric rank tests. You need not remember the basic strategy or the specific tactics, as the tests today are almost always done with a suitable computer program. Because the program will produce a P value for stochastic significance, your main challenge will be to make decisions about quantitative significance.
The latter task will be difficult, however, because ordinal data do not have precise central indexes that can be quantitatively contrasted, and the non-parametric rank tests offer P values, but not confidence intervals.
15.7.1 Disadvantages of Customary Approaches
In the absence of a simple, obvious central index for each ordinal group, the contrasts have been done with several approaches, cited in the next three subsections, that are not fully satisfactory.
15.7.1.1 Dichotomous Compression — As noted earlier (see Section 3.5.1.3), ordinal data can be compressed dichotomously and then summarized as a binary proportion. For example, suppose we want to compare groups A and B, which have the following frequency counts in a five-category ordinal scale:
        | Much Worse | Worse | Same | Better | Much Better
--------|------------|-------|------|--------|------------
Group A | 6          | 10    | 34   | 27     | 19
Group B | 37         | 9     | 4    | 36     | 10
The better and much better results could be combined to form a “central index” for patients who became at least better. The result would be (27 + 19)/(6 + 10 + 34 + 27 + 19) = 46/96 = .48 for group A and 46/96 = .48 for group B. Although the two sets of data have quite different distributions, they would have the same binary proportion for the rating of at least better.
15.7.1.2 Median Value — The central index of a set of ordinal data could also be cited according to the median value, but this tactic also has the disadvantage of losing distinctions in the rankings. Thus, the foregoing data for Groups A and B have the identical median value, same, in each group.
15.7.1.3 Dimensional Conversions — If the Wilcoxon approach “loses” data, an opposite approach can gain “pseudo-data” if we convert the ordinal grades to dimensional codes. Thus, if we assign coded digits of 1 = much worse, 2 = worse, 3 = same, 4 = better, and 5 = much better, and if we then use the digits as though they were dimensional values, an arithmetical mean can be calculated for the ordinal ratings. With this type of coding, the means in the foregoing example would be [(1 × 6) + (2 × 10) + (3 × 34) + (4 × 27) + (5 × 19)]/96 = 3.45 in Group A and [(1 × 37) + (2 × 9) + (3 × 4) + (4 × 36) + (5 × 10)]/96 = 2.72 in Group B.
The two groups could now be distinguished as different, but only via the somewhat “shady” tactic of giving arbitrary dimensional values to data that are merely ranked categories. Thus, the customary binary proportions, medians, and means may not offer a fully satisfactory approach for identifying desirable contrasts of central indexes in ordinal data.
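The three customary summaries just described (dichotomous proportion, median category, and coded mean) can be computed side by side for the two illustrative groups; a sketch, with helper names chosen here for illustration:

```python
# Frequency counts for the five ordinal categories of Section 15.7.1
categories = ["much worse", "worse", "same", "better", "much better"]
group_a = [6, 10, 34, 27, 19]
group_b = [37, 9, 4, 36, 10]

def at_least_better(freqs):
    # Dichotomous compression: proportion rated better or much better
    return (freqs[3] + freqs[4]) / sum(freqs)

def median_category(freqs):
    # Category containing the middle of the cumulative distribution
    half = sum(freqs) / 2
    cum = 0
    for name, f in zip(categories, freqs):
        cum += f
        if cum >= half:
            return name

def coded_mean(freqs):
    # "Pseudo-dimensional" mean using arbitrary codes 1..5
    return sum(code * f for code, f in enumerate(freqs, start=1)) / sum(freqs)

print(round(at_least_better(group_a), 2), round(at_least_better(group_b), 2))  # 0.48 0.48
print(median_category(group_a), median_category(group_b))                      # same same
print(round(coded_mean(group_a), 2), round(coded_mean(group_b), 2))            # 3.45 2.72
```

Only the coded means separate the two groups, which illustrates why each approach is unsatisfactory in a different way.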
15.7.2 Additional Approaches
The non-parametric rank tests were originally developed without attention to an index of descriptive contrast. Wilcoxon had no need for one because he could immediately compare the observed dimensional
means; he used the non-parametric rank test only for its stochastic simplicity in providing P values. When non-parametric rank tests received further development in the academic world of mathematical statistics, the process was aimed solely at a stochastic, not descriptive, contrast.
Accordingly, we need something that is both legitimately ordinal and also applicable as an index of contrast for two ordinal groups. The next few subsections describe the available options.
15.7.3 Comparison of Mean Ranks
From the data presented in Section 15.5.2, we can determine the mean ranks as 2093.5/46 = 45.5 in the placebo group and 3366.5/58 = 58.0 in the actively treated group. These two mean values are “legitimately ordinal” because they emerged from the ranks of the data. No arbitrary dimensional values such as 1, 2, 3, ... were assigned to any of the ordinal categories such as worse, no change.
Because these ranks have no associated units, we could contrast the two mean rank values as a ratio, 58.0/45.5 = 1.27. The result would be impressive if you regard the ratio of 1.27 as quantitatively significant. If you are willing to accept the average-rank values of 5.5, 19, 51.5, and 90 as being analogous to dimensional data, however, we can also calculate a standardized increment. In the placebo group, the group variance will be 8(5.5)² + 9(19)² + 19(51.5)² + 10(90)² − [(2093.5)²/46] = 39606.745, and the standard deviation is √(39606.745/45) = 29.67. In the actively treated group, the counterpart results are 2(5.5)² + 8(19)² + 29(51.5)² + 19(90)² − [(3366.5)²/58] = 38361.643 for group variance and √(38361.643/57) = 25.94 for standard deviation. The pooled standard deviation is √[(39606.745 + 38361.643)/102] = 27.6, and the standardized increment (or effect size) will be (58.0 − 45.5)/27.6 = .45, which could be regarded as modest (according to the criteria in Chapter 10).
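The mean-rank contrast can be reproduced directly from the category frequencies of Table 15.5; a sketch, in which the average ranks are derived from the combined group rather than typed in:

```python
from math import sqrt

# Category frequencies from Table 15.5 (worse, no change, improved, much improved)
placebo = [8, 9, 19, 10]
active  = [2, 8, 29, 19]

# Average rank for each category, from the combined group of 104 patients
combined = [p + a for p, a in zip(placebo, active)]
avg_rank, start = [], 0
for f in combined:
    avg_rank.append((2 * start + f + 1) / 2)   # mean of ranks start+1 .. start+f
    start += f
# avg_rank is [5.5, 19.0, 51.5, 90.0]

def rank_stats(freqs):
    n = sum(freqs)
    total = sum(f * r for f, r in zip(freqs, avg_rank))
    ss = sum(f * r * r for f, r in zip(freqs, avg_rank)) - total ** 2 / n
    return n, total, ss

n1, sum1, ss1 = rank_stats(placebo)            # 46, 2093.5, ≈ 39606.745
n2, sum2, ss2 = rank_stats(active)             # 58, 3366.5, ≈ 38361.643
pooled_sd = sqrt((ss1 + ss2) / (n1 + n2 - 2))  # ≈ 27.6
effect = (sum2 / n2 - sum1 / n1) / pooled_sd
print(round(sum1 / n1, 1), round(sum2 / n2, 1), round(effect, 2))  # 45.5 58.0 0.45
```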
15.7.4 Ridit Analysis
The technique of ridit analysis, proposed by Bross7 in 1958, is analogous to the tactic just described for assigning values to the ranks. A ridit, which is an acronym for “relative to an identified distribution,” is calculated for each of a set of ordinal categories and represents “the proportion of all subjects from the reference group falling in the lower ranking categories plus half the proportion falling in the given category.” The reference group is usually the total population in the study. After the ridits are determined for each category, the mean ridits can be calculated to compare the two groups. The ridit approach is illustrated here so you can see how it works, but you need not remember it, because (for reasons cited later) it is hardly ever used today.
15.7.4.1 Illustration of Procedure — For the data in Table 15.5, the worse category occupies 10/104 = .096 of the total group, and will have a ridit of .096/2 = .048. For the no change category, which has a proportion of 17/104 = .163 in the data, the ridit will be (.163/2) + .096 = .178. For the improved category, with proportion 48/104 = .462, the ridit will be (.462/2) + .096 + .163 = .490. Finally, the much improved category, with proportion 29/104 = .279, will have (.279/2) + .096 + .163 + .462 = .860. Because the mean ridit in the reference group is always .5, a check that the ridits have been correctly calculated for Table 15.5 shows (10 × .048) + (17 × .178) + (48 × .490) + (29 × .860) = 51.97 as the sum of the ridits, and their mean is 51.97/104 = .5.
Having established the ridit values, we can calculate the mean ridit of the active-agent group as [(2 × .048) + (8 × .178) + (29 × .490) + (19 × .860)]/58 = .553. The corresponding mean ridit in the placebo group is [(8 × .048) + (9 × .178) + (19 × .490) + (10 × .860)]/46 = .433. Their ratio, .553/.433, is 1.28. (Recall, from Section 15.7.3, that the corresponding mean ranks were 58.0 and 45.5, with a ratio of 1.27.) Interpreted as probabilities, the ridit results suggest that an actively treated person has a chance of .553 of getting a result that is better than someone in the placebo group.
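The ridit construction follows Bross’s rule mechanically (proportion below the category plus half the proportion within it); a sketch for the Table 15.5 data:

```python
# Category frequencies in the total (reference) group of Table 15.5
freqs = {"worse": 10, "no change": 17, "improved": 48, "much improved": 29}
total = sum(freqs.values())                 # 104

# Ridit = proportion of the reference group below the category
#         + half the proportion within the category
ridits, below = {}, 0.0
for cat, f in freqs.items():
    p = f / total
    ridits[cat] = below + p / 2
    below += p

# Mean ridit for each treatment group
active  = {"worse": 2, "no change": 8, "improved": 29, "much improved": 19}
placebo = {"worse": 8, "no change": 9, "improved": 19, "much improved": 10}

def mean_ridit(group):
    n = sum(group.values())
    return sum(f * ridits[c] for c, f in group.items()) / n

print(round(mean_ridit(active), 3), round(mean_ridit(placebo), 3))  # 0.553 0.433
```

Working from the exact proportions (rather than the three-decimal values in the text) gives the same mean ridits of .553 and .433.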
Standard errors can be calculated for ridits, and the results can be stochastically examined with a Z test. The textbook by Fleiss8 shows a simple formula for calculating the standard error of the difference in two mean ridits, r̄2 and r̄1. The value of Z is then determined as (r̄2 − r̄1)/SED.
15.7.4.2 Similarity to Mean-Rank Procedure — The ridit values are reasonably similar to what would have been found if the mean ranks in each category had been expressed as percentiles. When divided by 104 (the total group size), the mean rank of 5.5 becomes a percentile of .052, 19 becomes .183, 51.5 becomes .495, and 90 becomes .865. The respective ridits that correspond to these percentiles are .048, .178, .490, and .860.
The mean average ranks in the two groups, 45.5 vs. 58.0, could each be divided by 104 to become “percentiles” of .438 vs. .558. The corresponding ridit means are .433 vs. .553. The respective ratios, 1.27 for the mean ranks and 1.28 for the ridits, are essentially the same.
15.7.4.3 Current Status of Ridit Analysis — After being initially proposed, ridit analysis was used in investigations of the epidemiology of breast cancer,9 classifications of blood pressure,10 life stress and psychological well-being,11 economic status and lung cancer,12 and geographic distinctions in schizophrenia.13 The ridit method seems to have gone out of favor in recent years, however, perhaps because of a vigorous attack by Mantel14 in 1979. He complained that the ridit procedure, although intended “for descriptive purposes,” was unsatisfactory for both description and inference. Descriptively, the ordinal ridits emerge arbitrarily from frequency counts in the data, not from a biological or logical scheme of scaling; and inferentially, the formulation of variance for ridits has many “improprieties.”
Because stochastic decisions for two groups of ordinal data can be done with a Wilcoxon-Mann-Whitney U test, and the mean rank for each group produces essentially the same result as the mean ridit, the ridit technique offers no real advantages. It is mentioned here so you will have heard of it in case it is suggested to you or (less likely) encountered in a published report.
15.7.5 U Score for Pairs of Comparisons
Another descriptive approach for categorical frequencies in two groups was proposed by Moses et al.15 For nA and nB members in the two groups, each item of data in Group A can be compared with each item in Group B to produce nA × nB pairs of comparisons. Under the null hypothesis, half of these comparisons should favor Group A and half should favor Group B. When we check the comparisons, we count those favoring Group A, those favoring Group B, and the ties. Their sum should be nA × nB.
When this tactic is applied to Table 15.5, there are 16 tied pairs (= 8 × 2) at the rank of worse, 72 (= 9 × 8) tied pairs at no change, 551 (= 19 × 29) at improved, and 190 (= 10 × 19) at much improved, yielding a total of 829 tied pairs. The number of pairs in which active agent scored better than placebo is 8(8 + 29 + 19) + 9(29 + 19) + 19(19) = 1241. The number of pairs in which placebo scored better than active agent is 2(9 + 19 + 10) + 8(19 + 10) + 29(10) = 598. The total of (58)(46) = 2668 pairs can thus be divided into 1241 that favor active agent, 598 that favor placebo, and 829 ties. If we credit half the ties to placebo and half to active agent, the scores become 1241 + (829/2) = 1655.5 for active agent and 598 + (829/2) = 1012.5 for placebo.
These numbers should look familiar. They are the values of U calculated for the active and placebo groups in Section 15.5.3. Accordingly, we could have avoided all of the foregoing calculations for pairs of ties, etc. by using the relatively simple Mann-Whitney formulas for determining the two values of U. Conceptually, this result also lets us know that the “sequential placement” strategy of counting A-before-B, etc. is the exact counterpart of determining the just-described allocation of paired ratings.
UA or UB can be used as a descriptive index for expressing a group’s proportion of the total number of compared pairs. Thus, the proportionate score in favor of active treatment is 1655.5/2668 = .62. According to Moses et al.,15 this index suggests that .62 is the random chance of getting a better response to active treatment than to placebo in the observed trial. This descriptive attribute of the U index may be more valuable than its role as a stochastic index, because the same stochastic results are obtained with the rank sum and U tests, but the rank sum procedure does not provide a descriptive index.
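The pairwise-comparison counts are easy to verify by brute force over all 2668 pairs; a sketch using ordinal codes 1..4 for the four categories of Table 15.5:

```python
# Ordinal codes 1..4 for worse, no change, improved, much improved (Table 15.5)
active  = [1] * 2 + [2] * 8 + [3] * 29 + [4] * 19     # 58 patients
placebo = [1] * 8 + [2] * 9 + [3] * 19 + [4] * 10     # 46 patients

# Compare every active patient with every placebo patient
wins_active  = sum(1 for a in active for p in placebo if a > p)
wins_placebo = sum(1 for a in active for p in placebo if a < p)
ties         = sum(1 for a in active for p in placebo if a == p)
print(wins_active, wins_placebo, ties)     # 1241 598 829

# Credit half the ties to each group to obtain U for the active group
U_active = wins_active + ties / 2          # 1655.5
print(round(U_active / (len(active) * len(placebo)), 2))   # 0.62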
15.7.6 Traditional Dichotomous Index
Although all of the tactics just described are useful, none of them is familiar, and we have no “intuitive feeling” about how to interpret them. Probably the easiest and simplest “intuitive” approach to Table 15.5,
therefore, is to dichotomize the outcome variable and report the percentages of any improvement as (29 + 19)/58 = 82.8% for the active group and (19 + 10)/46 = 63.0% for placebo. The increment of .198 also produces an impressive NNE of 5.
Although dichotomous compression was unsatisfactory for the example in Section 15.7.1.1, the binary “cut point” was not well chosen in that example. If the binary split had been demarcated to separate the ratings of worse and much worse (rather than the rating of at least better), the proportions of patients who were worse or much worse would have been 16/96 = .17 for Group A and 46/96 = .48 for Group B in that example. A clear distinction in the two groups would then be evident.
In general, whenever two groups of ordinal grades have an evident distinction, it can usually be shown quantitatively with a suitable dichotomous split and cited with binary proportions that are easy to understand and interpret.
15.7.7 Indexes of Association
Yet another way of describing the quantitative contrast of two groups of ordinal data is available from indexes of association and trend that will be discussed much later in Chapter 27.
One index, called Kendall’s tau, relies on calculations such as U (in Section 15.4.1) that determine and score the sequence of placement for corresponding ranks. A second descriptive index can be the slope of the straight line that is fitted to the ordinal categories by regression-like methods.
A crude, quick descriptive idea of the association, however, can be obtained by forming a ratio between the observed increment in the sums of ranks and the “no expected” (rather than perfect) difference. For example, in Table 15.5, the sum of ranks would be (104)(105)/2 = 5460. If the two groups have equal values in ranks, their sums should be 5460/2 = 2730 in each group. The departure of each group’s rank sum from this value is 3366.5 − 2730 = 636.5 for the active group, and 2093.5 − 2730 = −636.5 for the placebo group. The value of 636.5/2730 = .23 indicates how greatly the sums proportionately deviate from an equal separation. Because the increment in rank sums is 3366.5 − 2093.5 = 1273 and 636.5/2730 = (1273/2)/(5460/2), the desired ratio can be obtained simply as
(increment in rank sums)/(total sum of ranks)
The value of .23 here is not far from the Kendall’s tau value obtained for this relationship in Chapter 27.
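This crude ratio is simple enough to compute directly from the rank sums of Table 15.5; a small sketch:

```python
# Rank sums from Table 15.5: active 3366.5, placebo 2093.5; N = 104
n_total = 104
total_rank_sum = n_total * (n_total + 1) / 2       # 5460.0
rank_sum_active, rank_sum_placebo = 3366.5, 2093.5

# (increment in rank sums) / (total sum of ranks)
ratio = (rank_sum_active - rank_sum_placebo) / total_rank_sum
print(round(ratio, 2))    # 0.23
```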
15.8 Role in Testing Non-Gaussian Distributions
A non-parametric rank test can be particularly valuable for demonstrating stochastic significance in situations where dimensional data do not have a Gaussian distribution. Because the t and Z tests are generally believed to be robust, the parametric tests are almost always used for contrasting two groups of dimensional data. Nevertheless, the non-parametric strategy can be a valuable or preferable alternative in non-Gaussian distributions where the group variance is greatly enlarged by outliers. Consider the dimensional values in the following two sets of data:
Group A: 15, 16, 16, 16, 17, 18, 19, 70; X̄A = 23.375; sA = 18.88.
Group B: 21, 23, 25, 27, 29, 32, 33, 37; X̄B = 28.375; sB = 5.42.
The mean in Group B seems substantially higher than the mean in Group A, but the large variance in Group A prevents the t test from being stochastically significant. With s²p = 7(18.88² + 5.42²)/14 = 192.92, the calculation of t produces (28.375 − 23.375)/√[192.92 × ((1/8) + (1/8))] = 5.00/6.95 = .72, which is too small to be stochastically significant.
On the other hand, the strikingly high variance in Group A suggests that something peculiar is happening. In fact, when we examine the data more closely, all values in Group A, except for the last
(outlier) member, are exceeded by the lowest value in Group B. When converted to ranks, the results are as follows:
Ranks for Group A: 1, 3, 3, 3, 5, 6, 7, 16; Sum = 44.
Ranks for Group B: 8, 9, 10, 11, 12, 13, 14, 15; Sum = 92.
With the lower sum serving as the basic component of U, we get U = 44 − [(8 × 9)/2] = 8. For 2P < .05 at n1 = n2 = 8 in Table 15.4, U must be ≤ 13. Therefore, the rank test shows the stochastic significance that was not obtained with the parametric t test. (If this situation seems familiar, it should be. An example of analogous data appeared as Exercise 11.1, and the example will surface again here as Exercise 15.3.)
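The rank conversion and the U calculation can be sketched as follows; the midranking helper is an illustrative implementation:

```python
# Dimensional values from the Section 15.8 example
group_a = [15, 16, 16, 16, 17, 18, 19, 70]
group_b = [21, 23, 25, 27, 29, 32, 33, 37]

# Midrank each value within the combined set of 16
combined = sorted(group_a + group_b)
def midrank(x):
    below = sum(1 for v in combined if v < x)
    tied = combined.count(x)
    return below + (tied + 1) / 2

ranks_a = [midrank(v) for v in group_a]
ranks_b = [midrank(v) for v in group_b]
print(sum(ranks_a), sum(ranks_b))        # 44.0 92.0

# U from the lower rank sum: 44 - 8(9)/2 = 8, below the critical 13 for 2P < .05
n_a = len(group_a)
U = sum(ranks_a) - n_a * (n_a + 1) / 2
print(U)                                 # 8.0
```

The single outlier (70) that inflated the Group A variance becomes merely the highest rank (16), which is why the rank test succeeds where the t test did not.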
15.9 Additional Comments
Two other important features to be noted about non-parametric rank tests are claims about the reduced “power-efficiency” of the tests, and the availability of other procedures beyond those that have been discussed.
15.9.1 “Power-Efficiency” of Rank Tests
Non-parametric tests were seldom used until Wilcoxon first began writing about them. Because he converted dimensional data to ranks, many complaints arose about the “loss of information,” even though — as noted in Section 15.8 — the “loss” can sometimes be highly desirable.
A different complaint about rank tests was that they had a reduced “power-efficiency.” The idea of statistical power, which will receive detailed discussion in Chapter 23, refers to the ability of a stochastic test to reject the null hypothesis when a quantitative difference is specified with an alternative hypothesis. Compared with parametric procedures for dimensional data, non-parametric rank procedures were less “efficient,” requiring larger sample sizes for this special “power” of “rejection.”
The complaint is true, but is seldom relevant today because the rank tests are almost always used for ordinal data, not for conversions of dimensional data. Besides, all the calculations of “efficiency” and “power” are based on Gaussian distributions. As shown in Section 15.8, however, a rank test can sometimes be more “powerful” than a parametric test in showing stochastic significance when distributions are not Gaussian.
15.9.2 Additional Rank Tests
Many other tests, which rarely appear in medical literature, are also available (if you want or need to use them) for one or two groups of ordinal data. The tests are labeled with a dazzling array of eponyms that include such names as Cramer-von Mises, Jonckheere-Terpstra, Kruskal-Wallis, Kuiper, Kolmogorov-Smirnov, Moses, Savage, Siegel-Tukey, and Van der Waerden. The tests are well described in textbooks by Bradley,16 Siegel and Castellan,17 and Sprent.18
15.9.3 Confidence Intervals
Because ranked categorical data have discrete values, confidence intervals calculated with the usual mathematical methods are inappropriate because they can produce dimensional results that are realistically impossible. Nevertheless, mathematical formulas have been proposed and are available19 for calculating the confidence intervals of medians or other quantiles. As computer-intensive statistics begin to replace the traditional mathematical theories, realistic confidence intervals, if desperately desired for ordinal data, can be obtained from a bootstrap procedure or from the array of possibilities that emerge with permutation rearrangements.
15.10 Applications in Medical Literature
A one-group or two-group rank test appears regularly although infrequently in medical literature. For example, the Wilcoxon-Mann-Whitney U tests were recently applied to evaluate ordinal data for “level of alertness” in a study of analgesia for the pain of sickle cell crisis20 and to compare distributions of blood manganese and magnetic-resonance-imaging pallidal index in patients with liver failure vs. controls.21 Wilcoxon Rank Sum tests were used to compare numbers of swollen joints in a clinical trial of oral collagen treatment for rheumatoid arthritis22 and to compare several ordinal-scaled baseline factors among hospitalized patients who did or did not consent to participate in a study of pressure ulcers.23 When patients with bacteremia or fungemia were compared with a control group,24 the Wilcoxon rank sum test was used for dimensional variables because they “were not normally distributed.” In a trial of topical tretinoin therapy for “photoaged skin,”25 with effects graded in a 5-category ordinal scale ranging from 0 = absent to 4 = severe, the signed-rank test was used for bilateral, paired comparisons of treated forearms, and the U test for “facial efficacy.” The signed-rank test was also used to compare differences in serum gastrin levels from one time point to another.26
In a study of cancer in the offspring of cancer survivors,27 the investigators applied an “exact version of the non-parametric rank sum test,” and in another instance a bootstrap procedure was used “to correct for loss of power of the Mann-Whitney U test due to the small sample size and tied observations.” 28 With increasing availability in “packaged” computer programs, the bootstrap and exact permutation methods may begin to replace both the traditional parametric and the conventional non-parametric rank tests. The key decision for the rank procedures will then be the choice of a particular descriptive index of contrast (such as U or one of the others cited throughout Section 15.7) to be used as a focus of the bootstraps or permutations.
A randomized permutation procedure was also used in an intriguing controversy when data were reanalyzed for a randomized double-blind, placebo-controlled crossover trial that had shown efficacy for a 10⁻¹² dilution of Rhus toxicodendron 6c as “active” homeopathic treatment for fibrositis.29 For the crossover, 30 patients “received active treatment and an identical placebo for one month each in random sequence,” without an intervening “washout” period. In the original analysis, the mean number of “tender points” after each phase of treatment was 14.1 for placebo and 10.6 for the “active” agent. The associated P value was <0.005 with the Wilcoxon rank sum test. The reanalysis of data30 was done with “randomization tests,” however. “When the original data set was randomized 20,000–50,000 times,” the results showed much higher, “non-significant” P values. The reanalyst concluded that the trial “provides no firm evidence for the efficacy of homeopathic treatment of fibrositis.”
The cogent point of contention, however, was not the use of a randomization rather than rank test. Instead, the reanalyst found a “treatment period interaction,” which means that the effects of whatever occurred in the first period of treatment had “carried over” into the second. According to conventional wisdom31 for such circumstances, “the only safe procedure is to restrict attention to the first treatment period only.” The impressive P values vanished when the analysis was confined to the first period effects alone.
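The randomization procedure described for the reanalysis can be sketched in a few lines. This is a minimal illustration, not part of the original text: the function name and the choice of the mean difference as the descriptive index are my own assumptions (the reanalyst could equally have permuted a rank-based index such as U).

```python
import random


def randomization_test(a, b, n_iter=20_000, seed=0):
    """One-sided randomization test for the difference in group means.

    Repeatedly shuffles the pooled values into two pseudo-groups of the
    original sizes and counts how often the shuffled mean difference is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = (sum(pooled[:len(a)]) / len(a)
                - sum(pooled[len(a):]) / len(b))
        if diff >= observed:
            hits += 1
    return hits / n_iter
```

With two clearly separated small groups, only a tiny fraction of the 20,000 random partitions will match or exceed the observed difference, so the returned proportion (the empirical P value) is small; with overlapping groups it rises toward the “non-significant” values found in the reanalysis.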
15.11 Simple Crude Tests
Two other tests mentioned here could have been cited in various previous chapters, but were saved for now because they can be applied to ordinal as well as dimensional data. Called the sign test and the median test, they are easy to do and easy to understand, although relatively “crude.” Nevertheless, they can sometimes promptly answer a stochastic question without resort to more complex work. They are sometimes called “quick and dirty” tests, but the “dirty” epithet is unfair. They are “dirty” only because they compress dimensional and ordinal data into binary scales. If stochastic significance is obtained at the “crude” binary level of compression, however, the more “refined” conventional tests are almost never necessary.
© 2002 by Chapman & Hall/CRC
15.11.1 Sign Test
The sign test is a particularly easy way to determine stochastic significance for matched-pair data in which two groups have been reduced to one. The signs of the observed increments are compared against the null-hypothesis expectation that positive and negative values will occur equally often.
For example, in Table 15.6, under the null hypothesis for the six comparisons, we would expect three increments to be positive and three negative. A P value for the observed results can then be calculated from the binomial expansion of the hypothesis that π = .5. Thus, if all 6 observations were positive, the one-tailed probability would be (0.5)^6 = (1/2)^6 = 1/64 = .016. The probability of getting exactly five positive differences and one negative would be 6(0.5)^5(0.5) = 6(1/2)^6 = 6/64 = .094. For getting 5 or more differences in a positive direction, the one-tailed P value would be .016 + .094 = .109. As this value already exceeds P = .05, the observed result (4 positives and 2 negatives) will probably not be stochastically significant.
The sign test is particularly useful as a rapid mental way of calculating stochastic significance when the paired results go exclusively in one direction. For example, if all results go in the expected direction for five matched pairs, the one-tailed P value will be (1/2)5 = 1/32 = .03. In this circumstance, stochastic significance could be promptly declared without any further calculations. (The tactic was used earlier in the answer to Exercise 13.6.2.)
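The binomial tail used by the sign test is easy to compute directly. The helper below is an illustrative sketch (not from the text); it sums the binomial terms for the observed or a more extreme number of positive signs under π = .5:

```python
from math import comb


def sign_test_p(n_pos, n_neg):
    """One-tailed sign-test P value: the probability, under pi = 0.5,
    of getting at least n_pos positive signs among n_pos + n_neg
    non-zero paired differences (a binomial tail sum)."""
    n = n_pos + n_neg
    return sum(comb(n, k) for k in range(n_pos, n + 1)) / 2 ** n
```

For five of five pairs in one direction, `sign_test_p(5, 0)` returns 1/32 = .03125, matching the mental calculation above.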
15.11.2 Median Test
In another simple stochastic procedure, the data are divided at the median value for the two groups. The arrangement forms a 2 × 2 table, which can then be tested appropriately with either the Fisher test or chi-square.
For example, in the congestive-heart-failure illustration of Table 15.5, the median value of the ratings is “improved.” Partitioned on one side of this median, the data form a 2 × 2 table as follows:
                 Below       Improved or
                 Improved    Much Improved    TOTAL

Placebo             17            29            46
Active Agent        10            48            58

TOTAL               27            77           104
The X^2 test for this table produces (104)[(17 × 48) − (10 × 29)]^2/[(27)(77)(58)(46)] = 5.2, with 2P < .025.
|||
Thus, the stochastic significance of this set of data could have been promptly demonstrated with a simple median test, avoiding the more complex Wilcoxon-Mann-Whitney U test.
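The shortcut formula for an uncorrected chi-square on a 2 × 2 table can be written as a one-line helper. This is an illustrative sketch (the function name and cell labels are my own):

```python
def chi_square_2x2(a, b, c, d):
    """Uncorrected chi-square for a 2 x 2 table laid out as
        a  b
        c  d
    using X^2 = N(ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)]."""
    n = a + b + c + d
    return (n * (a * d - b * c) ** 2
            / ((a + b) * (c + d) * (a + c) * (b + d)))
```

Plugging in the congestive-heart-failure table above, `chi_square_2x2(17, 29, 10, 48)` gives about 5.2, the value obtained in the text.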
In the days before easy electronic calculation (and sometimes even today), the median test could also be used for two groups of dimensional data. For example, consider the dimensional results shown for Group B and the eccentrically distributed Group A in Section 15.8. The median for the total of 16 items is at rank 8 1/2, which is between the values of 21 and 23, at 22. If the two groups are partitioned at the median, the frequency counts produce the following table:
                 Number of Items
            Below Median    Above Median    TOTAL

Group A          7               1             8
Group B          1               7             8

TOTAL            8               8            16
A crude, uncorrected, and inappropriate chi-square test for these data produces X^2 = 9.0, with 2P < .005. A more appropriate Fisher Test yields p = .005 for the observed table and an even smaller value, p = .00008, for the more extreme table whose rows are 8/0 and 0/8. The two-tailed P will be about .01. Thus, stochastic significance for these data could