[27.17], the slope of the gradient is
$$ \frac{104[(1 \times 8) + (2 \times 9) + (3 \times 19) + (4 \times 10)] - 46[(10 \times 1) + (17 \times 2) + (48 \times 3) + (29 \times 4)]}{104[(10 \times 1) + (17 \times 4) + (48 \times 9) + (29 \times 16)] - [(10 \times 1) + (17 \times 2) + (48 \times 3) + (29 \times 4)]^2} = \frac{12792 - 13984}{101296 - 92416} = \frac{-1192}{8880} = -.134 $$
The precise linear trend, showing a decline of 13.4% per category, is similar to the crude gradient of 15.3%. This is a quantitatively impressive trend, and it is also stochastically significant, as shown in Section 27.5.6.2.
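The slope arithmetic can be checked with a few lines of code. The following is a minimal sketch, assuming the category totals (10, 17, 48, 29) and the counts with the binary attribute (8, 9, 19, 10) as they appear in the worked expression; the function and variable names are illustrative and do not come from the text.

```python
def linear_trend_slope(codes, totals, events):
    """Least-squares slope of a 0/1 variable on linear category codes."""
    n = sum(totals)                                          # N, all members
    sum_y = sum(events)                                      # members coded Y = 1
    sum_x = sum(x * t for x, t in zip(codes, totals))        # sum of X
    sum_xx = sum(x * x * t for x, t in zip(codes, totals))   # sum of X squared
    sum_xy = sum(x * e for x, e in zip(codes, events))       # sum of XY
    return (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)


codes = [1, 2, 3, 4]          # linear coding of the four ordinal categories
totals = [10, 17, 48, 29]     # members in each category (N = 104)
events = [8, 9, 19, 10]       # members with the binary attribute (46 in all)

slope = linear_trend_slope(codes, totals, events)
print(round(slope, 3))  # -0.134
```

The same function can be reused for any r × 2 array, provided the linear coding of the categories is considered acceptable.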
The main problem with the foregoing arrangement is that it transposes the basic variables, so that the outcome category (improvement) becomes regarded as though it were the “independent” variable, and the preceding binary treatment becomes the “dependent” variable. Despite the transposed variables, however, the quantitative evaluation of trend is easy to understand and has been vigorously preferred28,29 over the customary statistical method30 (discussed in Chapter 15) for analyzing ordinal data.
$X^2_L$ — The distinction between $X^2$, for any array of categorical data, vs. $X^2_L$, for an ordinal array, should be kept in mind when stochastic tests are done for an r × 2 table. For example, the linear trend in proportions was used in Section 27.5.6.1 to provide a descriptive summary of the ordinal gradings in Table 15.5. When this result is tested stochastically, the Wilcoxon–Mann–Whitney U procedure takes account of the ordinal arrangement, but an ordinary $X^2$ test does not. Thus, the $X^2_L$ test should be used if stochastic significance is to be appraised for Table 15.5. In those results, the overall $X^2$ is 7.247, which has 2P = .06 at 3 degrees of freedom. The linear regression of the data, however, shows $r^2$ = .0600 and $X^2_L = Nr^2$ = (104)(.0600) = 6.24, which has 2P < .025 at 1 degree of freedom. This result is more consistent with the Wilcoxon–Mann–Whitney test for the same data in Section 15.5.4, where Z was −2.25, for which 2P = .024.
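The relation $X^2_L = Nr^2$ can be illustrated numerically. This is a minimal sketch, assuming the category counts implied by the slope arithmetic of Section 27.5.6.1 (totals 10, 17, 48, 29 with 8, 9, 19, 10 members having the binary attribute); the function name is hypothetical, and the tail probability relies on the fact that a chi-square variate with 1 degree of freedom is a squared standard normal deviate.

```python
import math


def chi_square_linear_trend(codes, totals, events):
    """Return (r squared, X2_L) for a 0/1 variable regressed on category codes."""
    n = sum(totals)
    sum_y = sum(events)
    sum_x = sum(x * t for x, t in zip(codes, totals))
    sum_xx = sum(x * x * t for x, t in zip(codes, totals))
    sum_xy = sum(x * e for x, e in zip(codes, events))
    s_xy = n * sum_xy - sum_x * sum_y
    s_xx = n * sum_xx - sum_x ** 2
    s_yy = n * sum_y - sum_y ** 2      # for 0/1 data, the sum of Y squared equals the sum of Y
    r2 = s_xy ** 2 / (s_xx * s_yy)
    return r2, n * r2


r2, x2_l = chi_square_linear_trend([1, 2, 3, 4], [10, 17, 48, 29], [8, 9, 19, 10])
p_two_sided = math.erfc(math.sqrt(x2_l / 2))   # upper tail of chi-square with 1 d.f.
print(round(r2, 4), round(x2_l, 2))  # 0.06 6.24
```

The computed tail probability falls below .025, in keeping with the statement in the text.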
27.5.6.3 Appraisal of Ordinal Staging Systems — With increasing attention being given to factors that affect etiologic “risk” for disease or prognostic outcome of disease, individual variables or groups of variables are regularly arranged into the ordinal categories of a “risk stratification” or prognostic “staging” system, such as Table 27.5. When several candidate variables are available, the linear trend of the gradient in the outcome event can be examined to help choose a “best” candidate.
27.5.6.4 “Significance” for Ordinal Partitions — The categorical variables used in a prognostic staging system can be constructed from an ordinal partition of dimensional data (such as dividing age into categories of <20, 21–49, 50–69, and 70+) or from an ordinal array (such as TNM stages I, II, III, …). After the variable has been constructed, stochastic significance may not always occur in results for adjacent categories, even though the gradient shows a distinct linear trend. Rather than compressing the categories to form larger numbers of members in a small number of groups, the analyst may be content to let the original partition remain if it shows stochastic significance in the $X^2_L$ test for linear trend.
27.5.6.5 Screening for “Double Gradients” — The conjunctive-consolidation procedure14,15 offers an easily understood multivariable-analytic method to construct prognostic staging systems. The method begins by screening for impressive “double gradients” in both the rows and columns of each conjunctive 2 × r × c table that becomes “consolidated.” An inspection of linear trend in the outcome rates for cells in the r rows and c columns is a useful way to do the screening.
27.5.7 Additional Approaches
Throughout the foregoing discussion, survival was examined as the “response” to a change in the ordinal category of severity. This type of “dose-response” phenomenon frequently occurs in clinical and epidemiologic literature. The ordinal “dose” can often be levels of a pharmaceutical agent or a risk factor.
© 2002 by Chapman & Hall/CRC
Problems can arise if the ordinal categories do not ascend in equi-interval dimensions that justify a linear coding, and if the responses do not show a linear relationship. Various alternative approaches,31–33 including spline regression, have been proposed for dealing with these problems. Nevertheless, the linear-trend method has the advantage of being a useful “screening” procedure that is relatively simple and easy to understand (despite the length of the previous discussion).
27.6 Indexes of Ordinal Contingency
In addition to tau-b, other indexes for analyzing trend in the graded categories of two-way ordinal contingency tables are called gamma, Somers D, and tau-c. These indexes all use the basic sequential-placement scoring system described in Section 27.3.1.1 for the co-related pattern of ranks.
27.6.1 Scores for Co-Relationship of Tabular Grades
The scoring system discussed in Section 27.3.1.1 for ordinal co-relationships is not too difficult to understand, but can be tricky to apply for tabular grades. To illustrate the basic strategy, suppose we have a 3 × 3 ordinal contingency table, with cells designated as a, b, … , h, i as shown in Table 27.9, which is oriented so that the X variable advances downward in the rows and the Y variable goes rightward in the columns.
TABLE 27.9
Contingency Table with Two Ordinal Variables
| Variable X | Variable Y: (1) Low | Variable Y: (2) Medium | Variable Y: (3) High |
| (1) Low | a | b | c |
| (2) Medium | d | e | f |
| (3) High | g | h | i |
The cells are compared and scored as we advance through the table, going across each row, down to the next row, and then across that row. The ranks for each pair of cells will be either tied, concordant, or discordant. Any two cells in the same row have tied ranks on the row variable, and any two cells in the same column are tied on the column variable. Between any two cells, the trend is concordant if the second cell has higher ranks in both the row and column variable. The trend is discordant if the second cell has a lower rank in the column variable while having a higher rank in the row variable, or vice versa.
Each score for concordances (marked P) and discordances (marked Q), and for ties, is the product of frequency counts in the two compared cells. The pattern of scoring for Table 27.9 is shown in Figure 27.3 and the scores are entered in the display of Table 27.10.
[FIGURE 27.3 (not reproduced): pattern of scoring for cells a through i of Table 27.9. Legend: cell already considered; ranks tied on row variable (X); ranks tied on column variable (Y); concordant ranks; discordant ranks.]
To illustrate the process, suppose we start with the a members of cell (1,1). The X variable is tied in cells (1,2) and (1,3) for members b and c, and the Y variable is tied in members d and g. In all other cells, i.e., e, f, h, and i, the ranks are higher for both variables. We thus enter the “scores” for cell a into Table 27.10 as P = a(e + f + h + i), Q = 0; ties on X for a(b + c) and ties on Y for a(d + g). Going right and down from the b members of cell (1,2), there are c ties on the X variable and (e + h) ties on the Y variable. Cells (2,1) and (3,1) have discordant values because the rank of variable Y is lower than 2, and cells (2,3) and (3,3) have concordant values because the rank of Y is higher. Thus, as shown in Table 27.10, the scores for cell (1,2) are b(f + i) for P, b(d + g) for Q, b(c) for ties in X and b(e + h) for ties in Y.
In the next step, nothing lies to the right of cell (1,3), but two downward cells are tied in the Y rank, and four cells in the next two downward rows are discordant because of lower values in the Y rank as X increases. Thus, the entries for cell (1,3) in Table 27.10 are P = 0, Q = c(d + e + g + h), X-ties = 0, and Y-ties = c(f + i). The entries for the remaining cells are shown in the rest of Table 27.10.
The scoring system can be pragmatically illustrated for Table 27.11, which has 9 cells showing hypothetical data of 150 hospitalized patients’ ratings for satisfaction with care in relation to clinicians’ ratings for severity of illness. The letters in parentheses of Table 27.11 correspond to the letters of Table 27.9. Table 27.12 shows the corresponding scores for P, Q, and ties. The total scores are 1339 for P, 3775 for Q, 2307 for ties in X, and 2347 for ties in Y.
The scores for P, Q, and ties in Table 27.12 become the constituents of the indexes called gamma, Somers D, tau-b, and tau-c, which differ only in the way they use these scores.
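The cell-by-cell scoring described above lends itself to a short routine. The following is a minimal sketch, assuming the table is held as a list of rows with the X variable advancing down the rows and the Y variable advancing across the columns; the function name `pair_scores` is hypothetical. Each pair of cells is visited once, in the same across-then-down order used in the text.

```python
def pair_scores(table):
    """Return (P, Q, Tx, Ty) for a two-way ordinal contingency table."""
    n_rows, n_cols = len(table), len(table[0])
    p = q = tx = ty = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for l in range(n_cols):
                    if (k, l) <= (i, j):
                        continue                  # cell already considered
                    product = table[i][j] * table[k][l]
                    if k == i:
                        tx += product             # same row: tied on X
                    elif l == j:
                        ty += product             # same column: tied on Y
                    elif l > j:
                        p += product              # lower row, farther right: concordant
                    else:
                        q += product              # lower row, farther left: discordant
    return p, q, tx, ty


# Frequencies of Table 27.11 (rows: mild, moderate, severe; columns: low, medium, high)
table_27_11 = [[8, 12, 25], [17, 13, 18], [32, 15, 10]]
scores = pair_scores(table_27_11)
print(scores)  # (1339, 3775, 2307, 2347)
```

The four totals agree with the column sums of Table 27.12.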
TABLE 27.10
Minding the Ps and Qs for Table 27.9
| Location | Cell | Score for P | Score for Q | Score for Ties: Variable X | Score for Ties: Variable Y |
| (1, 1) | a | a(e + f + h + i) | 0 | a(b + c) | a(d + g) |
| (1, 2) | b | b(f + i) | b(d + g) | b(c) | b(e + h) |
| (1, 3) | c | 0 | c(d + e + g + h) | 0 | c(f + i) |
| (2, 1) | d | d(h + i) | 0 | d(e + f) | d(g) |
| (2, 2) | e | e(i) | e(g) | e(f) | e(h) |
| (2, 3) | f | 0 | f(g + h) | 0 | f(i) |
| (3, 1) | g | 0 | 0 | g(h + i) | 0 |
| (3, 2) | h | 0 | 0 | h(i) | 0 |
| (3, 3) | i | 0 | 0 | 0 | 0 |
TABLE 27.11
Hypothetical Data for Relationship of Patient Satisfaction and Severity of Illness
| Clinician’s Ratings of Severity of Illness | Patients’ Ratings of Satisfaction with Care: Low | Medium | High | Total |
| Mild | 8(a) | 12(b) | 25(c) | 45 |
| Moderate | 17(d) | 13(e) | 18(f) | 48 |
| Severe | 32(g) | 15(h) | 10(i) | 57 |
| Total | 57 | 40 | 53 | 150 |
Note: Letters in parentheses correspond to the cells listed in Table 27.9.
27.6.2 Gamma
The gamma (or Goodman and Kruskal’s gamma)10–12 index is constructed in essentially the same way as tau-a and tau-b, except that ties are ignored. The formula is
$$ \text{gamma} = \frac{P - Q}{P + Q} \tag{27.24} $$
TABLE 27.12
Scores for P, Q, and Ties in Table 27.11
| Location | Cell | Score for P | Score for Q | Score for Ties: Variable X | Score for Ties: Variable Y |
| (1, 1) | a | 8(13 + 18 + 15 + 10) | 0 | 8(12 + 25) | 8(17 + 32) |
| (1, 2) | b | 12(18 + 10) | 12(17 + 32) | 12(25) | 12(13 + 15) |
| (1, 3) | c | 0 | 25(17 + 13 + 32 + 15) | 0 | 25(18 + 10) |
| (2, 1) | d | 17(15 + 10) | 0 | 17(13 + 18) | 17(32) |
| (2, 2) | e | 13(10) | 13(32) | 13(18) | 13(15) |
| (2, 3) | f | 0 | 18(32 + 15) | 0 | 18(10) |
| (3, 1) | g | 0 | 0 | 32(15 + 10) | 0 |
| (3, 2) | h | 0 | 0 | 15(10) | 0 |
| (3, 3) | i | 0 | 0 | 0 | 0 |
| Totals | | 1339 | 3775 | 2307 | 2347 |
For the data in Tables 27.11 and 27.12, gamma will be (1339 − 3775)/(1339 + 3775) = –2436/5114 = −.48. The moderately strong negative value, indicating an inverse trend, is consistent with the hypothesis that satisfaction with care goes down as severity of illness goes up.
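As a quick check of this arithmetic, the gamma calculation can be written as a one-line function. This is only a sketch; the function name is illustrative, and the inputs are the P and Q totals quoted above.

```python
def gamma_index(p, q):
    """Goodman and Kruskal's gamma: ties are ignored entirely."""
    return (p - q) / (p + q)


gamma = gamma_index(1339, 3775)
print(round(gamma, 2))  # -0.48
```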
27.6.3 Somers D
Somers D resembles gamma, but is asymmetric and takes ties into account. The asymmetry is expressed in two indexes. One of them, Dyx, accounts for the number of pairs tied on variable Y, but not on X. The expression is
$$ D_{yx} = \frac{P - Q}{P + Q + T_y} \tag{27.25} $$
where Ty is the score for the pairs tied on Y but not on X. The other index, Dxy, accounts for ties on variable X, but not on Y. The expression is
$$ D_{xy} = \frac{P - Q}{P + Q + T_x} \tag{27.26} $$
where Tx is the score for the corresponding tied pairs.
For the data under discussion, Ty = 2347 and Tx = 2307. The value of Dyx = −2436/(5114 + 2347) = −.326 and Dxy = −2436/(5114 + 2307) = −.328.
In this instance, because of the belief that satisfaction (the Y variable) depends on severity of illness (the X variable), the appropriate index is Dyx. Its value, somewhat analogous to a regression coefficient, is smaller than gamma because of the ties that enlarge the denominator under P − Q. (If someone believed that dissatisfied people are particularly likely to become or be made seriously ill, the appropriate index would be Dxy. In this example, but not always, the values for Dyx and Dxy happen to be quite similar.)
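The two asymmetric indexes differ only in which tie score enters the denominator, as the following sketch suggests. The function name and argument names are illustrative; the numerical inputs are the totals quoted in the text.

```python
def somers_d(p, q, ties_on_dependent):
    """Somers D: ties on the dependent variable are added to the denominator."""
    return (p - q) / (p + q + ties_on_dependent)


P, Q, T_X, T_Y = 1339, 3775, 2307, 2347
d_yx = somers_d(P, Q, T_Y)   # Y (satisfaction) regarded as dependent
d_xy = somers_d(P, Q, T_X)   # X (severity) regarded as dependent
print(round(d_yx, 3), round(d_xy, 3))  # -0.326 -0.328
```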
27.6.4 Tau-b
After the values of P, Q, Tx, and Ty are obtained, tau-b can be calculated for contingency tables, using either Formula [27.3] or the alternative
$$ \text{tau-b} = \frac{P - Q}{\sqrt{(P + Q + T_y)(P + Q + T_x)}} \tag{27.27} $$
which, in the cited example, becomes $-2436/\sqrt{(7461)(7421)}$ = –.33.
In Formula [27.27], the structure of tau-b uses the two Somers indexes in a way resembling the calculation of the Pearson correlation coefficient, r, as the geometric mean of the two slopes b and b′. Thus, $r = \sqrt{bb'}$ and tau-b = $\sqrt{D_{yx} D_{xy}}$. Accordingly, tau-b could have been calculated from Dyx and Dxy as $\sqrt{(-.326)(-.328)}$ = –.33, with the negative sign carried over from the two indexes.
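The equivalence between Formula [27.27] and the geometric mean of the two Somers indexes can be verified numerically. The sketch below uses the totals quoted in the text; the function name is illustrative, and the sign of the geometric mean is taken from P − Q, which is an explicit choice made here because a square root alone cannot supply it.

```python
import math


def tau_b(p, q, t_x, t_y):
    """Tau-b for a contingency table, from the P, Q, and tie scores."""
    return (p - q) / math.sqrt((p + q + t_y) * (p + q + t_x))


P, Q, T_X, T_Y = 1339, 3775, 2307, 2347
tau = tau_b(P, Q, T_X, T_Y)

d_yx = (P - Q) / (P + Q + T_Y)
d_xy = (P - Q) / (P + Q + T_X)
geometric_mean = math.copysign(math.sqrt(d_yx * d_xy), P - Q)

print(round(tau, 2), round(geometric_mean, 2))  # -0.33 -0.33
```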
27.6.5 Tau-c
The calculation of tau-b requires a “square” table, with equal numbers of rows and columns. If the table is rectangular, rather than square, the value of tau is corrected to
$$ \text{tau-c} = \frac{2(P - Q)}{N^2[(m - 1)/m]} \tag{27.28} $$
where N is the total number of members in the entire table, and m is the smaller value of either the number of rows or the number of columns. Formula [27.28] should not be used for square tables, such as Table 27.11, but if applied, the result would be
$$ \text{tau-c} = \frac{2(-2436)}{(150)^2[(3 - 1)/3]} = \frac{-4872}{15000} = -.325 $$
which is almost the same as the value obtained for tau-b, while being much simpler to compute. Gamma and Somers D do not require an analogous correction for non-square tables.
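Formula [27.28] needs only P, Q, the total N, and the dimensions of the table. The following sketch rearranges the formula slightly, multiplying numerator and denominator by m, so that the arithmetic stays close to exact; the function name is hypothetical.

```python
def tau_c(p, q, n, n_rows, n_cols):
    """Tau-c, with m the smaller of the number of rows or columns."""
    m = min(n_rows, n_cols)
    return 2 * m * (p - q) / (n ** 2 * (m - 1))


tc = tau_c(1339, 3775, 150, 3, 3)
print(round(tc, 3))  # -0.325
```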
27.6.6 Lambda
Table 27.11 could also be summarized with the lambda index discussed in Sections 27.2.1 and 27.4.1.1. With 57 low ratings for satisfaction, the modal estimation would have 150 – 57 = 93 errors. Using the severity-of-illness ratings, the estimates of high satisfaction for the mild and moderate rows of illness and low satisfaction for severe illness would respectively have (45 − 25) + (48 − 18) + (57 − 32) = 75 errors. The lambda index would be (93 − 75)/93 = .19.
This result, using a method of estimation, is much lower than the range of absolute values from .32 to .48 in all of the corresponding indexes that use scores for trend: gamma, Somers D, and the two tau coefficients of association. The reason for the disparity is that lambda reflects accuracy in estimation, whereas the “score” indexes reflect trend. Lambda is thus analogous to $r^2$, whereas the score indexes are analogous to r.
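The error counts behind lambda can be tallied directly from the table. This is a minimal sketch, assuming the column variable (satisfaction) is the one being estimated and the rows (severity) supply the additional information; the function name is illustrative.

```python
def lambda_index(table):
    """Proportionate reduction in modal-estimation errors for the column variable."""
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(column_totals)
    errors_without_rows = n - max(column_totals)               # guess the overall mode
    errors_with_rows = sum(sum(row) - max(row) for row in table)  # guess each row's mode
    return (errors_without_rows - errors_with_rows) / errors_without_rows


table_27_11 = [[8, 12, 25], [17, 13, 18], [32, 15, 10]]
lam = lambda_index(table_27_11)
print(round(lam, 2))  # 0.19
```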
27.7 Other Indexes of Ranked Categorical Trend
Ranked categorical trends can be determined whenever one of the variables has dimensional or ordinal data that can be ranked in relation to categories of the other variable. The diverse ways of arranging variables, categories, and tables have evoked many proposals of statistical indexes. The ones that have been cited in this chapter were chosen because they seem particularly pertinent for medical literature, or because they commonly appear in printouts of computer “package” programs for indexing associations. Beyond the bi-ordinal coefficients and contingency tables just discussed, three additional indexes are briefly outlined here. The text by Freeman34 has particularly good worked examples of their application.
Jaspen’s multiserial coefficient (dimensional × ordinal) can express the association between a dimensional variable, such as weight, and an ordinal variable, such as social class. The coefficient has been used35 to cite the relationship between a dimensionally measured “scan artifact” in pelvic magnetic resonance imaging, and a four-grade ordinal scale of radiologist’s assessment of scan quality. A computer program has been constructed36 to do the calculations, but the index is seldom used, probably because many analysts ignore the mathematical impropriety, accept the ordinal grades as dimensional values, and then use conventional bi-dimensional regression methods.
In the biserial coefficient (dimensional × binary), which is a special case of the Jaspen multiserial coefficient, the ordinal scale has only two (binary) ranks. The biserial index can be pertinent when a binary variable, such as existence of survival, depends on a dimensional variable, such as age or weight. The coefficient would indicate the association in data showing the proportions of survivors at different ages or weights. If the dimensional variable is time and adequate data are available, the biserial coefficient might be used as an index of association for the survival curve. The association between binary and dimensional data could also be expressed with an index of linear trend in an r × 2 ranked array, as discussed in Section 27.5.
Freeman’s theta (ordinal × nominal) can be used when an ordinal variable, such as social class, is examined in a set of nominal groups, A, B, C,… . The theta (θ ) coefficient for this association was developed as a multi-group extension of the Wilcoxon signed-rank procedure.
27.8 Association in “Large” r × c Tables
The associations that have not yet been discussed occur in a “large” r × c table (where r and c are each ≥ 3) for two nominal variables. In such tables, which are not common, the arrangements are best called “associations,” because the idea of “trend” seems inappropriate if either or both of the two variables is nominal. To ease comprehension, many analysts try to condense the big tables into a 2 × 2 structure by “collapsing” (i.e., compressing) suitable categories.
27.8.1 “Screening” with φ and $X^2$
Before the “collapse,” the entire r × c table is often “screened” to see whether any “significant” findings are present. The most common screening procedure is to examine the φ coefficient of correlation for the two categorical variables. To determine φ, the expected values are first noted in each cell, and the value of $X^2$ for the entire table is calculated as the sum of the [(observed − expected)$^2$/expected] values in the cells. The value of φ, calculated as $\phi = \sqrt{X^2/N}$, is then interpreted like a correlation coefficient, with values ranging from 0 to |1|.
Further examination beyond the screening can be stimulated by a stochastically significant $X^2$ [interpreted at (r − 1)(c − 1) degrees of freedom], by a quantitatively significant value of φ, or by both. The “significant” cells in the table will be those with individually high values of the [(observed − expected)$^2$/expected] ratios.
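The screening steps can be sketched in code as follows. The frequencies are hypothetical and chosen only for illustration, the function name is not from the text, and the sketch assumes that every row and column total is greater than zero so that no expected value is zero.

```python
import math


def chi_square_screen(table):
    """Return X2, phi, degrees of freedom, and the per-cell contributions."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    contributions = []
    for row, row_total in zip(table, row_totals):
        cells = []
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / n
            cells.append((observed - expected) ** 2 / expected)
        contributions.append(cells)
    x2 = sum(sum(cells) for cells in contributions)
    phi = math.sqrt(x2 / n)
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return x2, phi, degrees_of_freedom, contributions


# Hypothetical 3 x 4 table of two nominal variables
hypothetical = [[12, 7, 9, 4], [6, 15, 8, 10], [5, 9, 14, 11]]
x2, phi, df, contributions = chi_square_screen(hypothetical)
print(round(x2, 2), round(phi, 2), df)
```

Cells with the largest entries in `contributions` would be the candidates for further inspection.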
Other data analysts, however, may collapse the table immediately and omit the “screening” process. The collapsing is usually done according to “sensible” strategies derived from the biologic content of the data and/or from the quantitative need to combine cells with small frequencies.
27.8.2 Example in Medical Literature
Large r × c tables with nominal categories seldom appear in medical literature. In one example, however, a letter37 to the editor of The New England Journal of Medicine reported a study of bimanual dexterity and batting “handedness” in baseball players. The players were divided into five groups: high school students, elementary school students, all recorded professional players, current major-league players (excluding pitchers), and best hitters of all time. The players’ handedness was divided into 6 categories based on whether each player threw right-handed or left-handed and also batted right, left, or both ways.
The 5 × 6 table was then analyzed with a chi-square procedure, which led to the conclusion that the “professional baseball players had a significantly … higher number of left handed batters than the (student) controls.”
27.8.3 φ in 2 × 2 Tables
φ is seldom applied for indexing a 2 × 2 table, because the investigator will usually want to contrast the two proportions as an increment or ratio, not with a correlation coefficient.
If the 2 × 2 table is subjected to regression analysis, however, the slope (as shown in Section 27.5.3.1) is simply the increment, $p_2 - p_1$. Because the two proportions can be compared in two directions, as $a/n_1$ vs. $c/n_2$ or as $a/f_1$ vs. $b/f_2$, two different increments can be determined. Appropriate algebraic activity will demonstrate that φ, as a correlation coefficient, is the geometric mean of these two increments.
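The “algebraic activity” can be illustrated numerically. The following sketch uses hypothetical cell frequencies and assumes that the two n symbols denote the row totals and the two f symbols denote the column totals, which is how the comparisons in the text appear to be arranged.

```python
import math

a, b, c, d = 30, 10, 20, 40          # hypothetical 2 x 2 frequencies
n1, n2 = a + b, c + d                # row totals
f1, f2 = a + c, b + d                # column totals

increment_by_rows = a / n1 - c / n2
increment_by_columns = a / f1 - b / f2

phi = (a * d - b * c) / math.sqrt(n1 * n2 * f1 * f2)
geometric_mean = math.sqrt(increment_by_rows * increment_by_columns)

print(round(phi, 3), round(geometric_mean, 3))  # 0.408 0.408
```

Both increments share the sign of ad − bc, so their product is never negative; for a negative association, the sign would have to be restored after taking the square root.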
27.8.4 Alternatives to φ
An ideal correlation coefficient should be able to range from values of 0 to |1|. In a 2 × 2 table, the maximum value that $X^2$ can attain is N; and so the largest possible value for $\phi^2$ will be N/N = 1. (This situation occurs when diagonally opposite cells such as a, d or b, c are both empty in a table structured as $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$.) In a larger r × c table, however, $X^2$ can exceed the value of N, and if so, $\phi^2$ will have an inappropriate value that exceeds 1. To avoid this problem, three alternative indexes are available, all of them constructed as modifications of $X^2/N$.
27.8.4.1 Pearson’s C — Pearson’s “contingency coefficient,” C, is calculated as
$$ C^2 = \frac{X^2}{X^2 + N} $$
With this structure, $C^2$ will always be <1. When $X^2 = N$, $C^2$ will be 1/2. C is seldom used for 2 × 2 tables because its maximum value will be $\sqrt{1/2}$ = .71 and because it does not have a good “intuitive” meaning for larger tables.
27.8.4.2 Tschuprow’s T — To eliminate the possibility of getting a correlation coefficient that exceeds 1, Tschuprow’s $T^2$ divides $\phi^2$ by $\sqrt{(r - 1)(c - 1)}$. The formula is
$$ T^2 = \frac{X^2/N}{\sqrt{(r - 1)(c - 1)}} $$
The square root of this value is Tschuprow’s T. In a 2 × 2 table, (r − 1)(c − 1) = 1, and so the result will be identical to φ. For “large” tables, (r − 1)(c − 1) will exceed 1, and its square root division will lower the value of φ.
27.8.4.3 Cramer’s V — Tschuprow’s T has the disadvantage of producing an overcorrection. Although kept from being >1, the correlation coefficient often cannot reach the permissible maximum value of 1. This problem is particularly likely to occur if r is much greater than c (or vice versa) in a large r × c table.
To eliminate this problem, Cramer divided $X^2/N$ by an entity called Min (r − 1, c − 1), which represents the smaller value of either r − 1 or c − 1. The formula is
$$ V^2 = \frac{X^2}{N \times \text{Min}(r - 1, c - 1)} $$
Thus, if r = 3 and c = 4, the smaller value is r − 1 = 2, and $V^2 = X^2/2N$. When r = c, Tschuprow’s T and Cramer’s V are identical. When either r or c is 2, V and φ will be identical.
This attribute of Cramer’s V makes it the preferred index of association for large r × c tables. The index almost never appears in medical literature, but was used many years ago in an analysis where the investigators screened for correlations of 7 categorical variables in a study of limb sarcomas.38 The number of categories in the variables ranged from two to four; and Cramer’s V was used as the index of association because “it adjusts for … degrees of freedom … in a convenient manner, so that tables with differing degrees of freedom may be compared.”
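The four indexes of this section can be compared side by side once $X^2$, N, r, and c are known. The following is a minimal sketch; the function name, the dictionary keys, and the numerical inputs are all hypothetical and serve only to show how the formulas relate to one another.

```python
import math


def association_indexes(x2, n, r, c):
    """Phi, Pearson's C, Tschuprow's T, and Cramer's V from X2 for an r x c table."""
    phi_squared = x2 / n
    return {
        "phi": math.sqrt(phi_squared),
        "C": math.sqrt(x2 / (x2 + n)),
        "T": math.sqrt(phi_squared / math.sqrt((r - 1) * (c - 1))),
        "V": math.sqrt(phi_squared / min(r - 1, c - 1)),
    }


# Hypothetical result: X2 = 30 in a 3 x 4 table with N = 100
example = association_indexes(x2=30.0, n=100, r=3, c=4)
for name, value in example.items():
    print(name, round(value, 3))
```

With these inputs, T is smaller than V, which illustrates the overcorrection attributed to Tschuprow’s index when r and c differ.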
References
1. Siegel, 1988; 2. Bradley, 1968; 3. Ferguson, 1965; 4. Edgington, 1969; 5. Kendall, 1990; 6. Tate, 1957; 7. Conover, 1980; 8. Sprent, 1993; 9. Guttman, 1941; 10. Goodman, 1954; 11. Kohout, 1974; 12. Everitt, 1977; 13. Makuch, 1989; 14. Feinstein, 1990b; 15. Feinstein, 1996; 16. Sawada, 1992; 17. Simon, 1978; 18. Kendall, 1955; 19. Griffiths, 1980; 20. Noether, 1981; 21. Greene, 1981; 22. Calandra, 1991; 23. Ramirez, 1992; 24. Sabia, 1992; 25. Robbins, 1993; 26. Follman, 1992; 27. Armitage, 1987; 28. Poole, 1984; 29. Detsky, 1984; 30. Moses, 1984; 31. Boucher, 1998; 32. Liu, 1998; 33. Kallen, 1999; 34. Freeman, 1965; 35. Wright, 1992; 36. Cicchetti, 1982; 37. McLean, 1982; 38. Freeman, 1980.
Exercises
27.1. These questions all refer to Figure 27.2 in the text. The two verbatim descriptions here are from pertinent sections of the research report.
Methods: “Measures of association were calculated using either simple linear regression or the Spearman rank correlation coefficient, when appropriate. All reported significance levels are two-sided.”
Results: “In the 33 nonsurvivors, as well as in the subgroup of 21 patients who died of irreversible septic shock, the serum concentrations of IL-6 measured at study entry correlated inversely with the duration of survival (r = −0.51, p = 0.004 and r = −0.62, p = 0.005, respectively; Spearman rank correlation coefficient). In particular, of nine patients who died of irreversible septic shock within the first 24 hours of study, all but one had a very high concentration of IL-6 (median: 64 ng/mL, range: 0.26 to 305 ng/mL) at study entry.”
27.1.1. Because the Spearman coefficient is for correlation, not regression, how do you think the apparent regression line was determined? Do you agree with the result? If not, why not?
27.1.2. Do you like the way the graph is arranged? If not, what would you prefer?
27.1.3. Does the graph have any obvious errors or inconsistencies? If so, what are they?
27.2. In a study of diagnostic fashions among psychiatrists, the diagnoses were compared for two groups of 145 patients each in a London and a New York hospital. In London, 51 patients were diagnosed as schizophrenic and 67 as affectively ill. The corresponding numbers of diagnoses in New York were, respectively, 82 and 24. The remaining diagnoses at each institution were classified as miscellaneous.
27.2.1. What would you suggest as the best statistical index to summarize this set of results?
27.2.2. To express concordance for these data, how should kappa be weighted?
27.3. In a study of prognostic stratification, a cohort with 5-year survival rate of 85/221 = 38% could be divided into two different staging systems having the following results:
| Stage | System A | System B |
| I | 70/122 (57%) | 36/50 (72%) |
| II | 12/46 (26%) | 40/85 (47%) |
| III | 3/53 (6%) | 9/86 (10%) |
27.3.1. Which of these two systems of staging would you prefer “intuitively,” i.e., by judgmental inspection of the data? Why?
27.3.2. The investigators would like some quantitative indexes to summarize and compare the results of the two systems. What are the values of the slope and chi-square test for linear trend?
27.3.3. Give the results for at least three additional quantitative indexes that could be applied to each system.
27.3.4. Which of the indexes in 27.3.2 and 27.3.3 do you believe is best for citing these results? Has it made you change your mind about the answer to 27.3.1? Why?
27.4. Using any sources at your disposal (available journals, computerized-literature search, etc.) find a published paper that used any one of the additional indexes discussed in this chapter. Comment briefly on what was done and whether you think it was appropriate. Alternatively, find a published paper in which one of the indexes discussed in this chapter should have been used, but was not. If you choose this option, justify your reason for believing that the original procedure was “wrong” and for preferring the alternative index. With either option, enclose a copy of the selected publication (or a suitable abstract thereof).
28
Non-Targeted Analyses
CONTENTS
28.1 Sources of Complexity
28.1.1 Direction of Relationship
28.1.2 Number of Variables and Symbols
28.1.3 Scales of Variables
28.1.4 Patterns of Combination
28.2 Basic Concepts of Non-Targeted Analyses
28.2.1 “Intellectual Dissection”
28.2.2 Face Validity
28.3 Algebraic Methods
28.3.1 Analytic Goals and Process
28.3.2 Operational Strategies and Nomenclature
28.3.3 Rotations
28.3.4 Factor vs. Component Analysis
28.3.5 Published Examples
28.3.6 Biplot Illustrations
28.4 Cluster Analysis
28.4.1 Basic Structures
28.4.2 Operational Strategies
28.4.3 Published Examples
28.4.4 Epidemiologic Clusters
28.5 Correspondence Analysis
28.6 Psychometric Attractions
28.7 Scientific Problems and Challenges
References
Although this chapter is concerned mainly with “non-targeted” forms of analysis, the first section describes sources of complexity that can occur here or in any multivariable and multicategorical procedures.
28.1 Sources of Complexity
The brave new statistical world we are about to enter can be catalogued according to diversity in at least four entities that have appeared before in relatively simple arrangements. The entities that now produce additional complexities are the direction of relationships, the number (and symbols) of variables, the scales of the variables, and the patterns of combination.
28.1.1 Direction of Relationship
The distinction between targeted and non-targeted directions is a fundamental axis of classification for statistical procedures. All of the new tactics about to be discussed can be catalogued, according to their direction, as either targeted or non-targeted. In the previous discussion of targeted associations, an
