20.5.2.2 “Squared-Deviation” Method — In an alternative method,17 partial disagreements are weighted according to “the square of the deviation of the pair of observations from exact agreement.” With the previous method of categorical-distance scoring, the partial agreements in a 5-category scale might be weighted as 4, 3, 2, 1, and 0. With the “squared-deviation” method, the corresponding partial disagreements would be rated as 0, 1, 4, 9, and 16. The disagreement ratings can be converted to a “unitary scale” when divided by the maximum rating. Thus, if the maximum is 9, the foregoing ratings would be 0, .11, .44, and 1. If the maximum is 16, the ratings would be 0, .0625, .25, .5625, and 1.
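The conversion to a unitary scale can be sketched as follows (the variable names are illustrative, not from the text):

```python
# Convert squared-deviation disagreement ratings to a "unitary scale"
# by dividing each rating by the maximum rating. Here a 5-category
# scale gives categorical distances 0..4 and ratings 0, 1, 4, 9, 16.
distances = [0, 1, 2, 3, 4]
ratings = [d ** 2 for d in distances]          # 0, 1, 4, 9, 16
unitary = [r / max(ratings) for r in ratings]  # 0, .0625, .25, .5625, 1
print(unitary)  # [0.0, 0.0625, 0.25, 0.5625, 1.0]
```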
The argument offered17 in favor of the “squared-deviation” proposal is that it is “an intuitively appealing standard usage,” with the advantage that “weighted kappa calculated with these weights approximately equals the product-moment correlation coefficient.” Since indexes of concordance were developed to avoid possibly misleading results from ordinary correlation coefficients, the “advantage” claimed for the “squared-deviation” method may not be “intuitively appealing” to all potential users.
20.5.2.3 “Substantive-Impact” Method — Both the categorical-distance and the “squared-deviation” methods of weighting are chosen according to arbitrary statistical strategies. In a third scheme, the weights are assigned according to the substantive impact of the disparities. For example, if the neoplastic suspiciousness of cervical pap smears is rated as 0, 1, 2, 3, 4, a one-category disagreement between 3 and 4 will have fewer consequences than a disagreement between 0 and 1. With either a 3 or 4 rating, the patient will receive a further “workup” that probably includes colposcopy and cervical biopsy. A disagreement between 0 and 1, however, can lead to the clinical difference between simple reassurance vs. invasive further testing. Accordingly, a discrepancy between 0 and 1 may be given a much greater “penalty” than any of the other one-category disagreements.
This type of problem is particularly likely to occur when the ordinal rating scale begins with a null, negative, or absent category, followed by categories that have different degrees of a “positive” rating. These null-based scales — such as 0, 1, 2, 3, 4 for pain or for pap smears — differ from other ordinal scales, such as I, II, III for TNM stage of cancer, where something “positive” is always present. In TNM stages, however, a discrepancy between I and II, i.e., between a localized state and regional spread, might be regarded as more (or less) serious than the regional vs. distant metastases implied by a discrepancy between II and III.
Cicchetti18 has developed a formal procedure for assigning substantive weights for disagreements in what he calls dichotomous-ordinal scales, where the extreme rating at one end has a special connotation (usually “normal” vs. a set of “abnormal” ratings). In continuous-ordinal scales, the disagreements in adjacent increments have equal connotations. As long as the weights are established before the statistics are computed, the assigned weights seem scientifically reasonable and offer increased flexibility in the evaluation procedure.
A substantive-impact weighting procedure can also be used for appraising concordance in various other situations. For example, substantive weights can be assigned (as noted later in Section 20.6.2) for disagreements among nominal categories. When errors are assessed in diagnostic marker tests (see Chapter 21), a false positive rating may sometimes be given a greater (or lesser) “penalty” than a false negative rating.
20.5.2.4 Other Weightings — Cicchetti18 has proposed another weighting scheme that assigns weights as proportions between 0 and 1, and Fleiss9 mentions two other arrangements beyond those already cited.
20.5.3 Proportion of Weighted Agreement
To form a summary score, the weighted indexes for individual agreements or disagreements are multiplied by frequency counts; the products are added and then converted to suitable proportions.
In the categorical-distance method, suppose fi is the frequency count in each cell and wi is the corresponding weight of agreement. The total agreement score for the table will be Σ fiwi. For perfect agreement, all of the N pairs of ratings would be in the diagonal cells getting weights of g − 1; and so the perfect score would be N(g − 1). Thus, the weighted proportion of agreement would be
© 2002 by Chapman & Hall/CRC
pw = Σ fiwi / [N(g − 1)]     [20.9]
To illustrate this procedure in the five ordinal ranks of Table 20.6, the perfect-agreement cells all receive weights of 4 for the 134 (= 91 + 20 + 4 + 10 + 9) frequencies in those locations. For the remaining cells, disparities of one category occur in the locations of (0,1), (1,0), (1,2), (2,1), (2,3), (3,2), (3,4), and (4,3). Each disparity receives a weight of 3 for the 84 (= 28 + 33 + 11 + 3 + 1 + 5 + 2 + 1) frequencies in those locations. Two-category disparities, receiving weights of 2, occur in the (0,2), (2,0), (1,3), (3,1), and (4,2) locations, which contain 23 (= 11 + 6 + 3 + 1 + 2) frequencies. Three-category disparities, with weights of 1, occur with 5 (= 1 + 2 + 2) frequencies in the (0,3), (3,0), and (1,4) locations. The maximum possible 4-category disparity, which would receive a weight of 0, did not occur in the (0,4) or (4,0) cells of the table. Consequently, the total weighted agreement score is (134 × 4) + (84 × 3) + (23 × 2) + (5 × 1) = 839. Because a perfect score would have been 246 × 4 = 984, the proportion of weighted agreement for Table 20.6 is 839/984 = 85%. (Without weighting, the ordinary proportion of agreement would have been 134/246 = 54%.)
With the squared-deviation method, the weighted score for each categorical disagreement would be qi; the frequency of scores would be fi; the score for maximum disagreement in N ratings would be N(g − 1)²; and the score for proportion of weighted agreement would be

pw = 1 − {Σ fiqi / [N(g − 1)²]}     [20.10]
To illustrate this process in the data of Table 20.6, the total score for disagreement would be (134 × 0) + (84 × 1) + (23 × 4) + (5 × 9) + (0 × 16) = 221. The score for perfect disagreement would be 246 × 16 = 3936. The proportion of weighted agreement would be 1 − (221/3936) = .94. (With the “categorical distance” method, the corresponding value was .85.)
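Both computations for Table 20.6 can be sketched from the frequency counts grouped by categorical distance, as tallied in the text (the variable names are illustrative):

```python
# Frequencies of rating pairs in Table 20.6, grouped by the categorical
# distance between the two raters (0 = perfect agreement, ..., 4 = maximum).
freq_by_distance = {0: 134, 1: 84, 2: 23, 3: 5, 4: 0}
N = sum(freq_by_distance.values())  # 246 pairs
g = 5                               # number of ordinal categories

# Formula [20.9]: categorical-distance weights are w = (g - 1) - distance.
score = sum(f * (g - 1 - d) for d, f in freq_by_distance.items())
pw_categorical = score / (N * (g - 1))              # 839 / 984, about .85

# Formula [20.10]: squared-deviation disagreement scores are q = distance**2.
disagreement = sum(f * d ** 2 for d, f in freq_by_distance.items())
pw_squared = 1 - disagreement / (N * (g - 1) ** 2)  # 1 - 221/3936, about .94
```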
As another example of the weighted scoring process, suppose two clinical pharmacologists, A and B, have independently appraised 30 cases of suspected adverse drug reaction,19 rating each case as definite, probable, possible, or unlikely. The results are shown in Table 20.8.
TABLE 20.8
Agreement Matrix of Two Clinical Pharmacologists, A and B, in Rating the Likelihood of Adverse Drug Reactions in 30 Suspected Cases
[Data from Chapter Reference 19]
                        Rater A
Rater B     Definite   Probable   Possible   Unlikely   Total
Definite        1          2          0          0         3
Probable        1          5          3          1        10
Possible        1          4          5          2        12
Unlikely        1          1          1          2         5
Total           4         12          9          5        30
The index of percentage agreement for the appropriate diagonal cells is

[(1 + 5 + 5 + 2)/30](100) = 43.3%
For the index of weighted percentage agreement with categorical-distance scoring, a weight of 3 is given for perfect agreement among the four ordinal categories. A one-category disagreement is given a weight of 2; a two-category disagreement is weighted as 1; and the maximum disagreement (of three categories) is weighted as 0. Using Formula [20.9], the index of weighted percentage agreement is calculated as:
[(1 + 5 + 5 + 2)(3) + (2 + 3 + 2 + 1 + 4 + 1)(2) + (0 + 1 + 1 + 1)(1) + (0 + 1)(0)] / [(30)(3)] = 75.6%
This value is considerably higher than the 43.3% unweighted index of percentage agreement. With the squared-deviation scoring method, using Formula [20.10], the value of Σ fiqi is (1 + 5 + 5 + 2)(0) + (2 + 3 + 2 + 1 + 4 + 1)(1) + (0 + 1 + 1 + 1)(4) + (0 + 1)(9) = 13 + 12 + 9 = 34. The score for N(g − 1)² is 30 × 9 = 270. The value of pw is 1 − (34/270) = .874, which is again higher than the .756 calculated with the categorical-distance method.
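All three indexes for Table 20.8 can be reproduced directly from the agreement matrix; a sketch (the variable names are illustrative):

```python
# Agreement matrix of Table 20.8: rows = Rater B, columns = Rater A,
# in the order definite, probable, possible, unlikely.
matrix = [[1, 2, 0, 0],
          [1, 5, 3, 1],
          [1, 4, 5, 2],
          [1, 1, 1, 2]]
g = 4
N = sum(sum(row) for row in matrix)  # 30 cases

# Unweighted proportion of agreement: the diagonal cells only.
p_simple = sum(matrix[i][i] for i in range(g)) / N          # 13/30, 43.3%

# Formula [20.9] with categorical-distance weights w = (g-1) - |i-j|.
score = sum(matrix[i][j] * (g - 1 - abs(i - j))
            for i in range(g) for j in range(g))
p_categorical = score / (N * (g - 1))                       # 68/90, 75.6%

# Formula [20.10] with squared-deviation scores q = (i-j)**2.
disagreement = sum(matrix[i][j] * (i - j) ** 2
                   for i in range(g) for j in range(g))
p_squared = 1 - disagreement / (N * (g - 1) ** 2)           # 1 - 34/270, .874
```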
20.5.4 Demarcation of Ordinal Categories
In all of the foregoing methods, the analyst began with an array of ordinal categories, and then chose statistical weights for the different degrees of partial agreement or disagreement. A more fundamental scientific challenge can arise, however, in demarcating the array of categories before all the statistical work begins.
Getting commensurate scales is a common problem in observer variability because the raters may use different expressions for their routine activities. Thus, one rater may refer to cardiac size in four categories as normal, slightly enlarged, moderately enlarged, or substantially enlarged, whereas another rater may use six categories, which include the previous four plus borderline (after normal) and massively enlarged (after substantially enlarged). The problem is magnified when the raters use additional categories formed by qualifying expressions such as compatible with normal or consistent with slightly enlarged.
The investigator’s work will be eased if the raters can be persuaded to agree on a common scale of standard categories before the research begins, but the raters may then protest that the standardized scale is an artifact and that the results will not accurately reflect what happens in routine clinical practice. Because every research study creates artificial conditions, this problem cannot be avoided if observer variability is to be investigated at all. Recognizing that perfection may be impossible, the investigator can simply try to use a maximum of clinical “common sense” to achieve a best possible consensus from the participating observers. Sometimes, when several options are available for demarcating or consolidating categories, the investigator can analyze results for each option separately. If the results differ substantially, each set can be reported separately.
For example, in a study of observer variability in mammography,20 the original scale of four diagnostic categories was normal; abnormal, benign; abnormal, indeterminate (i.e., uncertain whether benign or probably malignant); and abnormal, suspicious of cancer. For one set of analyses using three ordinal categories, the investigators consolidated the middle two categories into a single benign/indeterminate rating. Because some of the mammographers believed that the indeterminate group would receive essentially the same subsequent “workup” as the suspicious-of-cancer group, another set of three-category analyses was done with the latter two groups collapsed into a single category. Finally, in yet another analysis, all four of the original categories were retained.
20.5.5 Artifactual Arrangements
A different set of problems arises if the observers are accustomed to certain operating conditions that are altered for the research. For example, pathologists and radiologists regularly see accounts of a patient’s history before they examine the pertinent slides or films. If the history, suspected as a source of biased interpretation,21 is not supplied, the observers may then complain that the research process did not conform to reality.
As another example, when the initial interpretation is equivocal, a final conclusion may not be reached until pathologists have ordered and examined additional slides or stains, or until radiologists have checked additional views or other films. If the research arrangement forces the observers to reach a final conclusion without access to the additional options, the investigative process may again be deemed unfair.
This problem is also seldom avoidable within the pragmatic constraints of realities for both ordinary practice and concordance research. The practitioners who volunteer to participate in studies of observer variability may themselves be an unrepresentative sample; and the investigator must hope to come as close as possible to the “average” conditions of clinical practice, while acknowledging that individual idiosyncrasies may have been suppressed and that no investigation can be done without certain artifacts.
The Heisenberg principle (which states that the act of observation may change the observed object) pertains not only to the relatively simple phenomena of the world of physics, but particularly to the more complex phenomena of observer variability. Sometimes, however, the artifacts can be effectively used for testing certain hypotheses. Thus, to determine whether the patient’s history is indeed a source of bias, the observers may be asked to report on selected specimens that have been submitted on two occasions, with and without an accompanying history.21 In one study of radiographic interpretations of change in a sequence of chest films,22 the radiologists on one occasion reviewed films of the same patients arranged in chronologic succession and, on another occasion, in a random chronology.
20.5.6 Weighted Kappa
Regardless of the selected categories and weighting scheme, however, the index of weighted percentage agreement does not make provision for the agreement that might be expected by chance. If a correction factor is introduced for chance agreement, the best index for concordance in ordinal data is weighted kappa, κw. It is derived from κ after weights are assigned for the magnitude of observed disagreements. κw is easier to calculate when based on q, the proportion of disagreements, rather than p, the proportion of agreements. Since q = 1 − p,
κw = 1 − (q′o/q′c)     [20.11]
where q′o = observed proportion of weighted disagreements and q′c = chance-expected proportion of weighted disagreements. (The “primes” are added to each q to indicate that the quantities are weighted.)
The most commonly used strategy of indexing disagreements for weighted kappa is a reverse counterpart of the categorical-distance method. The varying degrees of disagreement are given “reverse” weights as follows: 0 = perfect agreement (e.g., A and B both report “moderate” pain); 1 = one-category disagreement (e.g., “severe” vs. “moderate”); 2 = two-category disagreement (e.g., “mild” vs. “severe”), and so on up to a maximum weight of g − 1, where g is the number of categories in the ordinal scale. (With the other two methods of weighting, the corresponding sequence would be 0, 1, 4, 9,… for the “squared deviation” method or whatever weights are assigned for the “substantive-impact” method).
In Table 20.9, Table 20.8 has been modified to include the chance-expected cell frequencies (fc) and assigned cell weights (wi) of “categorical distance” that would be used in addition to the observed cell frequencies (fi) in the calculation of κ w. As in the case of unweighted kappa, fc is calculated by multiplying the appropriate marginals, i.e., row total by column total, and then dividing by N. Proportions of weighted disagreements are computed by multiplying cell frequencies (fi or fc) by the disagreement weight (wi) assigned to that cell, summing these values over all cells, and then dividing by N.
For the calculation of weighted kappa in Table 20.9,

q′o = Σ wifi / N
= [(0)(1 + 5 + 5 + 2) + (1)(1 + 4 + 1 + 2 + 3 + 2) + (2)(1 + 1 + 0 + 1) + (3)(1 + 0)] / 30
= 22/30 = .733
and
q′c = Σ wifc / N
= [(0)(0.4 + 4.0 + 3.6 + 0.8) + (1)(1.3 + 4.8 + 1.5 + 1.2 + 3.0 + 2.0) + (2)(1.6 + 2.0 + 0.9 + 1.7) + (3)(0.7 + 0.5)] / 30
= 29.8/30 = .993; and

κw = 1 − (.733/.993) = +.262.
TABLE 20.9
Agreement Matrix Containing Observed and Expected Frequencies and Assigned Weights for Ratings of Adverse Drug Reactions in Table 20.8. [From Chapter Reference 19]
                          Rater A
Rater B      Definite     Probable     Possible     Unlikely
Definite     1 (0.4)\0    2 (1.2)\1    0 (0.9)\2    0 (0.5)\3     3 = r1
Probable     1 (1.3)\1    5 (4.0)\0    3 (3.0)\1    1 (1.7)\2    10 = r2
Possible     1 (1.6)\2    4 (4.8)\1    5 (3.6)\0    2 (2.0)\1    12 = r3
Unlikely     1 (0.7)\3    1 (2.0)\2    1 (1.5)\1    2 (0.8)\0     5 = r4
              4 = c1      12 = c2       9 = c3       5 = c4      30 = N
Note: Numbers in cells represent observed frequencies (fi); numbers in parentheses indicate chance-expected cell frequencies (fc); numbers following the backslash are the assigned weights (wi) for disagreements.
The quantitative magnitude of κw is interpreted in the same way as the magnitude of unweighted kappa. The values range from −1 to +1, with 0 representing chance-expected weighted agreement.
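The whole κw computation for Table 20.9 can be sketched as follows (the names are illustrative; exact chance-expected frequencies are used, so the result differs in the third decimal from the hand calculation above, which rounded them to one decimal place):

```python
# Agreement matrix of Table 20.9 (observed frequencies only).
matrix = [[1, 2, 0, 0],
          [1, 5, 3, 1],
          [1, 4, 5, 2],
          [1, 1, 1, 2]]
g = 4
N = sum(map(sum, matrix))                       # 30
rows = [sum(row) for row in matrix]             # r1..r4 marginals
cols = [sum(col) for col in zip(*matrix)]       # c1..c4 marginals

# Disagreement weights: w = |i - j| (0 on the perfect-agreement diagonal).
q_obs = sum(matrix[i][j] * abs(i - j)
            for i in range(g) for j in range(g)) / N       # 22/30 = .733
# Chance-expected cell frequency: (row total)(column total)/N.
q_chance = sum(rows[i] * cols[j] / N * abs(i - j)
               for i in range(g) for j in range(g)) / N    # about .989

kappa_w = 1 - q_obs / q_chance                             # about +.26
```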
20.5.7 Correlation Indexes
Analogous to the use of φ for describing trends between two dichotomous categorical variables, nonparametric correlation indexes such as rs (Spearman’s rho) and τ (Kendall’s tau) — which are discussed later in Chapter 27 — have sometimes been applied for analyzing concordance in ordinal data. Denoting trend rather than agreement, however, these indexes refer to correlation or general relatedness, not concordance.
In particular, the correlation indexes ignore systematic bias. If Observer B consistently assigns higher rankings than Observer A, while using the same order of rankings, the correlation will be excellent although agreement is poor. On a four-category pain scale, for example, if B reports “mild” pain whenever A reports “none,” “moderate” whenever A reports “mild,” and “severe” whenever A reports “moderate” or “severe,” the correlation will be quite high, despite only a modest degree of concordance.
20.6 Agreement in Nominal Data
Agreement has seldom been assessed for circumstances in which the observer chooses one of a series of ≥ 3 nominal categories. In one example, citation of four possible body sites of melanoma was compared, with ordinary (non-weighted) indexes of agreement, for the locations listed in physicians’ office records and in a hospital cancer registry.23 The usual situation that evokes a choice among nominal categories, however, is a test of conformity between a series of diagnostic estimates and the “gold standard” results. The same statistical index is used for expressing either concordance or conformity in
nominal data. Table 20.10 shows the agreement matrix for a hypothetical series of clinical diagnoses of liver disease and the “gold standard” results of liver biopsy.
TABLE 20.10
Nominal Diagnoses by Clinician and by Liver Biopsy in Hypothetical Data Set
Clinical                   Results of Liver Biopsy
Diagnosis          Hepatitis   Cirrhosis   Cancer in Liver   Other   Total
Hepatitis              10           3              2            6      21
Cirrhosis               4          20              3            2      29
Cancer in Liver         3           5             13            4      25
Other                   2           1              1           35      39
TOTAL                  19          29             19           47     114
Because nominal data cannot be ranked, the only straightforward index for Table 20.10 is the proportional agreement, which would be expressed as (10 + 20 + 13 + 35)/114 = 78/114 = 68%.
20.6.1 Substantively Weighted Disagreements
Although the individual values cannot be ranked, certain pairs of nominal disagreements may be weighted substantively as “better” or “worse” than others. For example, suppose we were estimating a person’s birthplace in the nominal category of Alabama, Alaska, Arizona,…, Wyoming, for one of the 50 United States. For these predictions, a set of discrepancies might be weighted on the basis of geographic proximity. Thus, if someone were born in New Hampshire, the estimate of Vermont would be much closer than the estimate of California.
An arbitrary scheme of substantive weights might also be established for disagreements in the diagnoses of Table 20.10. Thus, a disagreement with the diagnosis of cancer might be rated as a much worse discrepancy than disagreements for any other pair of diagnoses. In a study24 of variability in ratings of cell types for the histopathology of lung cancer, a disagreement of well-differentiated epidermoid vs. well-differentiated adenocarcinoma was weighted more heavily than a disagreement of poorly-differentiated adenocarcinoma vs. large cell anaplastic.
Unless such substantive ad hoc weights are established, the only descriptive index for nominal data is the simple proportion of agreement. Because such studies are uncommon, the results are usually reported without any adjustments for chance. If desired, however, kappa statistics can be calculated by converting the nominal scales into a series of dichotomous scales such as diagnosis A vs. all others, B vs. all others, or C vs. all others. Values of proportional agreement and kappa could then be determined for each of the 2 × 2 tables created by the dichotomous scales. The final result would be the medians (or means) of the series of values for proportional agreement and kappa.
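The dichotomization strategy just described can be sketched for Table 20.10 as follows (function and variable names are illustrative; unweighted kappa is computed for each “category vs. all others” 2 × 2 table with the standard formula κ = (po − pc)/(1 − pc)):

```python
# Table 20.10: rows = clinical diagnosis, columns = biopsy result,
# in the order hepatitis, cirrhosis, cancer in liver, other.
matrix = [[10, 3, 2, 6],
          [4, 20, 3, 2],
          [3, 5, 13, 4],
          [2, 1, 1, 35]]

def kappa_one_vs_rest(m, k):
    """Unweighted kappa for category k vs. all other categories."""
    n = sum(map(sum, m))
    a = m[k][k]                                  # both sources say category k
    row_k = sum(m[k])
    col_k = sum(r[k] for r in m)
    d = n - row_k - col_k + a                    # both sources say "other"
    p_obs = (a + d) / n
    p_chance = (row_k * col_k + (n - row_k) * (n - col_k)) / n ** 2
    return (p_obs - p_chance) / (1 - p_chance)

kappas = [kappa_one_vs_rest(matrix, k) for k in range(4)]
median_kappa = sum(sorted(kappas)[1:3]) / 2      # median of the four values
```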
20.6.2 Choice of Categories
If the observers are accustomed to a free range of expressions, the choice of rating-scale categories can be even more difficult for nominal than for ordinal challenges in observer variability. For example, when interpreting the cell types of a series of histologic specimens of lung cancer, some of the pathologists24 used sparse numbers of categories such as epidermoid carcinoma, adenocarcinoma,…, whereas others used many more categories with additional qualifying details such as well-differentiated, moderately well-differentiated, and poorly differentiated for the epidermoid group and yet other details (acinar, papillary, bronchiolar) for the adenocarcinomas.
In such situations, special analytic arrangements may be needed beyond the usual indexes for expressing agreement. In the study just cited, a special set of “spectral numbers” was calculated for the different designations the same slide might have received in both the intra-observer and interobserver interpretations.
20.6.3 Conversion to Other Indexes
In some nominal-rating circumstances, the results are converted into binary or summated expressions. For example, the nominal-category coding of ethnicity was expressed in binary indexes for each category vs. all others in a study of agreement between birth and death certificates.25 In another study,26 ten categories of possible differences were established between the ante-mortem clinical diagnoses and the post-mortem diagnostic decisions. The results of each category were given a point for agreement, no points for disagreement, and a blank for not pertinent. The points were then added to form a type of “batting average” called a concordance score.
A different type of summary score was used when Loewenson et al.,27 examining the consistency of pathologists’ assessments of cerebrovascular atherosclerosis, gave one point for each occlusion noted in the mounted specimens of a set of transversally cut pieces of cerebral arteries. The participating pathologists were then checked for agreement in their total point scores.
20.6.4 Biased Agreement in Polytomous Data
Indexes of bias in direction and magnitude are seldom sought for agreement matrixes expressed in polytomous (i.e., ≥ 3) categories of ordinal or nominal data. When desired, however, a modified McNemar index can be calculated from the rows and columns arranged appropriately around the main diagonal, which contains the cells of perfect agreement. The cells of disagreement are divided into an upper group, above the agreement diagonal, and a lower group below the diagonal. The sum of frequency counts in the upper disagreement cells is U and the corresponding sum in the lower cells is L.
For nominal data, the modified McNemar index is then |U − L|/(U + L). For example, in Table 20.10, the sum of the upper disagreement cells is U = 3 + 2 + 6 + 3 + 2 + 4 = 20. The corresponding lower sum is L = 4 + 3 + 5 + 2 + 1 + 1 = 16. The index of disagreement would be (20 − 16)/(20 + 16) = 4/36 = 11%. For calculating U and L in ordinal data, the distances from the diagonal, i.e., weighted disagreements, can be taken into account in a manner resembling the tactics used in Section 20.5.2.
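The modified McNemar index for Table 20.10 can be sketched as (variable names illustrative):

```python
# Modified McNemar index of biased agreement: |U - L| / (U + L),
# where U and L sum the cells above and below the perfect-agreement
# diagonal of Table 20.10.
matrix = [[10, 3, 2, 6],
          [4, 20, 3, 2],
          [3, 5, 13, 4],
          [2, 1, 1, 35]]
g = len(matrix)
U = sum(matrix[i][j] for i in range(g) for j in range(g) if j > i)  # 20
L = sum(matrix[i][j] for i in range(g) for j in range(g) if j < i)  # 16
index = abs(U - L) / (U + L)  # 4/36, about 11%
```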
20.7 Agreement in Dimensional Data
Expressing agreement for pairs of dimensional data would at first seem to be a relatively simple procedure. We want to know the individual discrepancies, their summary, an indication of inter-rater bias (whether one rater is regularly higher than the other), and an indication of directional bias (whether the discrepancies change with magnitude of the ratings). If the ratings from Raters A and B are cited respectively as Xi and Yi, these goals can readily be achieved from examining and appropriately analyzing the increments, di = Xi – Yi.
This simple approach has been ignored for many years, however. Instead, the paired data are usually analyzed with various types of correlation analysis that are unsatisfactory, as discussed in Section 20.1, because they express trends rather than agreements. Furthermore, at least two different types of correlation analysis have been used — the ordinary least-squares regression procedure discussed in Chapter 19, and an intraclass correlation that makes a brief debut in this chapter, with further discussion in Chapter 29.
20.7.1 Analysis of Increments
The increments noted as di = Xi − Yi can be evaluated directly to offer a summary of results and to provide indications of bias in raters and directions.
20.7.1.1 Direct Increments — Consider the data shown in Table 20.11 comparing two methods, A and B, for measuring serum sodium. A quick look at the data suggests that the methods yield reasonably close results, but a more careful inspection of the individual increments shows that they are
respectively −5, −3, −3, −2, −5, −7, −2, −6, −2, and −5 for the values of Method A − Method B in subjects 1 through 10. This examination immediately shows that Method B, in this group of data, always has higher values than Method A. For the 10 increments, Σ di = −40 and d̄ = −4.0.
To get an idea of relative magnitude for this disparity, we can calculate its relationship to the actual measurements. The mean of the values is 140.0 in Method A and 144.0 in Method B, with an overall mean of 142.0. The ratio of the average disparity to the average measured value will be −4.0/142.0 = –.028, which is about a 3% difference.
TABLE 20.11
Comparison of Two Methods (A and B) for Determining Serum Sodium Concentration (in mEq/l) in 10 Subjects

Subject No.     Method A     Method B     Subject Means
     1             136          141           138.5
     2             142          145           143.5
     3             129          132           130.5
     4             148          150           149.0
     5             140          145           142.5
     6             152          159           155.5
     7             142          144           143.0
     8             134          140           137.0
     9             139          141           140.0
    10             138          143           140.5
Method means      140.0        144.0      Overall mean: 142.0
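The direct-increment analysis of Table 20.11 can be sketched as follows (the variable names are illustrative):

```python
# Serum sodium values from Table 20.11 for the two methods.
method_a = [136, 142, 129, 148, 140, 152, 142, 134, 139, 138]
method_b = [141, 145, 132, 150, 145, 159, 144, 140, 141, 143]

# Individual increments d_i = X_i - Y_i; all are negative here,
# showing that Method B consistently reads higher than Method A.
d = [a - b for a, b in zip(method_a, method_b)]   # -5, -3, ..., -5
mean_d = sum(d) / len(d)                          # -40/10 = -4.0

# Relative magnitude: ratio of mean increment to overall mean value.
overall_mean = (sum(method_a) + sum(method_b)) / (2 * len(method_a))  # 142.0
ratio = mean_d / overall_mean                     # about -.028, a 3% difference
```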
20.7.1.2 Absolute and Squared Increments — The main problem in examining the means of direct increments is that negative and positive values may cancel one another in situations less extreme than that of Table 20.11. For example, suppose the compared rating systems produce the following values for three persons:
Person     Rating by X     Rating by Y     Deviation: X − Y
  1            100              80                 20
  2             85              95                −10
  3             63              73                −10
If we merely added the three deviations, as Σ di = Σ (Xi − Yi), the result would be 0, which falsely suggests perfect agreement. To eliminate this effect, we can examine either the absolute increments or the squared increments. For absolute increments here, Σ |di| = 40 and their mean would be 40/3 = 13.3. For squared deviations, the sum would be Σ di² = 600. The root mean square, expressed as √(Σ di²/N), would be √(600/3) = 14.1.
Because the mean of the six measured values is 82.7, the ratio of (mean discrepancy)/(mean measured value) would be either 13.3/82.7 = .16 or 14.1/82.7 = .17. For the data in Table 20.11, the mean absolute deviation is the same as the mean of direct deviations (because they all have the same sign). The root mean squared deviation is √(190/10) = 4.36, which is close to the direct value of 4.0.
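These two summaries can be sketched for the three-person example (names illustrative):

```python
# Mean absolute deviation and root mean square deviation, which avoid
# the cancellation of positive and negative signed increments.
x = [100, 85, 63]
y = [80, 95, 73]
d = [a - b for a, b in zip(x, y)]                 # 20, -10, -10; sum is 0

mean_abs = sum(abs(v) for v in d) / len(d)        # 40/3, about 13.3
rms = (sum(v ** 2 for v in d) / len(d)) ** 0.5    # sqrt(600/3), about 14.1

mean_value = (sum(x) + sum(y)) / 6                # 82.7 for the six values
ratios = (mean_abs / mean_value, rms / mean_value)  # about .16 and .17
```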
20.7.1.3 Bias in Raters — The bias in raters in Table 20.11 was shown by the mean deviation of −4.0. To indicate the stability of this result, we can put a 95% inner percentile range around it. Bland and Altman29 have suggested the name “limits of agreement” for this interval. If the data are Gaussian, the interval will extend ±1.96 s.d. above and below the mean discrepancy. In Table 20.11, the standard
deviation of the discrepancies is 1.83 and so the “limits of agreement” will be −4.0 ± (1.96)(1.83), a zone that goes from −7.59 to −.41.
The foregoing “limits” were descriptive, being obtained with the standard deviation, not the standard error. For stochastic confirmation of the bias, the standard error of the increments would be 1.83/√10 = .579. A 95% confidence interval would denote stochastic significance by excluding 0 in the extent of −4.0 ± (1.96)(.579), which goes from −5.13 to −2.87.
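Both the descriptive “limits of agreement” and the stochastic confidence interval for the Table 20.11 increments can be sketched as follows (variable names illustrative; a Gaussian distribution is assumed, as in the text):

```python
# Increments (Method A - Method B) from Table 20.11.
d = [-5, -3, -3, -2, -5, -7, -2, -6, -2, -5]
n = len(d)
mean_d = sum(d) / n                               # -4.0

# Sample standard deviation (divisor n - 1).
var = sum((v - mean_d) ** 2 for v in d) / (n - 1)
sd = var ** 0.5                                   # about 1.83

# Descriptive "limits of agreement": mean +/- 1.96 s.d.
limits = (mean_d - 1.96 * sd, mean_d + 1.96 * sd)   # about (-7.58, -0.42)

# Stochastic 95% confidence interval: mean +/- 1.96 s.e.
se = sd / n ** 0.5                                # about .58
ci = (mean_d - 1.96 * se, mean_d + 1.96 * se)     # about (-5.13, -2.87)
```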
20.7.1.4 Bias in Zones of Data — An important directional problem arises from the magnitude of discrepancies in different zones of the data. For example, a difference of 10 units seems relatively small if the two measurements are 821 and 831, but not if the two measurements are 21 and 31. To deal with this problem, each discrepancy can be indexed as a proportion of the “correct” value (if a “gold standard” exists). In the absence of a gold standard, the mean of the two values, i.e., (X i + Yi)/2, can be the reference point.
The increments of Xi − Yi can then be plotted as a dependent variable in a graph where (Xi + Yi )/2 is the independent variable. If unbiased, the incremental values should vary randomly around zero from the smallest to largest values of (X i + Yi )/2. If a set of the increments becomes all positive or all negative in a particular zone, the measurements are biased in that zone.
Another potential problem in direction, however, is that the incremental values, although balanced around 0, may become much larger (or smaller) with increasing (or decreasing) magnitudes of (Xi + Yi)/2. The enlarging-discrepancy effect would suggest that the measurement process itself — rather than one of the raters — is biased, getting excessive disparities at the extreme values of measurement. A transfer to logarithmic values may sometimes help eliminate the problem.
For the data in Table 20.11, the plot of each di vs. the corresponding (Xi + Yi)/2 is shown in Figure 20.4. As expected, all of the increments have negative values, but our main concern here is whether their magnitudes are affected by the size of the corresponding mean values for (Xi + Yi)/2. An eye scan of the graph suggests very little relationship. The points above and below the mean increment of −4 are all within the Gaussian “limits-of-agreement” zone (from −7.59 to −.41) as (Xi + Yi)/2 increases.
FIGURE 20.4
Plot of increments vs. mean values of Xi and Yi for data in Table 20.11, with the overall mean increment marked. (Vertical axis: increments, −8 to 6; horizontal axis: mean values, 130 to 155.)
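The incremental (“limits-of-agreement”) analysis described above can be sketched in a few lines of code. This is an illustrative sketch rather than the book's own computation; the paired measurements below are hypothetical, not the Table 20.11 data.

```python
from statistics import mean, stdev

# Hypothetical paired measurements from two methods (not the Table 20.11 data)
x = [150, 142, 138, 155, 147, 133, 151, 144, 139, 148]
y = [146, 139, 133, 150, 143, 130, 146, 141, 134, 144]

d = [xi - yi for xi, yi in zip(x, y)]        # increments Xi - Yi
m = [(xi + yi) / 2 for xi, yi in zip(x, y)]  # reference points (Xi + Yi)/2

d_bar = mean(d)
s_d = stdev(d)
# Gaussian "limits of agreement": mean increment +/- 1.96 standard deviations
lower, upper = d_bar - 1.96 * s_d, d_bar + 1.96 * s_d

# Crude check for bias in zones: do the increments in the upper half of the
# mean values all fall on one side of the overall mean increment?
upper_zone = [di for di, mi in zip(d, m) if mi > mean(m)]
zone_biased = (all(di > d_bar for di in upper_zone)
               or all(di < d_bar for di in upper_zone))
```

Plotting each element of `d` against the corresponding element of `m` reproduces the kind of display shown in Figure 20.4.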
20.7.2 Analysis of Correlation
The analysis of increments can indicate everything we want to know about pairs of dimensional data, but it has often been avoided in favor of correlation analysis. In the latter procedure, the sets of points for {Xi, Yi} are plotted on a graph similar to Figure 20.2, and then receive a set of regression-correlation calculations. Readers can then be impressed with the relatively close fit of the line and with the high values of r (such as .978 in Figure 20.2).
© 2002 by Chapman & Hall/CRC
The regression approach can yield a reasonably satisfactory index of agreement provided that the line has a slope of 1 (indicating a 45° angle) and an intercept at the origin. If either of these values deviates significantly from that goal, agreement may be poor even though the trend is excellent. Besides, the regression/correlation approach does not indicate the relative magnitude of individual discrepancies or the direction of bias.
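The slope-and-intercept criterion can be checked directly with an ordinary least-squares fit. The sketch below uses hypothetical paired readings (not from the text) in which the trend is close, but one method runs systematically about 4 units below the other.

```python
from statistics import mean

def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    xb, yb = mean(x), mean(y)
    sxy = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y))
    sxx = sum((xi - xb) ** 2 for xi in x)
    b = sxy / sxx
    a = yb - b * xb
    return b, a

# Hypothetical readings: excellent trend, but poor agreement because
# method Y runs about 4 units below method X.
x = [150, 142, 138, 155, 147, 133, 151, 144, 139, 148]
y = [146, 139, 133, 150, 143, 130, 146, 141, 134, 144]
slope, intercept = ols_slope_intercept(x, y)
# Agreement requires slope close to 1 AND intercept close to 0; here the
# fitted line is displaced from the line of identity by the systematic offset.
```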
Although the superiority of incremental analysis is sometimes regarded as a “modern” discovery, the procedure was used more than 60 years ago. When the distinguished Indian statistician P. C. Mahalanobis30 was examining the “question of correlation between errors of observation … in physical measurements,” he considered the increment and standard error of the means in the pairs of observations. (Mahalanobis also noted that biased observations occurred more often “than one would expect from the normal theory.”)
20.7.3 Analysis of Intraclass Correlation
When R. A. Fisher proposed7 the intraclass correlation coefficient, he was interested in single measurements of paired entities (such as two brothers) rather than paired measurements of a single entity (such as serum sodium). Because the paired entities did not have an assigned position (with Method A as the Xi values and Method B as the Yi values), either entity could be regarded as Xi or Yi. Fisher’s approach to this situation was to list the entities both ways. He assembled N pairs of an {Xi, Yi} arrangement and then reversed their order to form N interchanged pairs, with each Yi in the Xi position and vice versa. He gave the name intraclass correlation coefficient (ICC) to the ordinary correlation coefficient calculated for the 2N pairs of data. The procedure is lucidly described and well illustrated by Robinson.31 For example, for the three persons whose ratings were reported in Section 20.7.1.2, Fisher’s set of analyzed values is shown in the accompanying list:

63   73
80   100
95   85
73   63
100  80
85   95
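Fisher's doubled-pair construction is easy to retrace in code. The sketch below is illustrative: it takes the three pairs of ratings from the list above, forms the 2N interchanged pairs, and computes the ordinary Pearson correlation coefficient on them.

```python
from statistics import mean
from math import sqrt

def intraclass_r(pairs):
    """Fisher's intraclass correlation: the ordinary Pearson r computed
    on the 2N pairs formed by listing each pair in both orders."""
    doubled = pairs + [(y, x) for x, y in pairs]
    xs = [p[0] for p in doubled]
    ys = [p[1] for p in doubled]
    xb, yb = mean(xs), mean(ys)  # equal, since both lists hold the same values
    sxy = sum((x - xb) * (y - yb) for x, y in doubled)
    sxx = sum((x - xb) ** 2 for x in xs)
    syy = sum((y - yb) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# The three pairs of ratings from the list in the text
icc = intraclass_r([(63, 73), (80, 100), (95, 85)])  # about .365
```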
The ICC procedure became popular among statisticians because it used Fisher’s analysis-of-variance approach (further discussed in Chapter 29). The “unassigned locations” of the X and Y values were also appealing for psychometric assessments of “reliability” in repeated tests.
20.7.3.1 Sources of Variations — For the intraclass analysis, the results are partitioned according to two main sources of variation: the inter-individual variations among the individuals being rated and the intra-individual variations among the raters. These variations are expressed as means of the pertinent group variances; and appropriate ratios of those mean variances form the intraclass correlation coefficient, RI. The process resembles the partitioning of group variance for linear regression in Chapter 19, but RI is calculated from the means of the group variances.
20.7.3.2 Example of Calculation — For the data in Table 20.11, the inter-individual group variance is SXXA + SXXB. For n members, each group has n − 1 degrees of freedom, so the total for degrees of freedom is 2n − 2. The mean of the group variance will be (394 + 442)/(20 − 2) = 46.44.
For the intra-individual group variance, Sdd = Σ(di − d̄)2 = 30, and there are n − 1 = 9 degrees of freedom. The mean will be 30/9 = 3.33.
If sI2 represents the inter-individual variance, and so2 represents the intra-individual variance, the intraclass correlation coefficient is

RI = sI2/(sI2 + so2)　　[20.12]
In this instance, RI = 46.44/(46.44 + 3.33) = .93. Because RI will vary from 0 to 1, the value of .93 seems impressively high (as the ordinary correlation coefficient would be).
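The arithmetic of this worked example can be retraced directly from the figures quoted above (394, 442, and Sdd = 30, with n = 10). A minimal sketch:

```python
# Variance components quoted in the worked example for Table 20.11
sxx_a, sxx_b = 394, 442  # inter-individual sums of squares for the two methods
sdd = 30                 # intra-individual sum of squares, sum of (di - dbar)^2
n = 10

s2_inter = (sxx_a + sxx_b) / (2 * n - 2)  # (394 + 442)/18 = 46.44
s2_intra = sdd / (n - 1)                  # 30/9 = 3.33
r_i = s2_inter / (s2_inter + s2_intra)    # Formula [20.12]; about .93
```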
20.7.3.3 Problems and Complexities in RI — An immediately evident problem in RI is that the high value just noted for the data in Table 20.11 does not indicate the discrepancies — with method