
measurement can be called the rater. The term rater has two advantages over observer. First, many decisions are expressed as ratings in a scale of categories, such as adenocarcinoma or 1+ cardiac enlargement, rather than as purely descriptive statements, such as “roseating clusters of multi-nucleated cells” or “outward bulge of apex.” Second, when the observed entity requires technologic preparation — such as a radiographic film — variations can arise from the way the technologic apparatus was used in positioning the patient or in taking and developing the film. Since “variability” can arise from both the technologic process that produces an entity and the person who interprets it, the term rater indicates the latter source.

The term observer variability can then include the combination of variations that can arise from both the process and the rater. In certain technologic measurements where a numerical value is produced without the intercession of a human rater, the variability is usually called process, because human variations can still arise when the results are transcribed or transmitted.

20.3.3 Number of Raters

Any comparison requires at least a pair of ratings for each entity, but the pair can come from one rater or two, and sometimes the same entity can receive more than two ratings.

20.3.3.1 Single Rater — If the same rater provides both ratings, the comparison can have several formats. In a common arrangement, intra-rater concordance is assessed when the rating is repeated by the same observer (such as a radiologist or pathologist) after enough time has elapsed for the first rating to have been forgotten.

In other circumstances, which might be called intraclass concordance, two (or more) ratings are available almost simultaneously. This situation can occur when the same laboratory measures aliquots of the same specimen to check repeatability of the process. In another situation, individual ratings are available for somewhat similar entities, such as a pair of twins or brothers. In the latter situations, the ratings cannot be assigned to a specific source, such as Method A vs. Method B, or Sibling A vs. Sibling B, because either one of the ratings can go in the first position. The management of these “unassigned” pairs was the stimulus for development of the intraclass correlation coefficient, discussed later.

20.3.3.2 Two or More Raters — The most common test of agreement is the inter-rater concordance between two ratings for the same entity, offered by Rater A vs. Rater B. Even when the same entity receives more than two ratings, many investigators prefer to check pairwise concordances for rater A vs. B, A vs. C, B vs. C, etc. rather than calculate a relatively nonspecific single index that covers all the raters simultaneously.

If measurements of the same entities come from three or more raters, concordance can be described for the pertinent pairs of two raters, or for the overall results among the group of raters. Because the mathematical activities become particularly complex for more than two sets of ratings, the challenge of indexing multiple observers will be saved for the end of the chapter in Section 20.9. The main discussion throughout the chapter will emphasize pairwise comparisons for two raters.

20.3.4 Types of Scale

Concordance can be checked only if each entity is rated in commensurate scales, having the same values available for citation in each scale. Statistical indexes will be needed for patterns in which the scales for both variables are: dimensional, e.g., laboratory measurements; binary, e.g., diagnostic marker tests; ordinal, e.g., ratings for staging systems or grades of clinical severity; or nominal, e.g., histopathologic categories.

20.3.5 Individual and Total Discrepancies

Each of the four patterns of data needs arrangements and indexes to cite individual discrepancies between each pair of ratings and to summarize the total pattern of discrepancies.


20.3.5.1 Categorical Data — For categorical data, the results are arranged in a two-way contingency table that is often called an agreement matrix. It shows the frequency counts in each cell formed by the rows and columns for each observer’s ratings. Individual pairs of ratings will have different degrees of disagreement according to whether the scales are binary, ordinal, or nominal. Perfect agreements occur in the appropriate diagonal cells; and all other cells show different degrees of partial disagreement that are managed as discussed later.

The frequency counts for perfect or partial agreements in the cells can be added and then divided by the total to show proportions of different types of agreement.

20.3.5.2 Dimensional Data — An ordinary graph of dimensional data can show rater A’s results as {Xi}, and rater B’s corresponding measurements as {Yi}. Figure 20.2 is an example of such a graph, for agreement between plasma and salivary measurements of caffeine.

The individual discrepancies can be expressed as di = Xi − Yi at each point, i, of the data. The sum of the discrepancies can be converted to a central index, such as a median, or a mean, which will be Σdi/N for N points of data.

To get a better idea of magnitudes, the individual increments can be squared, added, divided by N, and then reconverted by taking the square root. The value of √(Σdi²/N) is called the quadratic mean or root mean square (mentioned in Section 3.8.1) of the increments. The smaller the square root value, the better is the agreement. This procedure is analogous to what was done in Section 7.8.1 for deviations in data for two paired groups or for before-and-after results in the same group. The increments here refer to deviations in pairs of measurements for the same N entities.
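As a concrete illustration of this arithmetic, here is a minimal sketch in Python; the paired values are invented for the example and are not the caffeine data of Figure 20.2.

```python
import math

# Hypothetical paired measurements of the same N entities:
# x[i] = rater A's value, y[i] = rater B's value.
x = [5.1, 7.3, 2.8, 9.0, 4.4]
y = [5.4, 7.0, 3.1, 8.6, 4.5]

d = [xi - yi for xi, yi in zip(x, y)]                 # individual discrepancies d_i = X_i - Y_i
mean_d = sum(d) / len(d)                              # central index (mean) of the discrepancies
rms_d = math.sqrt(sum(di**2 for di in d) / len(d))    # root mean square: sqrt(sum(d_i^2)/N)

print(f"mean discrepancy = {mean_d:.3f}")
print(f"root mean square discrepancy = {rms_d:.3f}")  # the smaller, the better the agreement
```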

Other challenges in evaluating agreement for dimensional pairs of data will be discussed later in Section 20.7.

 

FIGURE 20.2
Correlation between caffeine concentrations in plasma (ordinate) and saliva (abscissa) in 12 subjects after single oral dose over 16 hr. [Figure: scatterplot of caffeine concentration in plasma (µg/ml) against caffeine concentration in saliva (µg/ml), with r = 0.978.]

20.3.6 Indexes of Directional Disparity

Expressions of agreement do not indicate the directional effect of the discrepancies. For example, two raters can generally disagree in an apparently random manner, or one rater may be consistently higher or lower than the other. In addition, the discrepancies may have about the same magnitude in all zones of the data or may get larger or smaller in different zones. The bias in directional disparities will be expressed differently according to the four types of data, and also according to whether the bias refers to raters or to zones of data.

20.3.7 “Adjustment” for Chance Agreement

Another distinctive challenge in studies of concordance is that agreements can arise by chance alone. For example, suppose you know nothing about the substantive content of a certifying examination containing five choices of a single correct answer for each question. With random guesses alone, you should correctly answer 20% of the questions. Analogously, if two radiologists regularly rate 90% of chest films as being normal, we could expect their normal ratings to agree randomly on 81% (= .90 × .90) of the occasions.


In multiple-choice examinations, “guesses” are penalized when points are subtracted for wrong answers but not for “blanks” where no answer is offered. In studies of concordance, an analogous type of penalty can be used to make adjustment for the number of agreements that might have occurred by chance alone. If the raters are not given a well-chosen challenge, however, the adjustment process can produce an excessive penalty. For example, suppose two radiologists are tested for agreement on the chest films of a university freshman class, for whom about 95% (or more) of the films will be normal. Even if the radiologists achieve almost perfect agreement, they may be harshly penalized because so large a proportion of the agreement (.90 = .95 × .95) might be ascribed to chance alone. Consequently, the distribution in the challenge group becomes particularly important if agreement is adjusted for chance.

Adjustments for chance in categorical data are cited with the Kappa index discussed later, but are seldom (if ever) applied to dimensional scales, for which the same two values have only a remote chance of being chosen randomly from the limitless (or at least large) number of dimensional choices for each variable.

20.3.8 Stability and Stochastic Tests

For tests of stability in small groups, the descriptive index of concordance is usually converted to a stochastic index that is checked with either a Z or chi-square procedure. Fortunately, the modern emphasis on stochastic tests has not extended to the evaluation of concordance. The descriptive indexes are usually reported directly and are seldom replaced by P values or confidence intervals. In fact, stochastic tests are often left unexamined or unreported, because the focus is on descriptive agreement and because the group sizes are usually large enough for the descriptive indexes to be regarded as stable.

Stochastic tests for concordance are discussed in Section 20.8.

20.4 Agreement in Binary Data

If Yes and No are used to represent the two categories of binary data, four possible results can occur for each pair of ratings. They form the 2 × 2 agreement matrix of frequency counts shown in Table 20.1.

TABLE 20.1
Agreement Matrix for Binary Data

                         Rater A
Rater B        Yes       No        Total
Yes            a         b         f1
No             c         d         f2
TOTAL          n1        n2        N

The two raters agree in the a and d cells, and disagree in the b and c cells, but the results have directional distinctions as follows:

 

 

 

Rater A    Rater B    Result      Location of Cell in Table 20.1
Yes        Yes        Agree       a
No         Yes        Disagree    b
Yes        No         Disagree    c
No         No         Agree       d

If one of the raters is the “gold standard,” the other’s ratings are correct in the a and d cells and incorrect in the b and c cells. Without a gold standard, however, we need a way to summarize agreement rather than conformity.


20.4.1 Proportion (Percentage) of Agreement

Among the many statistical proposals for a summary index of agreement in binary data, only a few expressions are generally popular or valuable.

The most obvious, straightforward, and easiest-to-understand index is the proportion (or percentage) of agreement. It resembles a “batting average” in which the perfect agreements are “hits” and anything else is an “out.” In the symbols of Table 20.1, the proportion of agreement is

po = (a + d)/N          [20.1]

In the example shown in Table 20.2, the two examiners agreed on 40 candidates, whom both passed, and on 20 others, whom both failed. The proportion or percentage agreement is (40 + 20)/80 = 60/80 = 75%.
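For readers who want to verify the arithmetic by computer, here is a minimal Python sketch of Formula [20.1], using the counts of Table 20.2 (shown just below); the cell labels follow Table 20.1.

```python
# Cells of the 2 x 2 agreement matrix (Table 20.1), filled with the counts of Table 20.2:
# a = Pass/Pass, b and c = the two kinds of disagreement, d = Fail/Fail.
a, b, c, d = 40, 2, 18, 20
N = a + b + c + d

p_o = (a + d) / N    # Formula [20.1]: proportion of observed agreement
print(p_o)           # 0.75, i.e., 75% agreement
```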

TABLE 20.2
Agreement Matrix of Ratings on a Specialty Board Certification Examination

Ratings by          Ratings by Examiner A
Examiner B       Pass      Fail      Total
Pass             40        2         42
Fail             18        20        38
TOTAL            58        22        80

The simplicity of the index of percentage agreement is accompanied by three disadvantages. First, it does not indicate how the agreements and disagreements are distributed. Thus, the same value of 75% agreement would be obtained for the two raters in Table 20.2 if the 2 × 2 agreement matrix for 80 ratings had any of the following arrangements:

 

20   18        30   10        60   10         0   10
 2   40        10   30        10    0        10   60

A second disadvantage is that the index does not show whether agreement is better directionally when the examiners decide to pass a candidate than when they decide to fail. A third disadvantage is that the expression of percentage agreement makes no provision for the concordance that might occur by chance alone.

20.4.2 φ Coefficient

The φ coefficient is an index of association (or correlation) for the data in a 2 × 2 table. The descriptive index, which is further discussed in Chapter 27, is calculated from the (uncorrected) X² statistic as φ² = X²/N, where N = total number of frequency counts. Thus,

φ = √(X²/N) = (ad − bc)/√(f1 f2 n1 n2)          [20.2]

for the data in Table 20.1. Because X² for a 2 × 2 table is calculated with values that are expected by chance, the φ coefficient seems to offer a method of adjusting for chance-expected results in a 2 × 2 agreement matrix. With Formula [20.2], the value of φ in Table 20.2 would be [(40)(20) − (2)(18)]/√((58)(22)(42)(38)) = .535.
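A short Python sketch of Formula [20.2], again using the counts of Table 20.2, reproduces the .535 value:

```python
import math

a, b, c, d = 40, 2, 18, 20                 # cells of Table 20.2
f1, f2 = a + b, c + d                      # row totals (Examiner B: Pass, Fail)
n1, n2 = a + c, b + d                      # column totals (Examiner A: Pass, Fail)

phi = (a * d - b * c) / math.sqrt(f1 * f2 * n1 * n2)   # Formula [20.2]
print(round(phi, 3))                       # 0.535
```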

However, φ has several disadvantages. First, it is aimed at assessing the trend in two independent variables, rather than concordance between two closely related variables. Suppose the ratings in Table 20.2 were transposed to show the following results:


 

                   Examiner A
Examiner B      Pass      Fail      Total
Pass            2         40        42
Fail            20        18        38
Total           22        58        80

The percentage of agreement would drop precipitously to only 25% [= (2 + 18)/80], but φ would have the same absolute value, .535, although in a negative direction.

Because φ can take positive and negative signs, it might indicate whether the agreement is better or worse than might be expected by chance. The construction of φ , however, depends on examining the observed-minus-expected values in all four cells of the table. The value of φ thus contains individual discrepancies from cells showing agreements and disagreements; and it really tells us more than we want to know if the goal is to look at agreement alone or at disagreement alone.

Furthermore, the usual calculation of X2 refers to N different members, each “rated” for a different variable. In an agreement matrix, however, each member has been rated twice for the same variable. There are really 2N ratings under consideration, not N. Thus, if we rearrange Table 20.2 to show total ratings rather than an agreement matrix for two observers, we would get the 160 ratings that appear in Table 20.3. The two arrangements for the same set of data are often called matched (in Table 20.2) and unmatched (in Table 20.3). The differences become important later (in Chapter 26) when we consider appropriate tabular structures for appraising results of exposed and non-exposed persons in a matched-pair case-control study.

TABLE 20.3
Total Ratings in Table 20.2

                     Rating
Examiner      Pass      Fail      Total
A             58        22        80
B             42        38        80
TOTAL         100       60        160

Despite these apparent disadvantages, the results of φ are usually reasonably close to those of the chance-adjusted kappa discussed in the next section. In fact, it can be shown algebraically that when the marginal totals of Table 20.1 are equal, i.e., n1 = n2 and f1 = f2, the values of φ and kappa are identical. The φ coefficient was recently used, without apparent complaint by editors or reviewers, to adjust for chance in reporting agreement among observers’ opinions about whether antihistamine treatment had been used prophylactically during anesthesia.8

20.4.3 Kappa

Kappa is now regarded as the best single index to adjust for chance agreement in a 2 × 2 table of concordance. Specifically designed for this purpose by the clinical psychologist, Jacob Cohen,5 and medically popularized by J. L. Fleiss,9 kappa forms a ratio that adjusts the observed proportion of agreement for what might be expected from chance alone.

To illustrate the strategy, consider the situation in Table 20.2. Because Examiner A passes the proportion, pA, of candidates and Examiner B passes the proportion, pB, we would expect the two examiners to agree by chance in the proportion pA × pB. Similarly, the two examiners would be expected to agree by chance in failing the proportion qA × qB. For the data in Table 20.2, pA = 58/80 = .725; pB = 42/80 = .525; qA = 22/80 = .275; and qB = 38/80 = .475. The sum of (.725)(.525) + (.275)(.475) = .381 + .131 = .512 would therefore be expected as the chance proportion of agreement for the 80 candidates being rated. Because the observed proportion of agreement was 60/80 = .75, the observed agreement exceeded the


chance expectation by .750 − .512 = .238. If the observed agreement were perfect, po would be 1 and the result would have exceeded chance by 1 − .512 = .488. The ratio of the observed superiority to perfect superiority is .238/.488 = .488, which is the value of kappa.

Expressed in symbols, using po for observed proportion of agreement and pc for the agreement expected by chance, kappa is

κ = (po − pc)/(1 − pc)          [20.3]

 

20.4.3.1 Computation of Kappa — To convert expression [20.3] into a calculational formula, we can use the symbols of the 2 × 2 agreement matrix. The agreement expected by chance is

pc = pApB + qAqB = (f1/N)(n1/N) + (f2/N)(n2/N) = (f1n1 + f2n2)/N²

Because the observed agreement is (a + d)/N, the observed minus expected value will be [(a + d)/N] − (f1n1 + f2n2)/N² = [N(a + d) − (f1n1 + f2n2)]/N². The value of 1 − pc will be [N² − (f1n1 + f2n2)]/N², and so the calculational formula for kappa will be

κ = [N(a + d) − (f1n1 + f2n2)]/[N² − (f1n1 + f2n2)]          [20.4]

 

An alternative calculational formula, whose derivation is left as an algebraic exercise for the reader, is

κ = 2(ad − bc)/[(b + c)N + 2(ad − bc)]          [20.5]

Formula [20.4] is probably most rapid to execute on a hand calculator if the row and column totals are available and if the calculations are suitably organized.

For the two medical-board examiners in Table 20.2,

κ = 2(40 × 20 − 2 × 18)/[(2 + 18)(80) + 2(40 × 20 − 2 × 18)] = 1528/(1600 + 1528) = 0.49

using Formula [20.5], or

 

 

 

 

κ = [80(40 + 20) − (58 × 42 + 22 × 38)]/[80² − (58 × 42 + 22 × 38)] = 1528/3128 = 0.49

 

using Formula [20.4].
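The same counts can be checked by computer. The following Python sketch applies Formula [20.3] directly, and also the calculational Formula [20.4], to the data of Table 20.2:

```python
a, b, c, d = 40, 2, 18, 20                   # cells of Table 20.2
N = a + b + c + d
f1, f2 = a + b, c + d                        # Examiner B's totals (Pass, Fail)
n1, n2 = a + c, b + d                        # Examiner A's totals (Pass, Fail)

p_o = (a + d) / N                            # observed agreement
p_c = (f1 * n1 + f2 * n2) / N**2             # chance-expected agreement
kappa = (p_o - p_c) / (1 - p_c)              # Formula [20.3]

# Equivalent calculational form, Formula [20.4]
kappa_alt = (N * (a + d) - (f1 * n1 + f2 * n2)) / (N**2 - (f1 * n1 + f2 * n2))

print(round(kappa, 2), round(kappa_alt, 2))  # both print 0.49
```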

20.4.3.2 Interpretation of Kappa — Kappa has a range of values from −1 to +1. When the observed agreement is perfect, i.e., po = 1, kappa will be +1. If the observed agreement equals the chance-expected agreement, so that po = pc, kappa will be 0. If the observed agreement is less than the chance-expected agreement, i.e., po < pc, kappa will become negative. In the special case when both raters choose each of the categories half the time, pc will be [(N/2)(N/2) + (N/2)(N/2)]/N² = 1/2. In this case, if the raters also happen to have no agreement at all, so that a = d = 0, then po will equal 0, and kappa will take on its minimum value of −1, i.e., [0 − (1/2)]/[1 − (1/2)].


Kappa is ordinarily used to measure concordance between two variables, i.e., two raters. If more than two raters (or processes) are under comparison, kappa indexes can be calculated for each separate pairwise agreement, i.e., for A vs. B, for B vs. C, for A vs. C, etc. A summary “group value” could be obtained as the median or other average of the individual kappa indexes. Alternatively, Fleiss9 has developed a mathematically complicated method that allows calculation of a single generalized kappa for three or more observers.
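The pairwise strategy is easy to carry out by computer. The sketch below (in Python, with invented binary ratings for three hypothetical raters; it is not the generalized Fleiss procedure) computes each pairwise kappa and summarizes the group with a median:

```python
from itertools import combinations
from statistics import median

def kappa_2x2(r1, r2):
    """Cohen's kappa for two lists of binary (1 = Yes, 0 = No) ratings of the same entities."""
    a = sum(1 for x, y in zip(r1, r2) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(r1, r2) if x == 0 and y == 1)
    c = sum(1 for x, y in zip(r1, r2) if x == 1 and y == 0)
    d = sum(1 for x, y in zip(r1, r2) if x == 0 and y == 0)
    N = a + b + c + d
    p_o = (a + d) / N
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2
    return (p_o - p_c) / (1 - p_c)

# Hypothetical ratings by raters A, B, C on the same 10 entities.
ratings = {
    "A": [1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    "B": [1, 0, 0, 1, 0, 1, 1, 0, 1, 1],
    "C": [1, 1, 0, 1, 1, 1, 0, 0, 0, 1],
}

pairwise = {f"{p} vs {q}": kappa_2x2(ratings[p], ratings[q])
            for p, q in combinations(ratings, 2)}
print(pairwise)                               # kappa for A vs B, A vs C, B vs C
print("median kappa:", median(pairwise.values()))
```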

20.4.3.3 Quantitative Standards — Although stochastic P values can be calculated, the quantitative significance of kappa, i.e., its descriptive magnitude, is more important than the associated P value. Thus, P < .05 might be necessary to show that the results are stable, but the P values are otherwise useless for denoting meaningful degrees of observer agreement.

Kappa thus holds the distinction of being one of the few descriptive indexes that has received serious statistical efforts to develop standards for quantitative significance. For increments and correlation coefficients, quantitative standards have been proposed by clinical psychologists10 and clinical epidemiologists.11 As guidelines for the quantitative strength of agreement denoted by kappa, however, two sets of criteria have been proposed by statisticians: Landis and Koch12 and Fleiss.9 The proposals are shown in Figure 20.3. The Fleiss guidelines seem “tougher” than the Landis–Koch guidelines. Fleiss demands values of ≥ .75 for excellent agreement, whereas Landis and Koch regard .61–.80 as substantial and .81–1.00 as almost perfect. Fleiss deems everything < .40 as poor, a designation that does not occur for Landis and Koch until kappa gets below 0.

Despite the disagreement in criteria for agreement, however, both sets of guidelines can be applauded for their pioneering attempts to demarcate descriptive zones of quantitative significance.

FIGURE 20.3
Scales for strength of agreement for kappa, as proposed by Fleiss9 and by Landis and Koch.12 [Figure: vertical kappa scale from −1 to +1. Fleiss zones: excellent (above .75), fair to good (.40 to .75), poor (below .40). Landis–Koch zones: almost perfect (.81–1.00), substantial (.61–.80), moderate (.41–.60), fair (.21–.40), slight (.00–.20), poor (below 0).]

20.4.3.4 Problems in Distribution of Challenge — Values of kappa have important problems, however, that were not considered in the proposals for quantitative guidelines. The problems, first noted by Helena Kraemer,13 arise when the challenge contained in the research is maldistributed. The chance-expected penalty of kappa may then become unfair to raters who have excellent observational agreement but who are victims of unsatisfactory research architecture.

To illustrate this situation, consider the agreement matrix in Table 20.4. The two radiologists have 97% agreement [= (102 + 3)/108] in designating the chest films of first-year university students as normal or abnormal. The expected chance agreement, however, is [(103)(104) + (5)(4)]/108² = .92. Using the formula (po − pc)/(1 − pc), kappa would be (.97 − .92)/(1 − .92) = .05/.08 = .625. The value of .625 for kappa still indicates an agreement that is “substantial” (according to Landis–Koch) or “fair to good” (according to Fleiss), but the result is much less impressive than the former 97% agreement.

TABLE 20.4
Ratings of Chest Films for 108 First-Year University Students

                      Radiologist A
Radiologist B      Normal    Abnormal    Total
Normal             102       2           104
Abnormal           1         3           4
TOTAL              103       5           108


Because this problem arises from a maldistribution of the challenge group, the two radiologists would need a different research design to allow the remorseless kappa to produce a more impressive value for their excellent agreement.

With expected agreement calculated as (f1n1 + f2n2)/N², the penalty factor is substantially increased when f1n1 + f2n2 has relatively high values in relation to N². In a special analysis,14 the penalty factor was shown to have its minimum values when the challenge group is distributed equally so that f1 = f2 = N/2 and n1 = n2 = N/2. Accordingly, the two radiologists would receive a better challenge if the test population contained roughly equal numbers of abnormal and normal films, rather than a predominantly normal group.

 

 

 

 

For example, suppose the 108 challenge cases were distributed as shown in Table 20.5. In this situation, the proportional agreement would still be 105/108 = 97%, but the value of f1n1 + f2n2 would be (54)(55) + (54)(53) = 5832, in contrast to the previous value of (103)(104) + (5)(4) = 10732. Kappa would rise to [(108)(105) − 5832]/[(108)² − 5832] = .94, and would indicate much better agreement than before.

TABLE 20.5
Ratings of Chest Films for 108 Selected Patients

                      Radiologist A
Radiologist B      Normal    Abnormal    Total
Normal             53        2           55
Abnormal           1         52          53
TOTAL              54        54          108

If the challenge was not suitably arranged before the research was done, the problem can still be managed, but the management requires replacing the single “omnibus” value of kappa by indexes of specific agreement in different zones.

20.4.4 Directional Problems of an “Omnibus” Index

Regardless of whether kappa or either of the two other indexes is used, a single “omnibus” index cannot answer two sets of directional questions that regularly arise about agreement in zones and bias in raters. In dimensional data, the agreements may get better or worse with changes in the dimensional magnitudes for different zones of the data. For binary data, only two zones occur, but the proportions of agreement may differ in those zones.

This issue is neglected in any “omnibus” index that gives a single value for binary concordance, without distinguishing the two types of agreement for positive and negative ratings. In many evaluations of concordance, however, the goal is to answer two questions about agreement, not just one. Instead of a single summary statement, we may want special indexes to show how closely the raters agree separately on positive decisions and on negative decisions. This problem is particularly prominent, as noted later in Chapter 21, in citing accuracy for diagnostic tests. Omnibus indexes of diagnostic agreement have generally been avoided because they do not separate sensitivity and specificity, or accuracy for positive and negative tests.

20.4.4.1 Indexes of Specific Agreement — Several approaches have been proposed for determining indexes of specific agreement. The most useful for binary data is the specific proportionate agreement for positive and for negative ratings, denoted as ppos and pneg.

If f1 and n1 are each observer’s total of positive decisions, the expected positive agreement can be estimated as (f1 + n1)/2. Because positive agreement occurs with a frequency of a, its proportion will be

ppos = a/[(f1 + n1)/2] = 2a/(f1 + n1)          [20.6]

The expected negative agreement will be (f2 + n2)/2 and its proportion will be

 

pneg = d/[(f2 + n2)/2] = 2d/(f2 + n2)          [20.7]

For the two previous sets of challenges to radiologists, the value of ppos will be (2)(102)/(104 + 103) = .985 in Table 20.4 and (2)(53)/(55 + 54) = .97 in Table 20.5. The value of pneg will be (2)(3)/(4 + 5) = .67 in Table 20.4 and (2)(52)/(53 + 54) = .97 in Table 20.5.


For these reasons, when concordance is evaluated in a 2 × 2 agreement matrix, kappa is probably the best index to use if you are using only one; but the results should always be accompanied by values of ppos and pneg to indicate what is really happening. The special indexes for the two types of agreement can be obtained easily without any need for adjustments due to chance, because the “expected” value is used for each calculation.
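A brief Python sketch of Formulas [20.6] and [20.7], applied to the counts of Table 20.4, reproduces the .985 and .67 values (here the first category, “Normal,” plays the role of the positive rating, as in the calculation above):

```python
# Cells of Table 20.4 (two radiologists rating 108 chest films):
# a = Normal/Normal, b and c = disagreements, d = Abnormal/Abnormal.
a, b, c, d = 102, 2, 1, 3
f1, f2 = a + b, c + d        # Radiologist B's totals (Normal, Abnormal)
n1, n2 = a + c, b + d        # Radiologist A's totals (Normal, Abnormal)

p_pos = 2 * a / (f1 + n1)    # Formula [20.6]: specific agreement on the first ("Normal") category
p_neg = 2 * d / (f2 + n2)    # Formula [20.7]: specific agreement on the second ("Abnormal") category
print(round(p_pos, 3), round(p_neg, 2))   # 0.985 and 0.67
```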

20.4.4.2 McNemar Index of Bias for Binary Data — Proposing a simple index for biased disagreements in a 2 × 2 concordance matrix, such as Table 20.1, Quinn McNemar3 used the following reasoning: If the disagreements occur randomly, the total of b + c in the b and c cells of Table 20.1 should be split about equally between b and c. If the Yes/No disagreements occur much more often than No / Yes disagreements, or vice versa, the b and c values will be unequal.

The inequality is expressed proportionately in McNemar’s index

(b − c)/(b + c)          [20.8]

 

The lowest possible absolute value for this index is 0. It occurs when b = c, i.e., when the number of Yes/No paired ratings is the same as the number of No/Yes ratings. The highest possible absolute value, |1|, occurs either when b = 0, so that the index is (−c)/(+c) = −1, or when c = 0, so that the index is b/b = +1. When either b = 0 or c = 0, all the disagreement is in one direction.

Thus, the closer McNemar’s index is to zero, the more likely are the two observers to be “unbiased” in their ratings. As the index approaches |1|, the observers have increasingly substantial differences in the way they disagree. If one of the observers is regarded as the “gold standard,” the McNemar index will be an index of inaccuracy rather than merely bias in directional disagreement.

In Table 20.2, the two observers disagree on 18 + 2 = 20 occasions. The pattern is Fail/Pass in 18 and Pass/Fail in 2. The McNemar index will be (2 − 18)/(18 + 2) = −16/20 = −.8. This relatively high magnitude suggests a substantial difference among the examiners. (The McNemar index is stochastically evaluated with a special chi-square test in Section 20.8.1.) If Examiner A were regarded as the gold standard in evaluating candidates, the McNemar index would show whether Examiner B has biased inaccuracy. In the current situation, however, Examiner B merely seems “tougher” than A, failing more candidates than might be expected if the disagreements went equally in both directions.
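A minimal Python sketch of Formula [20.8], using the two disagreement cells of Table 20.2:

```python
def mcnemar_index(b, c):
    """Directional bias index (b - c)/(b + c), computed from the two disagreement cells."""
    return (b - c) / (b + c)

# Table 20.2: b = Pass-by-B/Fail-by-A disagreements, c = Fail-by-B/Pass-by-A disagreements.
b, c = 2, 18
print(mcnemar_index(b, c))   # -0.8: the disagreements run mainly in one direction (B "tougher" than A)
```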

The McNemar index was used15 to denote bias in the diagnosis of toxic shock syndrome when clinicians reviewed scenarios for a series of cases that were identical in history, physical findings, and laboratory tests, but different in sex, and in presence/absence of menstruation or use of tampons. The proportion of biased diagnoses for toxic shock syndrome rose progressively when the same clinical scenarios respectively included a woman rather than a man, a statement about menstruation rather than no statement, and specific mention of a tampon rather than no mention.

The disadvantage of the McNemar index is that the emphasis on disagreements eliminates attention to the rest of the results. For example, the two observers in Table 20.2 would have the same McNemar index of .8 if they seldom agreed, so that the 2 × 2 matrix was

     3     2
    18     4

or if the disagreements were an uncommon event among all the other concordances in an agreement matrix such as

   190     2
     8   175

20.5 Agreement in Ordinal Data

For the three or more rows and columns of ordinal data, the paired ratings can have different degrees of “partial” agreement.

20.5.1 Individual Discrepancies

With g ordinal grades, the frequency counts for the two raters will form a g × g agreement matrix. A straightforward index of proportional agreement can be formed from the sum of frequencies for perfect agreement in cells of the downward left-to-right diagonal. For example, Table 20.6 shows concordance of intraoperative Doppler flow imaging vs. pre-operative biplane ventriculography in rating the severity of mitral regurgitation in 246 patients.16

TABLE 20.6
Severity of Mitral Regurgitation as Graded by Ventriculography and by Transesophageal Doppler Color Flow (TDCF) Imaging (Data from Chapter Reference 16)

Severity         Severity in Ventriculography
in TDCF      0      1      2      3      4      Total
0            91     28     11     1      0      131
1            33     20     11     3      2      69
2            6      3      4      1      0      14
3            2      1      5      10     2      20
4            0      0      2      1      9      12
TOTAL        132    52     33     16     13     246

Perfect agreements occur in (91 + 20 + 4 + 10 + 9) = 134 of the 246 cases, a proportion of 54%.

The disagreements in all the other cells can be managed in two ways. In one approach, they are counted merely as disagreements. In a better approach, however, they are weighted according to the degree of partial disagreement. For example, paired ratings of 0–1 in Table 20.6 would have less disagreement than the pair 0–2, which in turn would have less disagreement than 0–3.

20.5.2 Weighting of Disagreements

At least three tactics can be used to create weights for partial disagreements.

20.5.2.1 Categorical-Distance Method — The most simple and commonly used procedure assigns a point for each unit of categorical distance from the diagonal cells of perfect agreement. The subsequent calculations are easier if the weights increase for increasing degrees of agreement, rather than disagreement, as shown in Table 20.7. With g ordinal categories, ranked as 1, 2, 3, …, g, the maximum disparity is in cells rated as (1,g) or (g,1). These cells are given weights of 0. The next worst disparities will be in cells rated as (2,g), (g,2), (1,g−1), or (g−1,1). These cells are given weights of 1. The grading process continues until the maximum possible weight, g−1, is given for perfect agreement in the diagonal cells (1,1), (2,2), (3,3), …, (g,g).

TABLE 20.7
Scheme of Weights for One-Point Units of Agreement in Ordinal Categories

Ordinal              Ordinal Rating by A
Rating by B     1      2      3      4     ...   g−1    g
1               g−1    g−2    g−3    g−4   ...   1      0
2               g−2    g−1    g−2    g−3   ...   2      1
3               g−3    g−2    g−1    g−2   ...   3      2
4               g−4    g−3    g−2    g−1   ...   4      3
...             ...    ...    ...    ...   ...   ...    ...
g−1             1      2      3      4     ...   g−1    g−2
g               0      1      2      3     ...   g−2    g−1
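To make the weighting scheme concrete, the following Python sketch builds the categorical-distance weights of Table 20.7 for g ordinal grades and applies them to the agreement matrix of Table 20.6. The “weighted proportion of agreement” printed at the end is offered only as an illustrative summary (observed weight earned divided by the maximum possible weight), not as a formula prescribed in the text.

```python
# Agreement matrix of Table 20.6: rows = TDCF grade 0-4, columns = ventriculography grade 0-4.
table_20_6 = [
    [91, 28, 11,  1,  0],
    [33, 20, 11,  3,  2],
    [ 6,  3,  4,  1,  0],
    [ 2,  1,  5, 10,  2],
    [ 0,  0,  2,  1,  9],
]

g = len(table_20_6)                       # number of ordinal grades
# Categorical-distance weights (Table 20.7): g-1 on the diagonal, minus one point
# for each unit of distance from it, down to 0 in the (1,g) and (g,1) corners.
weights = [[(g - 1) - abs(i - j) for j in range(g)] for i in range(g)]

N = sum(sum(row) for row in table_20_6)
perfect = sum(table_20_6[i][i] for i in range(g))
print(perfect, round(perfect / N, 2))     # 134 perfect agreements, proportion 0.54

# Illustrative weighted agreement: weight earned over all cells / maximum possible weight.
earned = sum(weights[i][j] * table_20_6[i][j] for i in range(g) for j in range(g))
print(round(earned / ((g - 1) * N), 2))
```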
