
16 Measuring agreement

Learning objectives

When you have finished this chapter you should be able to:

- Explain the difference between association and agreement.
- Describe Cohen's kappa, calculate its value and assess the level of agreement.
- Interpret published values for kappa.
- Describe the idea behind ordinal kappa.
- Outline the Bland–Altman approach to measuring agreement between metric variables.

To agree or not agree: that is the question

Association is a measure of the inter-connectedness of two variables: the degree to which they tend to change together, either positively or negatively. Agreement is the degree to which the values in two sets of data actually agree. To illustrate this idea, look at the hypothetical data in Table 16.1, which shows the decision by a psychiatrist and by a psychiatric social worker (PSW) on whether to section (Y), or not to section (N), each of 10 individuals with mental ill-health. We would say that the two variables were in perfect agreement if every pair of values were the same.


Table 16.1 Decision by a psychiatrist and a psychiatric social worker whether or not to section 10 individuals suffering mental ill-health

Patient        1   2   3   4   5   6   7   8   9   10
Psychiatrist   Y   Y   N   Y   N   N   N   Y   Y   Y
PSW            Y   N   N   Y   N   N   Y   Y   Y   N

In practical situations this won’t happen, and here you can see that only seven out of the 10 decisions are the same, so the observed level of proportional agreement is 0.70 (70 per cent).

Cohen’s kappa

However, if you had asked each clinician simply to toss a coin to make the decision (heads – section, tails – don't section), some of their decisions would probably still have been the same – by chance alone. You need to adjust the observed level of agreement for the proportion you would have expected to occur by chance alone. This adjustment gives us the chance-corrected proportional agreement statistic, known as Cohen's kappa, κ:

κ = (proportion of observed agreement − proportion of expected agreement) / (1 − proportion of expected agreement)

We can calculate the expected values using a contingency table in exactly the same way as we did for chi-squared (row total × column total ÷ overall total – see Chapter 14). Table 16.2 shows the data in Table 16.1 expressed in the form of a contingency table, with the psychiatrist’s scores in the rows, the PSW’s scores in the columns, and with row and column totals added. The expected values are shown in brackets in each cell.

Table 16.2 Contingency table showing observed (and expected) decisions by a psychiatrist and a psychiatric social worker on whether to section 10 patients (data from Table 16.1)

 

 

                     Psychiatric Social Worker
                     Yes       No        Totals
Psychiatrist  Yes    4 (3)     2 (3)     6
              No     1 (2)     3 (2)     4
Totals               5         5         10

Expected values, shown in brackets, are row total × column total ÷ overall total; for example, for the Yes/Yes cell: (5 × 6)/10 = 3.

We have seen that the observed agreement is 0.70, and we can calculate the expected agreement to be 5 out of 10, or 0.50.¹ Therefore:

κ = (0.70 − 0.50)/(1 − 0.50) = 0.20/0.50 = 0.40

¹ We can expect the two clinicians to agree on 'Yes' three times, and on 'No' two times, making five agreements in total.
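If you would like to check this arithmetic yourself, here is a minimal Python sketch (not part of the original text) that reproduces the calculation from the Table 16.1 data:

```python
# Kappa for the Table 16.1 sectioning decisions.
psychiatrist = ["Y", "Y", "N", "Y", "N", "N", "N", "Y", "Y", "Y"]
psw          = ["Y", "N", "N", "Y", "N", "N", "Y", "Y", "Y", "N"]

n = len(psychiatrist)

# Observed proportional agreement: the fraction of identical decisions.
p_observed = sum(a == b for a, b in zip(psychiatrist, psw)) / n

# Expected agreement: for each category, (row total x column total) / n^2,
# summed over categories -- the same expected values as in Table 16.2.
p_expected = sum(
    (psychiatrist.count(c) / n) * (psw.count(c) / n) for c in {"Y", "N"}
)

kappa = (p_observed - p_expected) / (1 - p_expected)
print(p_observed, p_expected, kappa)   # 0.7, 0.5, 0.4 (up to rounding)
```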

 


Table 16.3 How good is the agreement – assessing kappa

Kappa        Strength of agreement
≤ 0.20       Poor
0.21–0.40    Fair
0.41–0.60    Moderate
0.61–0.80    Good
0.81–1.00    Very good

So after allowing for chance agreements, agreement is reduced from 70 per cent to 40 per cent. Kappa can vary between zero (agreement no better than chance) and one (perfect agreement), and you can use Table 16.3 to assess the quality of agreement. It's possible to calculate a confidence interval for kappa, but it will usually be too narrow (except for quite small samples) to add much insight to your result.
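If you find yourself assessing many kappa values, the Table 16.3 bands can be expressed as a small lookup function; a minimal sketch (the function name is my own, not from the text):

```python
def kappa_strength(kappa: float) -> str:
    """Return the Table 16.3 verbal label for a kappa value."""
    if kappa <= 0.20:
        return "Poor"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Good"
    return "Very good"

print(kappa_strength(0.40))  # 'Fair' -- the sectioning example above
```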

An example from practice

Table 16.4 is from a study into the development of a new quality of life scale for patients with advanced cancer and their families – the Palliative Care Outcome scale (POS) (Hearn et al. 1998). It shows agreement between the patient and staff (who also completed the scale questionnaires) for a number of items on the POS scale. The table also contains values of Spearman's rs, and the proportion of agreements within one point on the POS scale. The level of agreement between staff and patient is either fair or moderate for all items, and agreement within one point is either good or very good.

Table 16.4 From a palliative care outcome scale (POS) study showing levels of agreement between the patient and staff assessment for a number of items on the POS scale. Reproduced from Quality in Health Care, 8, 219–27, courtesy of BMJ Publishing Group

 

 

 

 

 

 

                  No of     Patient score   Staff score          Spearman      Proportion agreement
Item              patients  (% severe)      (% severe)     κ     correlation   within 1 score

At first assessment: 145 matched assessments
Pain              140       24.3            20.0           0.56  0.67          0.87
Other symptoms    140       27.2            26.4           0.43  0.60          0.86
Patient anxiety   140       23.6            30.0           0.37  0.56          0.83
Family anxiety    137       49.6            46.0           0.28  0.37          0.72
Information       135       12.6            13.4           0.39  0.36          0.79
Support           135       10.4            14.1           0.22  0.32          0.79
Life worthwhile   133       13.6            16.5           0.43  0.54          0.82
Self worth        132       15.9            23.5           0.37  0.53          0.82
Wasted time       135        5.9             6.7           0.33  0.32          0.95
Personal affairs  129        7.8            13.2           0.42  0.49          0.96


Table 16.5 Injury Severity Scale (ISS) scores given from case notes by two experienced trauma clinicians to 16 patients in a major trauma unit. Reproduced from BMJ, 307, 906–9, by permission of BMJ Publishing Group

 

 

 

 

 

 

 

 

 

Case no.      1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Observer 1    9   14  29  17  34  17  38  13  29   4  29  25   4  16  25  45
Observer 2    9   13  29  17  22  14  45  10  29   4  25  34   9  25   8  50

Exercise 16.1 Do the highest and the lowest levels of agreement in Table 16.4 coincide with the highest and lowest levels of correlation? Will this always be the case?

Exercise 16.2 Table 16.5 is from a study in a major trauma unit into the variation between two experienced trauma clinicians in assessing the degree of injury of 16 patients from their case notes (Zoltie et al. 1993). The table shows the Injury Severity Scale (ISS) score awarded to each patient.² Categorise the scores into two groups: ISS scores of less than 16, and of 16 or more. Express the results in a contingency table, and calculate: (a) the observed and expected proportional agreement; (b) kappa. Comment on the level of agreement.

² The ISS is used for the assessment of severity of injury, with a range from 0 to 75. ISS scores of 16 or above indicate potentially life-threatening injury, and survival with ISS scores above 51 is considered unlikely.
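As a starting point for this exercise, here is a minimal Python sketch (my own, not part of the original text) of the dichotomisation and cross-tabulation steps; completing the kappa calculation is left to the reader:

```python
# Dichotomise the Table 16.5 ISS scores at the cut-off of 16 given in the
# footnote, then cross-tabulate the paired categories.
from collections import Counter

observer1 = [9, 14, 29, 17, 34, 17, 38, 13, 29, 4, 29, 25, 4, 16, 25, 45]
observer2 = [9, 13, 29, 17, 22, 14, 45, 10, 29, 4, 25, 34, 9, 25, 8, 50]

cat1 = ["<16" if s < 16 else ">=16" for s in observer1]
cat2 = ["<16" if s < 16 else ">=16" for s in observer2]

# Each key is a (observer 1 category, observer 2 category) pair,
# i.e. one cell of the 2 x 2 contingency table.
table = Counter(zip(cat1, cat2))
for cell, count in sorted(table.items()):
    print(cell, count)
```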

A limitation of kappa is that it is sensitive to the proportion of subjects in each category (i.e. to prevalence), so caution is needed when comparing kappa values from different studies – such comparisons are only helpful if the prevalences are similar. Moreover, Cohen's kappa as described above is only appropriate for nominal data, as in the sectioning example, although most data can be 'nominalised', like the ISS values above. In the next section, however, I briefly describe a version of kappa that can handle ordinal data.

Measuring agreement with ordinal data – weighted kappa

The idea behind weighted kappa is best illustrated by referring back to the data in Table 16.5. The two clinicians' ISS scores agree for only five patients, so the observed proportional agreement is only 5/16 = 0.3125 (31.25 per cent). However, several pairs of scores are 'near misses'; patient 2, for example, has scores of 14 and 13. Other pairs of scores are further apart: patient 15 is given scores of 25 and 8! Weighted kappa gives credit for near misses, but its calculation is too complex for this book.
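Although the hand calculation is beyond our scope, weighted kappa is routinely available in statistical software. As a rough illustration only (assuming scikit-learn is installed), here is a sketch using its cohen_kappa_score function on the Table 16.5 scores; note that scikit-learn weights by the ordinal position of each distinct score rather than by the raw score difference, so this is one possible weighting scheme, not the only one:

```python
from sklearn.metrics import cohen_kappa_score

observer1 = [9, 14, 29, 17, 34, 17, 38, 13, 29, 4, 29, 25, 4, 16, 25, 45]
observer2 = [9, 13, 29, 17, 22, 14, 45, 10, 29, 4, 25, 34, 9, 25, 8, 50]

# Unweighted kappa treats every disagreement as equally bad;
# 'linear' weights give partial credit to near misses.
unweighted = cohen_kappa_score(observer1, observer2)
weighted = cohen_kappa_score(observer1, observer2, weights="linear")
print(unweighted, weighted)
```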

Measuring the agreement between two metric continuous variables

When it comes to measuring agreement between two metric continuous variables, the obvious problem is the large number of possible values: it's quite possible that none of them will be the same. One solution is to use a Bland-Altman chart (Bland and Altman 1986). This involves plotting, for each pair of measurements, the difference between the two scores (on the vertical axis) against the mean of the two scores (on the horizontal axis).

A pair of tramlines, called the 95 per cent limits of agreement, is drawn a distance of two sd above and below the zero difference line (where sd = standard deviation of the differences). If all of the points on the graph fall between the tramlines, then agreement is 'acceptable'; the more points there are outside the tramlines, the poorer the agreement. Moreover, the spread of the points should be reasonably horizontal, indicating that differences are not increasing (or decreasing) as the values of the two variables increase.

[Figure 16.1 near here. Vertical axis: HP − ABPM (DBP; mmHg), from −30 to +30; horizontal axis: (HP + ABPM)/2 (DBP; mmHg), from 60 to 120. On-chart annotations: 'The 95 per cent limits of agreement are drawn 2 s.d.s either side of the zero difference line (s.d. = standard deviation of the difference between the two measures) . . . and for reasonable agreement, most of the points should lie between them.']

Figure 16.1 A Bland-Altman chart to measure agreement between two metric continuous variables; diastolic blood pressure as measured by patients at home with a cuff-measuring device (HP), and as measured by the same patients using an ambulatory device (ABPM). Reproduced from Brit. J. General Practice, 48, 1585–9, courtesy of the Royal College of General Practitioners
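To make the construction concrete, here is a minimal matplotlib sketch of such a chart (the paired readings below are hypothetical, not the Brueren et al. data):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired diastolic blood pressure readings (mmHg).
hp = np.array([92.0, 85.0, 101.0, 78.0, 88.0, 95.0, 83.0, 90.0])
abpm = np.array([95.0, 88.0, 99.0, 84.0, 93.0, 97.0, 88.0, 94.0])

diffs = hp - abpm            # differences, plotted on the vertical axis
means = (hp + abpm) / 2      # means, plotted on the horizontal axis
sd = diffs.std(ddof=1)       # standard deviation of the differences

plt.scatter(means, diffs)
plt.axhline(0, color="grey")               # zero-difference line
plt.axhline(diffs.mean(), color="black")   # mean difference (any bias shows here)
# Tramlines 2 sd either side of zero, as described above (many authors
# centre the limits on the mean difference instead).
plt.axhline(2 * sd, linestyle="--")
plt.axhline(-2 * sd, linestyle="--")
plt.xlabel("(HP + ABPM)/2 (DBP; mmHg)")
plt.ylabel("HP - ABPM (DBP; mmHg)")
plt.show()
```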

An example from practice

The idea is illustrated in Figure 16.1, for agreement between two methods of measuring diastolic blood pressure (Brueren et al. 1998). In this example, there are only a few points outside the ±2 standard deviation tramlines, and the spread of points is broadly horizontal. We would assess this chart as suggesting reasonably good agreement between the two methods of blood pressure measurement.

The continuous horizontal line across the middle of the chart represents the mean of the differences between the two measures. Note that this lies below the zero mark, indicating some bias between the measures: on the whole, the ABPM values appear to be greater than the HP values.

To sum up, two variables that are in reasonable agreement will be strongly associated, but the opposite is not necessarily true. The two measures are not equivalent; association does not measure agreement.

VIII

Getting into a Relationship