
Principles of Medical Statistics (Feinstein, 2002)

19.1.1. After plotting the graph on any kind of paper you want to use, try to guess (and explain your guess about) the reasons for the statistician’s statement.

19.1.2. Please carry out the computations necessary to determine the regression line, the value of r, and the test of stochastic significance; also draw the line on your graph. Please show all the intermediate computations, so that pathogenesis can be probed if you get a wrong result.

19.2. The data in the scattergraph shown in Figure E.19.2 were recently used to support the claim that “serum lipoprotein(a) levels are elevated in patients with early impairment of renal function” and that the “inverse correlation between serum lipoprotein(a) level and creatinine clearance” points to “decreased renal catabolism as a probable mechanism of lipoprotein(a) elevation in patients with early renal failure.” The study was done in 417 patients, referred consecutively to a hypertension clinic in Italy, who were tested one week after antihypertensive drugs were withdrawn. An abnormal creatinine clearance, defined as < 90 mL/min per 1.73 m², was found in 160 of the 417 patients.

Do you agree with the stated claim? Please justify your reasons.

 

[Figure E.19.2: scatterplot. Vertical axis: log Lipoprotein(a), -1 to 2.5; horizontal axis: Creatinine Clearance, mL/min per 1.73 m², 25 to 225.]

FIGURE E.19.2

Relation between creatinine clearance and log lipoprotein(a). A significant inverse correlation (r = 0.243; P < 0.0001) was seen. [Figure and legend taken from Chapter Reference 24.]
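When weighing a claim of this kind, it can help to convert the two numbers in the legend (r and the sample size) into the proportion of variance accounted for and the corresponding t statistic. The following is only a minimal sketch under stated assumptions: the function name `r_summary` is hypothetical, the sign of r is taken as negative because the legend calls the correlation inverse, and n = 417 assumes that every patient contributed a point to the graph.

```python
import math

def r_summary(r: float, n: int) -> dict:
    """Return r squared, the t statistic for testing rho = 0, and its degrees of freedom."""
    r_squared = r * r
    # Standard conversion of a correlation coefficient to a t statistic.
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r_squared)
    return {"r_squared": r_squared, "t": t, "df": n - 2}

# r and n as quoted for Figure E.19.2 (sign of r assumed negative, see lead-in).
summary = r_summary(r=-0.243, n=417)
print(f"r^2 = {summary['r_squared']:.3f}")
print(f"t = {summary['t']:.2f} on {summary['df']} degrees of freedom")
```

The sketch separates two ideas that the legend merges: the size of t depends heavily on n, whereas r² describes how much of the variation in log lipoprotein(a) is accounted for by creatinine clearance.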

19.3. The text of Chapter 19 contained no comments about one-tailed or two-tailed interpretations for the P values associated with b or r; and most statistical writers and texts do not discuss the subject. Nevertheless, investigators regularly do research with the anticipation that the slope will definitely go up (or down), and that the correlation will be definitely positive (or negative, i.e., inverse). Do you believe that one-tailed interpretations should be allowed when a distinct direction has been specified in advance for the slope or correlation coefficient?

19.4. Here is another opportunity to draw conclusions from the physical examination of graphs. Figure E.19.4 is an exact reproduction of what appeared in a publication²⁵ on “racial differences in the relation between blood pressure and insulin resistance.” The investigators studied “116 Pima Indians, 53 whites, and 42 blacks who were normotensive and did not have diabetes.” The Pima Indians “were recruited from subjects participating in a longitudinal study on the development of non-insulin dependent diabetes. The whites and blacks were recruited by advertising in the local community.” The white and black participants were required to have parents and grandparents who were correspondingly white or black. “Afro-Caribbean and blacks from countries other than the United States were excluded.” The selected groups were shown to be similar in mean age and blood pressure. From the results in Figure E.19.4, however, the authors drew the following conclusions: (1) “the Pima Indians had higher fasting plasma insulin concentrations than the whites or blacks”; and (2) in whites, but not in Pima Indians or blacks, “mean blood pressure ... was significantly correlated with fasting plasma insulin concentration (r = .42) and [with] the rate of glucose disposal during the low dose (r = −0.41) and high-dose (r = −0.49) insulin infusions.” The authors concluded further that “a common mechanism, genetic or acquired, such as enhanced adrenergic tone, or a cellular or structural defect may constitute the link between insulin resistance and blood pressure in whites but not in other racial groups.”

© 2002 by Chapman & Hall/CRC

[Figure E.19.4: three rows of scatterplots, each row with one panel for Pima Indians, Whites, and Blacks. Vertical axis in every panel: Mean Blood Pressure (mm Hg), 50 to 120.
Top row, horizontal axis Fasting Plasma Insulin (pmol/liter), logarithmic scale, 10 to 1000: Pima Indians r = -0.06, P = 0.54; Whites r = 0.42, P < 0.001; Blacks r = -0.10, P = 0.54.
Middle row, horizontal axis Whole-Body Glucose Disposal during Low-Dose Insulin Infusion (mmol/min), 8 to 32: Pima Indians r = -0.02, P = 0.83; Whites r = -0.41, P = 0.004; Blacks r = -0.04, P = 0.79.
Bottom row, horizontal axis Whole-Body Glucose Disposal during High-Dose Insulin Infusion (mmol/min), 18 to 96: Pima Indians r = -0.04, P = 0.65; Whites r = -0.49, P < 0.001; Blacks r = 0.02, P = 0.93.]

FIGURE E.19.4

Relation between Mean Blood Pressure and Fasting Plasma Insulin Concentration (Top Panel) and Insulin-Mediated Glucose Disposal during Low-Dose (Middle Panel) and High-Dose (Bottom Panel) Insulin Infusions in Pima Indians, Whites, and Blacks, after Adjustment for Age, Sex, Body Weight, and Percentage of Body Fat. The differences among the three groups in the slopes of the regression lines between mean blood pressure and the fasting plasma insulin concentration and insulin-mediated glucose disposal during low-dose and high-dose insulin infusions were statistically significant (P = 0.001, 0.017, and 0.025 for Pima Indians, whites, and blacks, respectively). [Figure and legend taken from Chapter Reference 25.]

Considering only the graphic evidence, and avoiding discussion of the proposed biologic or physiologic mechanisms, do you think the investigators’ conclusions are justified?

19.5. The four figures marked E.19.5.1 through E.19.5.4 are taken directly from published medical reports. In each instance, the author(s) claimed that a significant relationship had been demonstrated. Except for methodologic details describing the measurements and groups, the text offered no statistical information about these relationships, beyond what is shown in the figures. (In one instance, r values reported in the text have been added to the figures.)

For Exercises 19.5.1 through 19.5.4, comment on each of these four analyses, figures, and conclusions. Do you think the authors are justified or unjustified? If you do not like what they did, what would you offer as an alternative?

 

[Figure E.19.5.1: two scatterplots with fitted lines. Vertical axis in both panels: DEATHS/100,000 PERSONS/YEAR, 0 to 8; horizontal axis: DOMESTICALLY PRODUCED HANDGUNS (Millions), 0.0 to 3.0.
FIREARM HOMICIDE: Y = 2.34 + 2.06X, r = .913.
FIREARM SUICIDE: Y = 4.48 + 1.06X, r = .937.]

FIGURE E.19.5.1

Handgun availability and firearm mortality: United States, 1946–85. [Figure and legend taken from Chapter Reference 26.]

19.6. Figure E.19.6 shows a scatter-plot of residuals in which the values of $Y_i - \hat{Y}_i$ have been plotted against the values of $\hat{Y}_i$ for a group of 12 bivariate points. What conclusion would you draw from this pattern? If you are unhappy with the problem, what solution would you propose?

19.7. Figure E.19.7 shows the published points and corresponding regression lines for a plot of serum copper vs. pleural fluid copper in three diagnostic categories of patients.

19.7.1. In the text, the authors stated that for the 120 patients with malignant disease the value of r = 0.19 was statistically significant at P < 0.05. Are you convinced that this significance is biologically important? If so, why? If not, why not?

19.7.2. A regression line has been drawn for each of the three cited equations. How can you quickly (by eye test alone) get supporting evidence that these lines correctly depict the stated equation?
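One possible aid for the check requested in Exercise 19.7.2 is to evaluate each stated equation at two convenient values of x and then see whether the drawn line passes through those heights. The sketch below is an illustration only, not the authors' procedure. The intercepts and slopes are those printed in the legend of Figure E.19.7; the helper `predicted_serum_cu` and the choice of x = 0 and x = 160 (the ends of the horizontal axis as read from the figure) are assumptions.

```python
# Intercept and slope of each line, copied from the legend of Figure E.19.7.
equations = {
    "Benign": (130.464, 0.212),
    "Lymphoma": (47.784, 1.148),
    "Other malignant": (119.741, 0.410),
}

def predicted_serum_cu(intercept: float, slope: float, pleural_cu: float) -> float:
    """Height of the line y = intercept + slope * x at x = pleural_cu."""
    return intercept + slope * pleural_cu

# Ends of the horizontal axis (an assumption of this sketch).
x_left, x_right = 0.0, 160.0

endpoints = {}
for group, (a, b) in equations.items():
    y_left = predicted_serum_cu(a, b, x_left)
    y_right = predicted_serum_cu(a, b, x_right)
    endpoints[group] = (y_left, y_right)
    print(f"{group}: y at x={x_left:g} is {y_left:.1f}; y at x={x_right:g} is {y_right:.1f}")
```

The value at x = 0 is simply the intercept, so the first comparison needs no arithmetic at all; the second value checks the slope.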


[Figure E.19.5.2: two scatterplots, NON-OBESE (r = 0.64, p < 0.001) and OBESE (r = 0.74, p < 0.001). Vertical axis: Plasma Free Fatty Acid (µEq/l), 0 to 1200; horizontal axis: Plasma Glucose (mg/dl), 0 to 360. Symbols distinguish NIDDM from Normal.]

FIGURE E.19.5.2

Relationship between fasting plasma glucose and FFA concentrations in nonobese and obese persons with normal glucose tolerance (open circles) or NIDDM (solid circles). [Figure and legend taken from Chapter Reference 27.]

[Figure E.19.5.3: scatterplot titled CORRELATION FOR MABP AND PGIM. Vertical axis: PGIM, ng/g creatinine, 0 to 150; horizontal axis: MEAN ARTERIAL BLOOD PRESSURE (MABP), 110 to 130. Annotations: n = 7, R² = 0.623, r = -0.789, p < 0.05.]

FIGURE E.19.5.3

Correlation of mean arterial blood pressure (MABP) and measured PGI₂ metabolite excretion (PGIM) in seven essential hypertensive patients receiving no medications. Each patient underwent three separate determinations for MABP and PGIM (during control periods 1 and 2 and during the placebo period) for a total of seven independent observations. Regression analysis was performed with the Statistical Analysis System using analysis of variance. [Figure and legend taken from Chapter Reference 28.]


[Figure E.19.5.4: scatterplot. Vertical axis: Interleukin-2, u/ml, 100 to 1100; horizontal axis: SPI, 10 to 60.]

FIGURE E.19.5.4

Interleukin-2 serum level as measured by enzyme-linked immunosorbent assay in 81 sera samples. Data represent the mean of quadruplicate values. [Figure and legend derived from Chapter Reference 29.]

[Figure E.19.6: scatterplot of residuals. Vertical axis: Residuals from the regression, -2.0 to 2.0; horizontal axis: Predicted value of y from the regression line, 0.00 to 6.00.]

FIGURE E.19.6

[Figure and legend taken from Chapter Reference 30.]

[Figure E.19.7: scatterplot with three fitted lines. Vertical axis: Serum Cu, µg/dl, 100 to 300; horizontal axis: Pleural fluid Cu, µg/dl, 0 to 160.
Benign (n = 63): ŷ = 130.464 + 0.212x
Lymphoma (n = 21): ŷ = 47.784 + 1.148x
Other malig. (n = 99): ŷ = 119.741 + 0.410x]

FIGURE E.19.7

Positive relationship between pleural fluid copper and serum copper in the group with malignant disease. [Figure and legend taken from Chapter Reference 31.]


20 Evaluating Concordances

CONTENTS

20.1 Distinguishing Trends from Concordances
20.2 Conformity vs. Agreement
20.3 Challenges in Appraising Agreement
20.3.1 Goals of the Research
20.3.2 Process, Rater, and Observer
20.3.3 Number of Raters
20.3.4 Types of Scale
20.3.5 Individual and Total Discrepancies
20.3.6 Indexes of Directional Disparity
20.3.7 “Adjustment” for Chance Agreement
20.3.8 Stability and Stochastic Tests
20.4 Agreement in Binary Data
20.4.1 Proportion (Percentage) of Agreement
20.4.2 φ Coefficient
20.4.3 Kappa
20.4.4 Directional Problems of an “Omnibus” Index
20.5 Agreement in Ordinal Data
20.5.1 Individual Discrepancies
20.5.2 Weighting of Disagreements
20.5.3 Proportion of Weighted Agreement
20.5.4 Demarcation of Ordinal Categories
20.5.5 Artifactual Arrangements
20.5.6 Weighted Kappa
20.5.7 Correlation Indexes
20.6 Agreement in Nominal Data
20.6.1 Substantively Weighted Disagreements
20.6.2 Choice of Categories
20.6.3 Conversion to Other Indexes
20.6.4 Biased Agreement in Polytomous Data
20.7 Agreement in Dimensional Data
20.7.1 Analysis of Increments
20.7.2 Analysis of Correlation
20.7.3 Analysis of Intraclass Correlation
20.8 Stochastic Procedures
20.9 Multiple Observers
20.9.1 Categorical Data
20.9.2 Dimensional Data

References

Exercises

A striking feature of modern medical statistics has been the relative absence of attention to the scientific quality of raw data. Despite the many methods developed for sampling, receiving, and analyzing data, and for drawing conclusions about importance or “significance,” the scientific suitability, accuracy, and reproducibility of the basic information has not been a major focus of concern.


The contemporary statistical emphasis on inference rather than evidence is particularly ironic because problems of “observer variability” were the stimulus about 150 years ago that made C. F. Gauss develop his “theory of errors” in describing the “normal” distribution of deviations from the “correct” value (i.e., the mean) of a measurement. Analogous challenges in “quality control” for chemical variations in beer at the Guinness brewery were the stimulus almost 100 years ago for W. S. Gosset’s activities that are now famous as the Student t test. After Gaussian and Gossetian theory were established, however, statistical creativity became devoted more to quantity in mathematical variance than to quality in measurement variability.

Until about two decades ago, W. Edwards Deming, a statistician who gave imaginative attention to industrial methods of achieving “quality control,” was generally little known or heralded in English-speaking academic enclaves. Nevertheless, Deming’s methods are particularly famous in Japan, where they were adopted after World War II and became a keystone of Japanese success in developing high-quality products with modern industrial technology.

Because variability in scientific measurement¹ is still distressingly alive, well, and flourishing, the statistical methods of describing the problems have often come from investigators working directly in the pertinent scientific domain. Examining observer variability among radiologists,² J. Yerushalmy, an epidemiologist, devised the indexes of sensitivity and specificity now commonly used in the literature of diagnostic tests. Indexes of biased observation,³ a special chi-square test for observer variability,⁴ and the commonly used kappa coefficient of concordance⁵ were contributed by two psychologists, Quinn McNemar and Jacob Cohen. Statistical methods for describing accuracy and reproducibility in laboratory data have also been developed mainly by workers in that field,⁶ although R. A. Fisher’s intraclass correlation coefficient has sometimes been applied for laboratory measurements. [Fisher originally proposed⁷ the intraclass coefficient, however, to compare results in pairs of brothers, not to analyze diverse measurements of the same entity.]

Lacking the mathematical “clout” of statistical theory, the pragmatic challenges of measurement variability are omitted from many textbooks of medical, biologic, and epidemiologic statistics. In many textbook discussions of “association,” in fact, trends in measurements of different variables are not distinguished from concordances in different measurements of the same entity.

This chapter is intended to outline some of the main distinctions and challenges in statistical appraisals of concordance and to discuss the diverse indexes of concordance that appear in medical literature.

20.1 Distinguishing Trends from Concordances

The idea of interchangeability is what separates concordance from trend. Trends are assessed to determine whether two different variables are co-related, i.e., whether they “go along together.” For concordances, however, the goal is to see whether Variable A can be interchangeably substituted for Variable B. An index of trend may have a high correlation value when derived from the proportionate reduction in group variance for sums of squared errors, but the estimates of $\hat{Y}_i$ may not be good enough to be used as direct substitutes for $Y_i$.

To illustrate the problem, consider the four data sets in the table below. The values of X are the same in each set, but the corresponding values of Y (in the columns marked YA, YB, YC, and YD) represent different methods of measuring X.

X     YA    YB    YC    YD
3     3     7     6     4
4     4     8     8     3
5     5     9     10    6
6     6     10    12    5
7     7     11    14    8
8     8     12    16    7
9     9     13    18    10
10    10    14    20    9


The graphs of the four sets of points and the corresponding fitted lines are shown in Figure 20.1.

If assessed for trend, the regression line fits the points perfectly for each of groups YA, YB, and YC, and the line for Group YD also fits very well. The r values are 1 for the first three lines, and close to 1 for the fourth. Despite the almost identical high marks for trend, however, only the YA line has close agreement; and in the three other lines, the corresponding Y values never agree with those of X. In YB, the values are always 4 units higher; in YC, they are doubled; and in YD, they are alternatingly one unit higher or lower.

None of these disagreements is shown, however, by indexes of trend. The intercepts are different for lines YB and YC, but similar for lines YA and YD; the slopes are identical in lines YA and YB, and nearly the same in YD; and the correlation coefficients are extremely high in all three of the disagreeing lines for YB, YC, and YD. Because the customary statistical indexes of trend are not satisfactory, agreement must be described with a different set of indexes, aimed at appraising concordance.
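The contrast between trend and agreement in the four data sets can be checked numerically. The following is a minimal sketch, not a procedure given in the text: it fits the ordinary least-squares line of each Y column on X, and then reports the mean absolute difference between Y and X as one simple, assumed way of expressing disagreement. The function names `least_squares` and `mean_abs_difference` are hypothetical.

```python
import math

def least_squares(x, y):
    """Return (intercept, slope, r) for the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r

def mean_abs_difference(x, y):
    """Average size of the discrepancy |y - x| between paired values."""
    return sum(abs(yi - xi) for xi, yi in zip(x, y)) / len(x)

# The four data sets from the table in Section 20.1.
x = [3, 4, 5, 6, 7, 8, 9, 10]
data = {
    "YA": [3, 4, 5, 6, 7, 8, 9, 10],
    "YB": [7, 8, 9, 10, 11, 12, 13, 14],
    "YC": [6, 8, 10, 12, 14, 16, 18, 20],
    "YD": [4, 3, 6, 5, 8, 7, 10, 9],
}

results = {}
for name, y in data.items():
    intercept, slope, r = least_squares(x, y)
    disagreement = mean_abs_difference(x, y)
    results[name] = (intercept, slope, r, disagreement)
    print(f"{name}: intercept={intercept:.2f} slope={slope:.2f} "
          f"r={r:.3f} mean |Y - X|={disagreement:.2f}")
```

In this sketch the correlation coefficients stay at or near 1 in all four sets, whereas the mean absolute difference is zero only for YA. An index built on the paired differences therefore separates the sets that r cannot distinguish.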

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

[Figure 20.1: the four sets of points plotted against X (horizontal axis 0 to 10; vertical axis 0 to 20), with fitted lines labeled YA, YB, YC, and YD.]

FIGURE 20.1

Graph of the four sets of data in Section 20.1.

20.2 Conformity vs. Agreement

Although the orientation is either dependent or nondependent for assessing trend, the orientation of a concordance is aimed at either conformity or agreement. In conformity, the goal is to see how closely the observed measurement conforms to a “correct result,” which is available as the “reference,” “criterion,” or “gold standard” value for each measurement. This concept is usually regarded as accuracy, but is best called conformity. The idea of accuracy is pertinent for technologic measurements, but cannot be readily applied to assess other types of discrepancy, such as whether a clinician’s decisions comply with the criteria established by an audit committee. Conformity is also preferable because the “gold standard” may sometimes change. Thus, the “correct” answer to the question in a certifying examination ten years ago may no longer be correct today.

Agreement, however, is assessed without a “gold-standard” criterion. We determine how closely two observations agree, but not whether they are correct. For example, suppose two radiologists independently decide whether pulmonary embolism is present or absent in each of a series of chest films. If no information is available about the patients’ true conditions, we can assess only the radiologists’ agreement. If additional data indicate whether each patient did or did not actually have pulmonary embolism, we can determine each radiologist’s accuracy (i.e., conformity with the correct diagnosis) as well as the agreement between them.
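The radiologic example can be made concrete with a small computation. The sketch below uses invented readings for ten films (all values are hypothetical, as is the helper name `proportion_matching`) and applies the simple proportion of matching ratings in two ways: between the two radiologists, which expresses agreement, and between each radiologist and the patients' true state, which expresses conformity.

```python
def proportion_matching(a, b):
    """Proportion of paired entries in which the two sequences give the same rating."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Hypothetical readings of 10 chest films (1 = embolism present, 0 = absent).
radiologist_1 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
radiologist_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
# Hypothetical "gold standard" result for the same 10 patients.
true_state = [1, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Agreement: the two observers compared with each other, no gold standard needed.
agreement = proportion_matching(radiologist_1, radiologist_2)
# Conformity (accuracy): each observer compared with the gold standard.
conformity_1 = proportion_matching(radiologist_1, true_state)
conformity_2 = proportion_matching(radiologist_2, true_state)

print(f"agreement between radiologists: {agreement:.2f}")
print(f"conformity of radiologist 1: {conformity_1:.2f}")
print(f"conformity of radiologist 2: {conformity_2:.2f}")
```

With these invented data the two observers agree on most films yet differ in how often each is correct, which is why agreement alone cannot stand in for accuracy.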

When pathologists provide “readings” of histologic specimens, we can assess only the agreement of the observers, unless one of the pathologists is accorded the deified status of always being correct. Two pathologists’ readings of cytologic specimens (such as pap smears) would usually be assessed for agreement, but accuracy could also be checked if an appropriate histologic decision were available for each smear.

Although studies of observer variability will indicate disagreements among “equal” observers, studies of conformity are done with a “gold standard.” For laboratory measurements, the gold standard is provided by a selected reference laboratory or a national Bureau of Standards. For the categorical measurements used in certifying examinations or audits of health care, the “gold standard” is provided by an individual expert or consensus of designated experts.


In a particularly common type of conformity research, the gold standard for a diagnostic marker test is the definitive diagnosis of the selected disease. Diagnostic marker tests and other aspects of conformity are appraised in so many different ways that the topic will receive a separate discussion in Chapter 21. The rest of this chapter is devoted to evaluating agreement.

In the specialized jargon developed for describing the ideas, the results of a study of agreement are sometimes called reproducibility, repeatability, reliability, or consistency. The first two terms are inadequate if each rater has given only a single rating, and the third term is unsatisfactory because reliability is an idea that generally connotes “trustworthiness” beyond mere agreement alone. To describe the general concept, consistency is probably the best of the four terms, but fortunately, the assessment of individual (rather than group) agreements is usually called agreement (or concordance). Indexes of agreement or disagreement will be needed for the individual results and total group patterns that can occur in the four main types of rating scales.

20.3 Challenges in Appraising Agreement

Appraising agreement involves a new set of statistical challenges that arise uniquely when discrepancies are noted and summarized for measurements of the same entities.

20.3.1 Goals of the Research

An important consideration before the work is done is to establish the goals of the research. Is it intended to expose and quantify the state of disagreement among the raters, or is the main goal to improve the state of the observational art? In most published studies of concordance, the work seems aimed merely at quantifying disagreement. The investigator does the research, presents the report, and departs in an air of sagacious revelation — but nothing happens thereafter. The disagreement itself becomes exposed, but whatever was causing it is neither discovered nor repaired.

The distinction between “diagnostic” demonstration and “therapeutic” improvement is shown in the names commonly used for the research. When disagreement is investigated for laboratory measurements, the work is usually called quality control. The revealed disagreements are confronted and carefully explored; the methods of measurement are checked and improved; and the eventual improvements elevate the quality of the measurement process. In most investigations of disagreement among clinicians, radiologists, and pathologists, however, the work is usually called observer variability. Little or no effort is made afterward to remove the defects that have been revealed.

This apparent complacency is probably due to the difficulty of arranging suitable analytic confrontations. In laboratory measurements, the procedures can usually be clearly delineated, so that sources of variation can easily be sought among the component steps of the observational process. For clinicians, radiologists, and pathologists, however, the component steps of the procedure are not clearly discerned when the result is merely stated as a rating — such as systolic murmur, 1+ enlargement, or poorly differentiated adenocarcinoma — that emerges from a complex act of observation and decision. To determine component steps and criteria for the observational decisions, the observers must meet together, confront the disagreements, and identify the constituent steps of the process. These analytic confrontations may be difficult to arrange because of problems in getting the observers assembled, and also because many observers may dislike the confrontational process and its possible departures from “diplomacy.”

If the goal is to improve rather than merely document the problems of observer variability, however, the investigator may have to plan for much more than just getting and comparing the stated ratings. If the observers cannot be assembled for direct confrontation, the solicited information should include attention not only to the “final” ratings, but particularly to the intricate components and criteria of the observational process.

20.3.2 Process, Rater, and Observer

If the entity being “measured” is a white blood count, serum calcium, chest film, liver biopsy, answer to an examination question, or clinical decision, the person or apparatus that produces the actual
