
B always being higher than method A — that are promptly shown in direct examination of the increments.

A second problem is the way that the s²I term dominates the value of RI calculated with Formula [20.12]. With large variations in the group of people under study, s²I will have a large value, and RI will be relatively high regardless of how well or badly the raters perform in producing s²o. This distinction gives RI problems analogous to those of kappa, in being greatly affected by the distribution of data in the study group.32
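Formula [20.12] is not reproduced in this excerpt, but the dominance just described can be illustrated with a small numeric sketch in Python, assuming (as an illustration only, which may differ from Formula [20.12]) that RI has the common intraclass form s²I/(s²I + s²o):

    # Illustrative sketch only: assumes R_I = s2_I / (s2_I + s2_o),
    # where s2_I is the between-subject variance and s2_o is the
    # observer (error) variance. Not Formula [20.12] verbatim.
    def r_i(s2_subjects, s2_observers):
        return s2_subjects / (s2_subjects + s2_observers)

    # The raters perform identically (s2_o = 25) in both settings:
    print(r_i(2500, 25))   # heterogeneous group: R_I = 0.99, looks "excellent"
    print(r_i(100, 25))    # homogeneous group:   R_I = 0.80, looks much worse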

For multiple raters rather than two, RI becomes much more complex, because it is constructed in different ways for different analyses. The complexities, which involve components of the analysis of variance, will be discussed in Chapter 29.

Perhaps the greatest deterrent to using RI is the difficulty of understanding its construction and interpretation. Many authors have written about RI, using different formulations, symbols, and interpretations, even for the simple set of two-observer data in Table 20.11. If you intend to use this approach, or want to understand the results, get help from an appropriately knowledgeable and communicative statistician.

Perhaps the last main point to be noted before we leave RI is that it would seem to be most pertinent pragmatically in quality-control studies of laboratory measurements. Nevertheless, RI seldom appears in the literature of laboratory medicine. Perhaps the investigators have already discovered that RI does not offer an optimum approach to the challenges.

20.8 Stochastic Procedures

As noted earlier, the descriptive indexes are almost universally acknowledged as the main entity to be considered in evaluating concordance. Consequently, P values and/or confidence intervals seldom appear unless the group sizes are particularly small. Nevertheless, various chi-square procedures have been applied for stochastic tests, and the McNemar test is particularly well known. The other procedures, which lead to Z tests for kappa and weighted kappa, are briefly mentioned so you will have heard of them.

The McNemar chi-square test warrants special attention because it is regularly used for 2 × 2 tables that express change as well as agreement. To get a stochastic index for the agreement matrix in Table 20.1, McNemar used the following reasoning: under the null hypothesis, the b and c cells can be expected to have equal values, which would be (b + c)/2. A goodness-of-fit chi-square test between the observed and expected values can be calculated as

 

 

X²M = [b − (b + c)/2]²/[(b + c)/2] + [c − (b + c)/2]²/[(b + c)/2]

With suitable algebraic expansion and collection of terms, this expression becomes

X²M = (b − c)²/(b + c)   [20.13]

which can be interpreted in a chi-square table with 1 degree of freedom. A continuity correction can be incorporated to make the working formula become

X²MC = (|b − c| − 1)²/(b + c)   [20.14]


McNemar recommended that the continuity correction be used when (b + c) < 10.
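The arithmetic is simple enough to verify directly. The sketch below, in Python, applies Formulas [20.13] and [20.14] to the discordant cells b and c of a 2 × 2 agreement matrix; the cell counts are hypothetical, and the commented cross-check assumes a recent release of the statsmodels library.

    from scipy.stats import chi2

    def mcnemar_chi2(b, c, continuity=False):
        """McNemar chi-square from the discordant cells of a 2 x 2 table."""
        if continuity:
            x2 = (abs(b - c) - 1) ** 2 / (b + c)   # Formula [20.14]
        else:
            x2 = (b - c) ** 2 / (b + c)            # Formula [20.13]
        return x2, chi2.sf(x2, df=1)               # P value at 1 d.f.

    b, c = 5, 1                                    # hypothetical discordant cells
    print(mcnemar_chi2(b, c, continuity=(b + c) < 10))

    # Cross-check, assuming statsmodels is installed:
    # from statsmodels.stats.contingency_tables import mcnemar
    # print(mcnemar([[19, b], [c, 12]], exact=False, correction=True).statistic)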

The McNemar index and stochastic test, which will reappear later in Chapter 26 when “matched” arrangements are discussed for case-control studies, have been used33 to compare rates of agreement between patients and surrogates about preferences for different forms of life-sustaining therapy.

The conventional X2 test can be applied whenever the descriptive results are expressed either in ordinary (unweighted) proportions of agreement or with the φ coefficient. In a 2 × 2 agreement table, however, the McNemar test is often preferred.

Agreement in polytomous matrixes can be tested stochastically with an extension of the McNemar test, called the Bowker X2 test for off-diagonal symmetry. The test is well described, with a worked example, in the textbook by Sprent.34
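A minimal sketch of the Bowker test follows, assuming a recent release of statsmodels, in which SquareTable.symmetry implements it; the 3 × 3 counts are hypothetical.

    import numpy as np
    from statsmodels.stats.contingency_tables import SquareTable

    table = np.array([[20,  5,  2],        # hypothetical 3 x 3
                      [ 3, 30,  4],        # polytomous agreement matrix
                      [ 1,  6, 25]])
    result = SquareTable(table).symmetry(method="bowker")
    print(result.statistic, result.df, result.pvalue)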

The stochastic procedure for kappa uses its standard error to form either a confidence interval or a Z statistic from which a P value is determined. The formula for calculating the standard error of kappa is shown with an illustrative example in Fleiss.9

Fleiss9 also shows the calculation of a standard error for weighted kappa. The standard error is used for a confidence interval or a Z statistic.
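For readers who want the result without working through the formulas, a hedged sketch follows, assuming a recent statsmodels release in which cohens_kappa returns the estimate, its standard error, and a confidence interval; the 2 × 2 counts are hypothetical. Weighted kappa can be requested through the function's weights argument.

    import numpy as np
    from statsmodels.stats.inter_rater import cohens_kappa

    table = np.array([[40, 10],            # hypothetical agreement matrix
                      [ 5, 45]])
    res = cohens_kappa(table)              # unweighted kappa
    print(res.kappa, res.std_kappa)        # estimate and standard error
    print(res.kappa_low, res.kappa_upp)    # confidence interval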

The paired t test can be used for pairs of dimensional data.

20.9 Multiple Observers

The last topic in this long chapter is the problem of analyzing results from multiple observers. In the many indexes and strategies just discussed, two (paired) ratings are compared for each entity. Sometimes, however, more than two ratings may be available. For example, in studying observer variability in mammography, Elmore et al.20 appraised the diverse readings offered by 10 radiologists for each of 150 sets of mammograms.

The strategy that seems most scientifically sensible and easy to understand is to arrange the multiple ratings into pairs of raters, to calculate indexes of concordance for each pair of raters, and then to determine an average result. Thus, for four raters, we might determine kappa indexes for rater A vs. B, A vs. C, A vs. D, B vs. C, B vs. D, and C vs. D. An overall result, if desired, could be the average (as a median or mean) of the six kappa indexes.
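This pairwise strategy is easy to program. The sketch below, assuming scikit-learn's cohen_kappa_score is available, forms the six pairwise kappas for four hypothetical raters and averages them:

    from itertools import combinations
    from statistics import mean, median
    from sklearn.metrics import cohen_kappa_score

    ratings = {                            # hypothetical binary ratings
        "A": [1, 0, 1, 1, 0, 1, 0, 0],     # of the same 8 entities
        "B": [1, 0, 1, 0, 0, 1, 0, 1],
        "C": [1, 1, 1, 1, 0, 1, 0, 0],
        "D": [0, 0, 1, 1, 0, 1, 1, 0],
    }
    kappas = {pair: cohen_kappa_score(ratings[pair[0]], ratings[pair[1]])
              for pair in combinations(ratings, 2)}   # A-B, A-C, ..., C-D
    print("mean:", mean(kappas.values()), "median:", median(kappas.values()))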

20.9.1 Categorical Data

Because the statistical challenge is irresistible, various proposals have been offered to determine an overall generalized index for m raters, each offering n ratings for a set of categorical data. The methods are discussed and demonstrated by Fleiss.9 Kendall’s coefficient W for associating m sets of rankings is presented and illustrated by Sprent34 and also by Siegel and Castellan.35
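One such generalized index, the m-rater extension of kappa commonly attributed to Fleiss, can be sketched as follows, assuming a recent statsmodels release (aggregate_raters and fleiss_kappa); the ratings themselves are hypothetical, with rows as rated entities and columns as m = 3 raters.

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    ratings = np.array([[1, 1, 1],         # hypothetical ratings:
                        [0, 0, 1],         # n = 5 entities (rows),
                        [1, 1, 0],         # m = 3 raters (columns)
                        [0, 0, 0],
                        [1, 1, 1]])
    counts, _ = aggregate_raters(ratings)  # entity-by-category count table
    print(fleiss_kappa(counts))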

20.9.2 Dimensional Data

For dimensional data, each of the m raters is regarded as a “class,” and the m dimensional values for each of the n rated entities receive a “repeated measures analysis of variance” that leads to the intraclass correlation coefficient RI. The analysis-of-variance strategy used for RI will be discussed in Chapter 29.
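The details are deferred to Chapter 29, but a minimal sketch of one common variant, ICC(2,1) in the Shrout-Fleiss scheme (two-way layout, single ratings), shows the general idea; the 5 × 3 matrix of ratings is hypothetical.

    import numpy as np

    X = np.array([[ 9., 2., 5.],           # n = 5 entities (rows)
                  [ 6., 1., 3.],           # rated by k = 3 raters (columns)
                  [ 8., 4., 6.],
                  [ 7., 1., 2.],
                  [10., 5., 6.]])
    n, k = X.shape
    grand = X.mean()
    ss_total = ((X - grand) ** 2).sum()
    ss_rows = k * ((X.mean(axis=1) - grand) ** 2).sum()   # between entities
    ss_cols = n * ((X.mean(axis=0) - grand) ** 2).sum()   # between raters
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    icc = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    print(icc)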


References

1. Elmore, 1992; 2. Yerushalmy, 1969; 3. McNemar, 1955; 4. McNemar, 1947; 5. Cohen, 1960; 6. Barnett, 1979; 7. Fisher, 1941, pg. 213; 8. Lorenz, 1994; 9. Fleiss, 1981; 10. Cohen, 1977; 11. Burnand, 1990; 12. Landis, 1977; 13. Kraemer, 1979; 14. Feinstein, 1990c; 15. Harvey, 1984; 16. Sheikh, 1991; 17. Maclure, 1987; 18. Cicchetti, 1976; 19. Kramer, 1981; 20. Elmore, 1994b; 21. Elmore, 1997; 22. Reger, 1974; 23. Hourani, 1992; 24. Feinstein, 1970; 25. Fendrich, 1992; 26. Friederici, 1984; 27. Loewenson, 1972; 28. Dyer, 1994; 29. Bland, 1986; 30. Mahalanobis, 1940; 31. Robinson, 1957; 32. Bland, 1990; 33. Sulmasy, 1994; 34. Sprent, 1993; 35. Siegel, 1988; 36. Edmunds, 1988; 37. Saunders, 1980.

Exercises

20.1. Table E.20.1 reports two respiratory measurements with each of two flow meters on 17 subjects. The investigator’s goal was to see whether the more complex Wright flow meter could be replaced with a simpler and easier-to-use mini flow meter. [Data and figures taken from Chapter Reference 29.]

TABLE E.20.1
PEFR Measured with Wright Peak Flow and Mini Wright Peak Flow Meter

               Wright Peak Flow Meter       Mini Wright Peak Flow Meter
Subject      First PEFR    Second PEFR      First PEFR    Second PEFR
              (l/min)        (l/min)          (l/min)       (l/min)
   1            494            490              512            525
   2            395            397              430            415
   3            516            512              520            508
   4            434            401              428            444
   5            476            470              500            500
   6            557            611              600            625
   7            413            415              364            460
   8            442            431              380            390
   9            650            638              658            642
  10            433            429              445            432
  11            417            420              432            420
  12            656            633              626            605
  13            267            275              260            227
  14            478            492              477            467
  15            178            165              259            268
  16            423            372              350            370
  17            427            421              451            443

20.1.1. What would you check to see whether each flow meter yields essentially the same results (i.e., “intra-observer variability”) in its two measurements for each subject? Which flow meter seems inherently more “variable”?

20.1.2. Suppose the investigator, using only the first measurement for each subject, compares the results as shown in Figure E.20.1. For these data, r = .94 with P < .001. [For the questions that follow, use only the first “PEFR” for each method of measurement.]

(a) From visual inspection of the graph, would you be impressed that the high r value shows excellent agreement? If not, why not?

(b) What could you do quantitatively to check the excellence of the agreement?

(c) What would you check to see whether the two measuring systems are biased with respect to one another or to the magnitudes of PEFR? If your check involves calculations, show the results and your conclusions.

FIGURE E.20.1
PEFR measured with large Wright peak flow meter and mini Wright peak flow meter, with line of equality. [Scatterplot: PEFR by Large meter (l/min) on the horizontal axis and PEFR by Mini meter (l/min) on the vertical axis, both scaled 0 to 800.]

20.2. Table E.20.2 shows dodeciles of “sucrose intake” as reported in two questionnaires repeated at a one-year interval.17

20.2.1. Form a 2 × 2 table by dividing the original 12 × 12 table between the 6th and 7th dodeciles. What are the values of proportional agreement and kappa in the 2 × 2 table?

20.2.2. What are the values of ppos and pneg in the 2 × 2 table? Do these results convince or dissuade you of the belief that kappa is a good index of concordance here?

20.2.3. Form a 3 × 3 table by dividing between the 4th and 5th and the 8th and 9th dodeciles. What happens to proportional agreement? Using the categorical-distance method, what is the value of weighted proportionate agreement?

20.2.4. A compassionate instructor saves you from having to slog through the calculations and tells you that weighted kappa (with the categorical-distance method) is 0.46 for the 3 × 3 table. How does this compare with the value previously obtained for the unweighted (2 × 2) kappa? How do you account for the difference?

Comment: In the exercises that follow, you are not expected to do any sophisticated calculations such as kappa. Your conclusions should come mainly from visual inspection and “clinical judgment,” although you should feel free to check minor arithmetical details, such as sums.

20.3. In a paper on open-heart surgery for aortic valve and/or coronary disease in 100 consecutive octogenarians,36 the authors concluded that “operation may be an effective option for … selected octogenarians with unmanageable cardiac symptoms.” Symptoms were classified and tabulated as shown in Figure E.20.3A.

20.3.1. The authors say that 90 patients were in Class IV for either the NYHA or CCS classifications. What is the source of this number and do you agree with it?

20.3.2. Are you satisfied that the patients in Class IV (with either rating scale) have been suitably classified? If not, why not?

20.3.3. In Figure E.20.3B, the authors report the “current functional and ischemic status of the 54 patients who still remain alive” in follow-up durations that are at least one year for all patients and as long as six years for a few. Are you satisfied with the classifications and results offered in Figure E.20.3B? Is there anything else you might like to know to determine which patients were most benefited by the operation, and whether the authors’ claim is justified that the operation is beneficial for “octogenarians with unmanageable cardiac symptoms”?


TABLE E.20.2
Cross-Classification of Subjects by Dodeciles of Sucrose Intake Measured by a Food Frequency Questionnaire Administered Twice, One Year Apart, to 173 Boston-Area Female Registered Nurses Aged 34–59 Years in 1980–1981 [Taken from Chapter Reference 17.]

Second
Questionnaire               First Questionnaire Dodeciles
Dodeciles      1   2   3   4   5   6   7   8   9  10  11  12
    1          7   4   0   1   1   0   1   0   0   0   0   0
    2          3   3   4   0   3   0   0   0   0   0   1   0
    3          1   0   2   2   2   3   1   2   1   0   1   0
    4          1   3   2   3   3   0   1   1   1   0   0   0
    5          0   2   1   2   1   5   0   1   2   0   0   0
    6          0   1   0   3   1   3   3   3   0   0   1   0
    7          1   0   1   0   1   2   3   0   3   1   1   1
    8          0   0   2   1   2   1   0   1   2   5   0   1
    9          1   1   2   0   0   0   1   2   2   1   3   1
   10          0   0   0   1   0   0   4   2   3   1   1   3
   11          0   1   0   0   0   1   0   1   0   5   1   5
   12          0   0   0   1   0   0   0   1   1   2   5   4

 

 

 

 

 

                            N.Y.H.A. Classification
                          I      II     III     IV    total

C.C.S.           I        -      -       3      28      31
Classification   II       -      -       2       8      10
of               III      2      1       2       4       9
Angina           IV      30      6       3      11      50

                 total   32      7      10      51     100

FIGURE E.20.3A
Matrix of Symptoms in All 100 Patients Who Underwent Open-Heart Surgery. Each patient was classified according to both the New York Heart Association (N.Y.H.A.) classifications of functional disability and the Canadian Cardiovascular Society (C.C.S.) classifications of severity of effort angina. Patients who did not have angina were included in C.C.S. Class I, since none of them could exercise strenuously. [Taken from Chapter Reference 36.]

 

 

 

                            N.Y.H.A. Classification
                          I      II     III    TOTAL

C.C.S.           I       33     12       1      46
Classification   II       3      5       -       8
of Angina
                 TOTAL   36     17       1      54

FIGURE E.20.3B
Functional and Anginal Classification of the 54 Living Patients. [Taken from Chapter Reference 36.]

20.4. Figure E.20.4 shows results of a simpler, speedier electrophoresis method than the standard (“modified K-L columns”) method for measuring glycosylated hemoglobin (HbA1). The authors said the new method was “accurate,” but offered no comparative information except what appears in Figure E.20.4 and its legend.

20.4.1. What is the meaning of “s²y·x = 1.04” on the graph?


20.4.2. Do you agree with the authors’ claim that the new method is “accurate”? If not, why not?

20.5. In a study of therapeutic outcome as rated by patients and their psychotherapists, the following frequency counts were reported for 37 patients:

 

                           Rating by Therapist
Rating by Patient     Satisfactory    Unsatisfactory    Total
Satisfactory               19                1            20
Unsatisfactory              5               12            17
Total                      24               13            37

The authors listed the stochastic analytic results exactly as follows:

 

                X2 Value     P Value
Agreement        17.345      < .005
Change            2.667        NS

20.5.1. How do you think this stochastic analysis was conducted? Do you agree with it? If not, what would you propose instead?

20.5.2. What would you do to check whether the therapists were more optimistic than the patients?

 

FIGURE E.20.4
Relation between HbA1 concentrations measured by electrophoresis endosmosis and our own modified Kynoch-Lehmann (K-L) columns (n = 192). HbA1 by electrophoresis = 1.10 (HbA1 by K-L) − 0.64. ( —— Line of identity; - - - regression line.) [Taken from Chapter Reference 37.] [Scatterplot: HbA1, K-L columns (%) on the horizontal axis and HbA1, electrophoresis (%) on the vertical axis, each scaled about 5 to 25; the graph is annotated “s²y·x = 1.04”.]


21

Evaluating “Conformity” and Marker Tests

CONTENTS

21.1 Concepts of Accuracy and Conformity
21.2 Statistical Indexes of Diagnostic Efficacy
21.2.1 Structure of a Decision Matrix
21.2.2 Omnibus Indexes
21.2.3 Problems in Expressing Rates of Accuracy
21.2.4 Mathematical Conversions for Clinical Usage
21.2.5 Bayesian Inference
21.2.6 Direct Clinical Reasoning
21.3 Demarcations for Ranked Results
21.3.1 Binary “Gold Standards”
21.3.2 Receiver-Operating-Characteristic (ROC) Curves
21.3.3 Likelihood-Ratio Strategy
21.3.4 Additional Expressions of Efficacy
21.3.5 Trichotomous Clinical Strategy
21.4 Stability of Indexes
21.5 Combinatory and Multivariable Methods
21.5.1 Simultaneous Tests
21.5.2 Sequential Tests
21.5.3 Multivariable Analyses
21.6 Conformity in Laboratory Measurements
21.7 Spectral Markers
21.7.1 Use of Spectral Marker without “Gold Standard”
21.7.2 Pre-Spective vs. Post-Spective Expressions
21.7.3 Problems in Evaluation
21.8 Scientific Problems in Diagnostic Statistics
21.8.1 Surrogate vs. Other Diagnostic Roles
21.8.2 Diagnostic Performance
21.8.3 Spectral Composition
21.8.4 Bias in Data and Groups
21.8.5 Reproducibility
21.9 Additional Challenges
21.9.1 Diagnostic Evaluations and Guidelines
21.9.2 Non-Diagnostic Roles
21.9.3 Appraising Reassurance
21.9.4 Automated Observations
21.9.5 “DNA” Diagnostic Markers
21.9.6 Neural Networks
21.9.7 Constructing “Silver Standards”

References

Exercises


As medical technology began to burgeon after World War II, the diagnostic accuracy of new procedures and tests required quantitative evaluation. As new therapeutic technology was developed, however, patient care began to offer many scientific challenges beyond diagnostic decisions alone. The additional decisions demanded that clinicians use the technologic information to estimate prognosis, choose and evaluate therapy, and appraise diverse conditions and changes. Nevertheless, most statistical methods for assessing technology have been devoted to the accuracy of diagnostic tests.

After a brief discussion of nomenclature, most of this chapter is devoted to the diagnostic statistical methods. Some of the additional nondiagnostic challenges in clinical care are noted afterward.

21.1 Concepts of Accuracy and Conformity

When the same entity is measured in two ways, the agreement in results can be called accuracy if one result is accepted as the correct entity, which is often called the reference, criterion, or gold standard. For chemical (and other laboratory) measurements, the criterion result may come from the National Bureau of Standards or from a particular laboratory designated as the reference standard. In the usual assessment of accuracy, a tested laboratory’s results for a measurement such as serum calcium are compared against the corresponding values obtained in the reference laboratory.

Many diagnostic activities, however, do not compare two measurements of exactly the same substance. Instead, the results of one variable, such as serum calcium, are used as a marker test to identify (or “predict”) the diagnosis of a disease, such as hypoparathyroidism, that is verified with other methods in the second, or “gold standard,” variable. In other activities, a marker test may be evaluated for efficacy rather than accuracy, because the gold-standard criterion may rely not on a single idea of “correctness,” but on a composite combination of costs, convenience, and consequences for right and wrong answers. The idea of accuracy may itself sometimes be uncertain, because the gold-standard criterion may not have enduring permanence. For example, the radiologic imaging procedure that is today’s “gold standard” might be replaced by a better technique tomorrow.

For all these reasons, the term conformity is often better than accuracy for assessing agreement between an evaluated entity and the accepted criterion. Nevertheless, accuracy is usually applied for tests of diagnostic markers. As a label for the reference criterion, gold standard has also become popular and conventional, despite occasional objections that the fluctuating value of gold is undesirable for an allegedly constant criterion.

This chapter is devoted mainly to quantitative methods of expressing conformity. Although diagnostic marker tests are the main topic, conformity can also be appraised for spectral marker tests, which are used for diverse clinical conditions rather than diseases, and for many clinical decisions beyond diagnosis alone. The mathematical methods often appear in medical literature, but they are beset by important, often overlooked, and currently unresolved scientific problems that are discussed in Sections 21.8 and 21.9.

21.2 Statistical Indexes of Diagnostic Efficacy

The idea of diagnostic efficacy was introduced when scientific and statistical problems arose during “screening” for disease in apparently healthy people. After World War II, searches for tuberculosis were done with photofluorography, a simple, quick procedure that produced much smaller films than the customary “gold standard” chest X-ray. In 1947, observer variability and accuracy in the use of photofluorography were reported1 by a group of physicians working with Jacob Yerushalmy, an epidemiologist. Later that year, Yerushalmy2 introduced the terms sensitivity and specificity, which have subsequently become the “established” statistical indexes for appraising diagnostic performance.


21.2.1 Structure of a Decision Matrix

The results of diagnostic tests are usually expressed in a 2 × 2 table, sometimes called3 a decision matrix, showing frequency counts for the binary results of yes/no for presence of the disease, and positive/negative for the marker test. Table 21.1 resembles all other 2 × 2 tables, but the results are commonly expressed with statistical indexes aimed at diagnostic efficacy. Two of the indexes, as christened by Yerushalmy, were sensitivity and specificity. Sensitivity, which is v = a/(a + c) = a/n1 in Table 21.1, is the proportion of “true positive” results in diseased cases; and specificity, which is f = d/(b + d) = d/n2, is the proportion of “true negative” results in the nondiseased controls. Another common index, called prevalence, is the proportion of diseased cases in the total group under study, expressed as P = n1/N.

TABLE 21.1
Components of Decision Matrix for Diagnostic Marker Tests

Diagnosis Made          Correct (“Gold Standard”) Diagnosis of Disease
from Marker Test        Present               Absent                Total

Positive                a                     b                     m1
                        (true positive)       (false positive)
Negative                c                     d                     m2
                        (false negative)      (true negative)
TOTAL                   n1                    n2                    N

Note: “Sensitivity” = a/(a + c) = a/n1; “Specificity” = d/(b + d) = d/n2; “Positive predictive accuracy” = a/(a + b) = a/m1; “Negative predictive accuracy” = d/(c + d) = d/m2.

These three indexes — sensitivity, specificity, and prevalence — are commonly used in statistical discussions, and are calculated “vertically” from the columns in the table. In making decisions for individual patients, however, clinicians usually want to know the rates of diagnostic accuracy that are shown “horizontally” in the rows of the table. These different directions of interpretation are the source of the major problems to be discussed shortly.
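All of these indexes are simple ratios of the cells in Table 21.1. A minimal sketch in Python (with hypothetical cell counts) makes the two directions of calculation explicit:

    def decision_matrix_indexes(a, b, c, d):
        """Indexes of Table 21.1 from the four cell counts."""
        n1, n2 = a + c, b + d          # column totals: diseased, nondiseased
        m1, m2 = a + b, c + d          # row totals: positive, negative tests
        N = n1 + n2
        return {
            "sensitivity": a / n1,                     # "vertical" indexes
            "specificity": d / n2,
            "prevalence": n1 / N,
            "positive predictive accuracy": a / m1,    # "horizontal" indexes
            "negative predictive accuracy": d / m2,
        }

    print(decision_matrix_indexes(a=40, b=10, c=5, d=45))   # hypothetical counts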

21.2.2 Omnibus Indexes

The direction of interpretation can be avoided with an “omnibus” index, which offers a single summary for a result that otherwise requires two or more separate citations, such as sensitivity and specificity.

For a diagnostic marker test, one omnibus expression, called index of validity, is essentially the same as percentage agreement. In the symbols of Table 21.1,

Index of validity = (a + d)/N

Another omnibus index, called Youden’s J, was suggested as a compensation for error in the “vertical” indexes. The false positive rate in nondiseased people, i.e., b/(b + d), is subtracted from the true positive rate in diseased people, i.e., a/(a + c). When the algebra is developed, Youden’s J turns out to be v − (1 − f) = v + f − 1, which is simply sensitivity + specificity − 1.

The omnibus simplification that combines these two indexes is also a prime defect. It obscures what is needed for two separate clinical decisions: the accuracy of the marker test in diseased and in nondiseased persons. Furthermore, when the test is applied to an “unknown” group, the clinician will want to know “predictive” accuracy separately for positive and negative results. The omnibus indexes are hardly ever used today because the single combined result does not provide the desired information.

21.2.3 Problems in Expressing Rates of Accuracy

The terms sensitivity and specificity were quite appropriate when the original research, designed in a case-control manner, contained groups of cases, who were known to have the disease, and controls, who did not. In the case-control design, these two groups served as denominators of the original statistical indexes, which were a/n1 for sensitivity and d/n2 for specificity.

21.2.3.1 Problems in Nomenclature — The case-control approach, however, led to major conceptual problems that have never been easily resolved. One problem is in nomenclature. Since the opposite of a true positive is a false positive, the latter title might be expected for the additive reciprocal of the “true positive” index of sensitivity. The value of 1 − (a/n1) = (n1 − a)/n1 = c/n1, however, refers to false negative, not false positive, diagnoses for the diseased cases. Similarly, the additive reciprocal of the true negative result for specificity, i.e., 1 − (d/n2), is b/n2, which refers to false positive diagnoses for controls, rather than the intuitively expected idea of false negatives.

To avoid confusion, Henrik Wulff4 proposed using the more precise terms nosologic sensitivity and nosologic specificity for the “vertical” statistical indexes that are calculated nosologically, from cases and controls whose true state of disease is already known. The precise adjectives have generally been omitted, however, and sensitivity and specificity are seldom cited with their nosologic prefixes.

21.2.3.2 Problems in Clinical Direction — A second problem is in the direction of clinical application. The terms sensitivity and specificity, although perhaps satisfactory for a case-control structure, do not indicate what a clinician does with a diagnostic marker test. For persons with unknown diagnoses (in clinical practice), the clinician wants statistical indexes to show the marker test’s rates of accuracy when results are positive or negative. In Table 21.1, these rates would be determined, respectively, as a/m1 and d/m2, not as a/n1 and d/n2.

If the statistical nomenclature were concerned with scientific clinical precision, the “predictive” rates might have been called, respectively, diagnostic sensitivity and diagnostic specificity. An alternative, but longer, pair of designations would have been diagnostic true positive rate and diagnostic true negative rate. The reciprocal values for these diagnostic rates would have been intuitively easy to understand, because each positive or negative rate would have true and false reciprocal components.

Instead, however, bowing to the established case-control definitions and improperly using the term predictive for estimating a concomitant rather than future event, investigators designated the desired clinical rates as positive predictive accuracy (for a/m1) and negative predictive accuracy (for d/m2). Massive ambiguity and confusion can then occur when writers talk about a “false positive rate” or a “false negative rate,” without indicating which denominator is used for the rates.

21.2.3.3 Problems of Prevalence — A third and more profound statistical problem was soon recognized. Because comparative differences are often best demonstrated in contrasts for equal numbers of members, the case and control groups chosen for most diagnostic marker research had roughly similar sizes, i.e., n1 ≈ n2. With this partitioning, the disease had a prevalence of about 50%, i.e., n1/N ≈ .5, in the case-control research. At that level of prevalence, high values of nosologic sensitivity and specificity would be converted into correspondingly high values of predictive diagnostic accuracy.

For example, consider a marker test that has 92% sensitivity and 96% specificity in a group of 50 diseased cases and 50 nondiseased controls. The numerical results, shown in Table 21.2, have relatively high values of .96 for positive and .92 for negative predictive accuracy. The prevalence of the disease in this carefully selected test group, however, is .50 — a situation that seldom occurs in real life, even in tertiary-care hospitals. When the diagnostic marker test is used for screening purposes in the community, the prevalence of disease is substantially lower — at rates of .1, .05, or .001.

Table 21.3 shows the performance of this same diagnostic test when applied to 1,000 persons in a community where prevalence of the disease is .05. The sensitivity and specificity of the test remain the same: .92 in the 50 diseased persons, and .96 in the 950 who do not have the disease. The negative predictive accuracy is even better than before, with a rate of .996 (= 912/916). The positive predictive accuracy, however, falls drastically to a value of .55 (= 46/84), indicating that about one of every two positive results will be a false positive.
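The arithmetic of Tables 21.2 and 21.3 is easy to reconstruct from the fixed sensitivity, specificity, and the two prevalences; the sketch below reproduces the drop in positive predictive accuracy:

    def predictive_values(sens, spec, prevalence, N):
        n1 = prevalence * N                 # diseased persons
        n2 = N - n1                         # nondiseased persons
        a, d = sens * n1, spec * n2         # true positives, true negatives
        b, c = n2 - d, n1 - a               # false positives, false negatives
        return a / (a + b), d / (c + d)     # positive, negative predictive accuracy

    print(predictive_values(.92, .96, .50, 100))    # about .96 and .92 (Table 21.2)
    print(predictive_values(.92, .96, .05, 1000))   # about .55 and .996 (Table 21.3)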

The two omnibus indexes of Section 21.2.2 would not be able to detect this problem. The index of validity will be (46 + 48)/100 = .94 in Table 21.2 and (46 + 912)/1000 = .958 in Table 21.3. The high
