Principles of Medical Statistics (Feinstein, 2002)

zone, the results are too “iffy” for diagnostic decisions, and a more confident diagnosis would require further information.

TABLE 21.6

Trichotomous Clinical Summary of Results in Table 21.4

ST Segment                 Number of             Probability   Positive     Negative
Depression                 Cases     Controls    of Disease    Likelihood   Likelihood
                                                               Ratio        Ratio

≥ 2.5 mm.                   46         0           1.00
≥ 0.5 mm. but < 2.5 mm.    101        97            .51          1.04          .96
< 0.5 mm.                    3        53            .054          .057        17.5

TOTAL                      150       150            .50          1.00

The simple trichotomous approach is particularly easy, effective, and commonly used by clinicians. The approach requires almost no mathematical adjuncts or calculations. Its main disadvantage is that clinical “common sense” may not be cherished or useful for obtaining grants and writing publishable papers.
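The entries in Table 21.6 follow directly from the case and control counts. A minimal sketch (in Python, which the text itself does not use; the zone labels are paraphrased) computes each zone's probability of disease as cases/(cases + controls) and each zone's likelihood ratio as the proportion of cases in the zone divided by the proportion of controls:

```python
# Per-zone summary for a trichotomous marker, using the counts in Table 21.6.
# Probability of disease in a zone = cases / (cases + controls); the zone's
# likelihood ratio = (cases / total cases) / (controls / total controls).

zones = {                      # zone label: (cases, controls)
    ">= 2.5 mm":       (46, 0),
    "0.5 to < 2.5 mm": (101, 97),
    "< 0.5 mm":        (3, 53),
}
T = sum(cases for cases, _ in zones.values())        # 150 total cases
S = sum(controls for _, controls in zones.values())  # 150 total controls

for label, (cases, controls) in zones.items():
    p_disease = cases / (cases + controls)
    lr = (cases / T) / (controls / S) if controls else float("inf")
    print(f"{label:>16}: P(disease) = {p_disease:.3f}, LR = {lr:.3f}")
```

With no controls in the top zone, the likelihood ratio there is effectively infinite, which is why Table 21.6 leaves that cell empty.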

21.4 Stability of Indexes

Regardless of whether the diagnostic indexes come from direct, Bayesian, ROC, likelihood-ratio, or even “clinical judgment” methods, the indexes can be quantitatively unstable if derived from small numerical components. Their stability can be appraised with the same confidence-interval methods used for proportions and for ratios. If a sensitivity or specificity value is derived from a proportion such as p = r/n, a 95% confidence interval can be determined from an appropriately chosen binomial distribution or from the Gaussian calculation of p ± 1.96 √(pq/n), where q = 1 − p.
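The Gaussian calculation can be sketched directly; the counts below (46 positives among 50 cases) are illustrative, not taken from the text:

```python
from math import sqrt

def gaussian_ci(r, n, z=1.96):
    """Gaussian 95% confidence interval for a proportion p = r/n:
    p +/- z * sqrt(p*q/n), with q = 1 - p."""
    p = r / n
    q = 1 - p
    half = z * sqrt(p * q / n)
    return p - half, p + half

# Hypothetical example: sensitivity estimated from 46 positives in 50 cases.
lo, hi = gaussian_ci(46, 50)
print(f"sensitivity = {46/50:.2f}, 95% CI {lo:.3f} to {hi:.3f}")
```

For small n or proportions near 0 or 1, the binomial (exact) interval mentioned in the text would be preferable to this Gaussian approximation.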

Being derived as a ratio of two proportions, likelihood ratios require more complex methods. In any selected zone, the 95% confidence interval for the likelihood ratio can be calculated, according to Simel et al.18, as

exp [ln(p1/p2) ± 1.96 √( q1/(p1n1) + q2/(p2n2) )].

[21.16]

The “exp” symbol in this formula is a typographically easy way of writing “e to the power of”; for example, exp(w) is e^w. The value of p1 is the analog of sensitivity, formed in a particular row by the proportion of (positive results in cases)/(total no. of cases), with q1 = 1 − p1; and p2 is the analog of 1 − specificity, formed by the proportion of (positive results in controls)/(total no. of controls), with q2 = 1 − p2. According to the symbols developed at the beginning of Section 21.3.3, each p1 = ti/T and each p2 = si/S.
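Formula [21.16] can be sketched in Python, using the middle zone of Table 21.6 (101/150 cases vs. 97/150 controls) as the worked example; the function name is my own:

```python
from math import exp, log, sqrt

def lr_confidence_interval(t, T, s, S, z=1.96):
    """95% CI for a zone's likelihood ratio, per formula [21.16] (Simel et al.):
    exp[ ln(p1/p2) +/- z * sqrt(q1/(p1*n1) + q2/(p2*n2)) ],
    where p1 = t/T (zone's proportion among cases) and p2 = s/S (among controls)."""
    p1, p2 = t / T, s / S
    q1, q2 = 1 - p1, 1 - p2
    half = z * sqrt(q1 / (p1 * T) + q2 / (p2 * S))
    center = log(p1 / p2)
    return exp(center - half), exp(center + half)

# Middle zone of Table 21.6: likelihood ratio 1.04.
lo, hi = lr_confidence_interval(101, 150, 97, 150)
print(f"LR 95% CI: {lo:.2f} to {hi:.2f}")
```

Because the interval for this zone includes 1, the zone's results are diagnostically uninformative, consistent with the “iffy” middle zone of the trichotomous summary.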

Confidence intervals are useful not only for indicating the possible “range” of the sensitivity/specificity and likelihood-ratio indexes, but also for considering weak numerical strength as an explanation for situations in which a marker test did not yield the expected high values of efficacy.

A recent proposal19 that chance corrections be applied to indexes of diagnostic efficacy, analogous to the kappa coefficient used for indexes of agreement, does not yet seem to have evoked suitable evaluations.

21.5 Combinatory and Multivariable Methods

In all the discussion so far, the marker result came from a single diagnostic test. Sometimes, however, the marker can come from a combination of multiple tests or variables.

© 2002 by Chapman & Hall/CRC

21.5.1 Simultaneous Tests

Because a single marker test may not be adequate, several marker tests can be combined simultaneously or ordered in a sequence that is prompted by results of previous tests. For example, a composite “dipstick” marker for urinary tract infection may contain two tests, not one. The efficacy of the dipstick marker can then be evaluated for positive results either in both component tests or in only one of the two.

21.5.2 Sequential Tests

A “battery” of tests is often ordered all at once to save time in hospitalized patients. In ambulatory situations, however, the same (or a smaller) set of marker tests may be ordered sequentially in an ad hoc manner. The sequential contribution of each test can be assessed from incremental changes in the likelihood ratios, posterior probabilities, or other indexes of diagnostic efficacy that existed before the additional results were obtained.

Despite the apparent efficacy when used alone, a particular individual test may have unimpressive incremental efficacy when added to other tests. For example, in a discussion of ECG-Tc 99m exercise tests after myocardial infarction, Staniloff et al.20 and later Ladenheim et al.21 decried “the incremental information boondoggle: when a test seems powerful but isn’t.” In analogous comments about the merit of electrophysiologic testing after myocardial infarction, Goldman22 lamented the absence of proof for a “significant, incremental prognostic value.”

21.5.3 Multivariable Analyses

The results of diverse accompanying variables (for demographic, clinical, co-morbid and other pertinent features) can be entered along with the results of the marker test in a multivariable analysis that develops a statistical model for estimating the probability of a particular disease in a particular person. This tactic eliminates all of the “bivariate” statistics devoted to sensitivity, specificity, ROC curves, and likelihood ratios.

Diverse mathematical methods can be used. The multiple variables can be combined with logistic regression,23 with discriminant function analysis,24 in a simple point score system formed from regression coefficients,25 or in an algorithmic succession of categories.26 The analytic methods can also incorporate27 a “cost” or “regret” matrix that gives suitable weights to “partial” agreements or disagreements. Although the multivariable approaches have been enthusiastically advocated, their advantages have not yet been well documented.

21.6 Conformity in Laboratory Measurements

The conformity of tests in modern laboratories is commonly assessed with activities called quality control. Aimed at mensurational accuracy rather than diagnostic efficacy, the assessments are intended to find and repair disagreements when the same specimen is tested repeatedly in the same laboratory and when the laboratory’s results are compared with those of a reference laboratory. For these appraisals of dimensional data, the results can be cited either for variability in agreement or for accuracy in comparison with a reference standard.

In a graphic plot of values for pairs of measurements, a rectilinear regression coefficient of 1 and an intercept of 0 (denoting a straight line slope of 45°, passing through the origin of the graph), would indicate that the two methods on average produce the same result. This apparently excellent index can often be achieved, however, despite considerable disagreement in the individual pairs of measurements. An example was shown earlier in the appraisal of concordance for data set D in Figure 20.1. For this and other reasons noted in Chapter 20, regression coefficients or the Pearson correlation coefficient are not a good way to express agreement, but they still continue to be used.28


The best alternative index has not yet received a unanimous consensus, but the most commonly recommended method today is the examination of pairwise increments, as discussed in Section 20.7.1. Other procedures regarded as less desirable are checking the standard deviation of the residual error around the regression line29 or calculating the intra-class correlation coefficient (ICC).
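One common way to examine pairwise increments is to take, for each specimen, the difference between the two measurements and then summarize the mean difference (the bias) with a ±1.96 SD band around it. The paired values below are invented purely for illustration:

```python
from statistics import mean, stdev

# Paired measurements of the same specimens by two methods.
# These numbers are made up for the sketch.
method_a = [4.1, 5.0, 6.2, 7.1, 8.3, 9.0]
method_b = [4.0, 5.3, 6.0, 7.4, 8.1, 9.2]

diffs = [a - b for a, b in zip(method_a, method_b)]
d_bar = mean(diffs)                    # average disagreement (bias)
sd = stdev(diffs)                      # spread of the disagreements
lower, upper = d_bar - 1.96 * sd, d_bar + 1.96 * sd
print(f"mean difference = {d_bar:.3f}; limits {lower:.3f} to {upper:.3f}")
```

Unlike a correlation or regression coefficient, this summary directly exposes how far the individual pairs disagree, which is the point made for data set D in Figure 20.1.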

21.7 Spectral Markers

In contrast to a diagnostic marker, which separates a particular disease from all other medical entities of health or illness, a spectral marker usually indicates the status of persons within the spectrum of a particular disease or condition. Thus, a diagnostic marker would denote that a patient has (or does not have) cancer of the colon. A spectral marker would denote that the cancer is in Stage I (or some other stage). The distinction is shown in Figure 21.4.


FIGURE 21.4

Diagnostic and spectral markers. In the figure on the left, a diagnostic marker test is intended to discriminate between disease D and all other conditions in the clinical universe. In the figure on the right, a spectral marker test is intended to discriminate among different portions (such as stages I, II, III, and IV) of the spectrum of disease D.

A spectral marker can be used for many clinical decisions — such as etiology, prognosis, choice of therapy, changes of therapy, or reassurance — other than diagnosis alone. For example, the estrogen receptor test has been used as a spectral marker in estimating prognosis and choosing therapy for patients with breast cancer. The carcinoembryonic antigen (CEA) test, which was introduced as a diagnostic marker for colon cancer, has now been relegated to being a spectral marker, denoting whether metastasis has occurred. In the “staging” role, a spectral marker can sometimes be used in post-therapeutic monitoring to denote transitions in clinical condition. Thus, after removal of a colon cancer, the CEA test may be repeatedly checked to determine whether the cancer has recurred.

In addition to roles in prognosis, therapeutic choices, and post-therapeutic monitoring for a particular disease, spectral markers can denote the “severity” of either a disease or a nonspecific clinical condition. For example, the APACHE index30 contains a combination of laboratory tests used to indicate the severity of acute illness for patients in an emergency or intensive-care setting.

21.7.1 Use of Spectral Marker without “Gold Standard”

Just as a “gold standard” is used to evaluate the accuracy of a diagnostic marker, an analogous reference criterion can be used for a spectral marker. For example, if a CEA result indicates metastasis, the gold standard is anatomic evidence of the presence (or absence) of metastasis, obtained via imaging, biopsy, surgical inspection, or autopsy. If a laboratory test shows that a bacterium is sensitive to a particular antibiotic, the “gold standard” is (probably) the patient’s post-therapeutic response to that antibiotic.

In many instances, however, a direct “gold standard” does not exist; and the spectral marker result becomes the main data used for a decision that must be evaluated some other way. This situation commonly arises when the results of a monitoring test (such as level of serum lithium or intra-ocular


pressure) are used to change or adjust therapy; when a scan of the brain in a patient with stroke is used to assure the patient, family, or physician that a surgically remediable lesion is absent; or when indexes of “severity” are proposed for diverse purposes.

21.7.2 Pre-Spective vs. Post-Spective Expressions

A common statistical problem in expressing results of spectral markers (and other variables) is the use of “backward” summaries for results that have “forward” implications. Suppose serum bilirubin levels on admission to the hospital are examined as possible predictors of hepatic encephalopathy. Using a case-control approach, the investigators assemble a case group, whose members have developed encephalopathy, and a control group, whose members have not. The bilirubin levels in the two groups are then summarized, perhaps as means and standard deviations, and then compared. If the results show a significantly higher average level in the encephalopathy group, the investigators may conclude that an elevated bilirubin predisposes to encephalopathy. This conclusion may be correct, but is useless for future application, because it offers no “predictive” information about levels of risk. The results were cited “post-spectively” as baseline values per outcome events, rather than “pre-spectively” as outcome events per baseline values. In a pre-spective citation, the bilirubin values would be demarcated into levels such as 0–1.9, 2.0–3.9, and ≥ 4.0. A rate of occurrence for encephalopathy might then be cited for each level, with expressions such as 0/100, 1/35, and 4/15.
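The pre-spective citation is simple to compute once the demarcations are chosen; a sketch using the illustrative 0/100, 1/35, and 4/15 counts from the text:

```python
# "Pre-spective" citation: rate of the outcome per demarcated baseline level.
# Counts match the illustrative 0/100, 1/35, and 4/15 figures in the text.
levels = {                 # bilirubin zone: (encephalopathy events, persons)
    "0-1.9":   (0, 100),
    "2.0-3.9": (1, 35),
    ">= 4.0":  (4, 15),
}
for zone, (events, n) in levels.items():
    print(f"bilirubin {zone:>7}: {events}/{n} = {events / n:.3f}")
```

The output cites outcome events per baseline level, the direction needed for prediction, rather than summarizing baseline values within the outcome groups.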

Investigators may be reluctant to use these citations mainly because they require establishing levels of demarcation — a more difficult task than simply calculating the “post-spective” means and standard deviations. An additional problem is that the rates of occurrence, being obtained from case-control rather than cohort data, do not represent correct values of risk. Nevertheless, such demarcations are regularly used to obtain “dose-response” patterns in case-control data; and odds ratios can be used to avoid the connotation of risks.

The main point is that a “post-spective” citation of summaries for baseline data in the outcome groups does not allow the results to be used in predicting outcomes. For such predictions, the data must be cited “pre-spectively” as outcomes per level of baseline data, not in the reverse manner. This abuse of temporal direction frequently occurs, however.

21.7.3 Problems in Evaluation

The evaluation of spectral markers is a complex challenge that has not yet been fully mastered. The main problem is the delineation of what is being “marked” by the spectral marker. If the CEA test denotes the presence or extensiveness of metastases, its efficacy can be directly checked for that role. If results of upper gastrointestinal endoscopy are used as a marker that affects therapeutic rather than purely diagnostic decisions, the results can be evaluated31 for the changes they produce in previous plans of treatment. If a clinical staging system is used for prognostic predictions, the results can be checked for the associated gradient in outcomes such as survival rates.

If the test is a marker of severity, however, the results cannot be evaluated without a clear definition of what is meant by severity. Does it refer to the amount of medical and nursing care needed for an acute illness, to the anticipated length of stay in an intensive care unit, to the patient’s functional limitations, to the costs of care, or to other “gold standards” such as anticipated length of life, “activity” of an inflammatory disease, size of the heart, or size of a myocardial infarction?

Because clinicians have not yet clearly demarcated both the phenomena to be evaluated and the methods of evaluation, the appraisal of spectral markers is currently in a primordial state, awaiting better research ideas and strategies. The clinical inertia in developing strategy for the evaluation process has often led to methods that lack clinical sophistication. For example, many clinicians do not like the APACHE index of acute severity because it seems to be an arbitrary mathematical pastiche of multiple laboratory variables, having no clearly defined pathophysiologic connotations and excluding the subtle effects of co-morbid conditions. Yet these clinicians have not constructed or offered anything better. The DRG (diagnosis-related-group) system of demarcating categories of illness is perhaps the best illustration of the hazards that occur when clinical investigators avoid the scientific challenge of evaluations that


can improve efficiency and reduce costs. Few clinicians would approve the way that the DRG “marker” system was constructed32,33 — but it has become widely disseminated because of the principle of faute de mieux (lack of anything better), and because of its sponsorship by third-party payers of health care.

21.8 Scientific Problems in Diagnostic Statistics

Despite all the statistical attention, the evaluation of diagnostic marker tests is beset with five major scientific problems that have often made the diagnostic statistics unsatisfactory either for individual patient-care decisions or for policy evaluations of informational technology. The problems are briefly summarized here because the scientific details and illustrations, which can be found in the cited references, are beyond the scope of a mainly statistical text.

21.8.1 Surrogate vs. Other Diagnostic Roles

In the main evaluations thus far, the result of the marker test was used directly to identify the disease as present or absent. The strategies used for these evaluations will seldom be applicable when a test has other diagnostic roles. The result of the test may act as a definitive gold standard (e.g., liver biopsy or glucose tolerance test), as multidiagnostic information pertinent for many diagnoses (e.g., chest X-ray, abdominal ultrasound), as a prerequisite diagnostic demand (e.g., demonstration of Group A streptococcal infection for rheumatic fever34), or as contributory evidence in which a diagnosis is made only when the test’s result is combined with other types of data (e.g., enzymes plus ECG plus clinical history for acute myocardial infarction). A test’s performance in each of these different roles will require indexes different from those that have been developed for surrogate efficacy alone.

21.8.2 Diagnostic Performance

Not all tests are ordered for the same type of diagnostic performance. A discovery test, used to “screen” persons with no symptoms or overt manifestations of disease,35 has a job different from that of an exclusion (or “rule-out”) test, which, when negative, assures that the disease is absent. Conversely, a confirmation (or “rule-in”) test is used to give assurance that the disease is present. Examples of these three distinctions are urinary glucose in screening for diabetes mellitus, a negative echocardiogram to exclude significant cardiac tamponade, and urinary red blood casts to confirm nephronal inflammation. The demands for efficacy will differ with these different functions. A confirmation test needs high specificity, regardless of sensitivity; an exclusion test needs high sensitivity, regardless of specificity; and a discovery test needs both high sensitivity and high specificity to avoid too many false positive and false negative results.

Because of the horizontal-vertical converse reasoning, the nomenclature for these different performances often seems counter-intuitive. A rule-in test for the disease should have high specificity in the nondiseased group, and a rule-out test should have high sensitivity in the diseased group. Beyond these paradoxes in nomenclature, however, the different goals of diagnostic marker tests lead to several statistical paradoxes in evaluating performance.

Being usually invoked when suitable suspicion has been aroused by other evidence, the “rule-in” and “rule-out” procedures need not be splendid in both sensitivity and specificity. A discovery test, however — which may be frequently used because of its convenience — is desirable only if it is excellent in both attributes. If too insensitive, it will fail to discover enough cases, and if too nonspecific, it will yield too many false positives. Thus, the simple discovery test used for “general” screening should preferably have a better performance record than the exclusion and confirmation tests used in more “specialized” clinical circumstances.

Another problem in evaluating a test’s performance is the choice of a suitable “gold standard.” Should a fecal occult blood test be evaluated for its ability to detect blood or to detect colorectal cancer? The test may be excellent for identifying blood as an immediate target, but relatively poor for identifying


cancer as an anatomic source. Similarly, a urine test for protein may be splendid for identifying protein, but less effective at demonstrating renal disease.

Finally, clinicians may create problems by failing to distinguish between the existence of a disease and its causal role in producing a particular manifestation. Because many diseases can exist “silently,” without provoking symptoms or other manifestations,35 the demonstration of a particular diagnosis may identify the disease without supplying an appropriate pathophysiologic explanation for the patient’s overt clinical problems. For example, a patient’s angiogram may show major coronary disease despite a history that is negative for angina pectoris and positive for postprandial pain relieved by antacid. In this instance, the existence of the coronary disease does not offer a pathophysiologic explanation for the pain.

When clinicians do not distinguish between existence and explanation, certain diagnostic tests may lead to unnecessary therapy. For example, because the classical symptoms of functional bowel distress are not explained if “silent” gallstones are found on an abdominal ultrasound examination, the removal of the stones would not be expected to offer enduring relief for the symptoms.

21.8.3 Spectral Composition

The fundamental but often unrecognized problem in all of the case-control statistical indexes and mathematical transformations is that they rest on an erroneous assumption of constancy.36 They assume that sensitivity and specificity, or likelihood ratios, remain the same for any cases of disease and for any control group without disease. This assumption has turned out to be wrong, because the values of the indexes will differ according to demographic, clinical, and/or co-morbid distinctions in the spectrums of patients who constitute the cases and controls.36–40

Consequently, the indexes calculated for diagnostic accuracy will vary not just with prevalence but with the spectral composition of the subgroups of patients who receive the test. This unfortunate fact of clinical reality essentially vitiates all of the splendid mathematical theory that has been developed for diagnostic marker analyses. The best approach, as practicing clinicians have already discovered,10 is to determine diagnostic accuracy for the pertinent collection of patients seen in a particular clinical practice.

The magnitude of this problem in a fixed index of efficacy was quantified by Lachs et al.39 for a composite dipstick marker test used to diagnose urinary tract infection. Among patients with a “high” prior probability of infection — i.e., those with pertinent suspicious symptoms or other clinical manifestations — the dipstick test had sensitivity 0.92 and specificity 0.42. Among patients with a “low” prior probability — i.e., those who lacked the cited clinical manifestations — the sensitivity and specificity varied directly with the degree of pyuria. For three ordinal groups having 0, 1–5, and > 5 leukocytes per microscopic field of spun urine, sensitivity rose progressively from 0.50 to 0.68 to 1.00 and specificity declined progressively from 0.90 to 0.68 to 0.22.

21.8.4 Bias in Data and Groups

Regardless of whether the appraisals are done “vertically” or “horizontally,” the results can be biased by problems in the raw data or in the composition of groups. Because the marker test and the gold standard procedure occur in a sequence, the raw data for the results will not be objective if the interpreter of whichever procedure came second is aware of what was found previously. To avoid this type of review bias, the second procedure should always be examined blindly, without the reviewer knowing the previous results.

In composition of groups, the patients chosen to receive the gold-standard test may not equally represent the spectral composition of all possible candidates. The results of spectrum bias in these choices may then produce indexes distorted by group imbalances that are diversely called36,39–42 work-up, verification, or referral bias.

In ordinary clinical practice, the definitive test may not be ordered for everyone if it seems too costly or possibly hazardous. Accordingly, when diagnostic markers are evaluated from tests done in ordinary clinical circumstances, a definitive result may not be available for many patients, particularly those who


had a negative marker test. In this situation, the best way to avoid “workup” (or “spectrum”) bias is to get surrogate information for the patient’s definitive status. For example, a definitive diagnostic biopsy is almost always done for patients with a “positive” mammogram, but not for those with a “negative” result. Therefore, in a study of mammographic diagnoses, Elmore et al.43 restricted the eligible “negative” patients to those who had had at least three years of follow-up without evidence of cancer and who had another negative mammogram three years later. A simple long-term follow-up showing absence of the suspected disease may sometimes suffice as a suitable “reference standard,”44 without repeating the original marker test.

Yet another problem is the incorporation bias that arises when the result of a marker test is incorporated into the evidence used for the definitive diagnostic conclusion.36 For example, if a serum amylase result is used to make definitive diagnostic decisions about acute pancreatitis, the amylase test is no longer merely a marker. It becomes part of prerequisite evidence and should not be checked for sensitivity and specificity.

21.8.5 Reproducibility

A separate but often overlooked problem refers to issues in reproducibility rather than accuracy. Variability can occur whenever the result or interpretation of a test requires human observation and communication. Nevertheless, checks of intra-personal and inter-personal variability are seldom done for the obviously subjective work of radiologists, histopathologists, and cytopathologists, and also for the less obviously subjective laboratory observations used for such examinations as flocculation, dark fields, and white blood cell differential counts.45

Unless basic reproducibility has been demonstrated, all the subsequent calculations may sometimes resemble an exercise in futility. The indexes of efficacy will have been determined for data whose fundamental reliability is uncertain.

21.9 Additional Challenges

Many additional major challenges, not yet discussed, are prominently available for thoughtful research in an era of proliferating technology, escalating costs, and increasing complaints about “dehumanized” clinical care.

21.9.1 Diagnostic Evaluations and Guidelines

Despite various recommendations for the contents and phases of evaluation, most diagnostic marker tests still come into widespread clinical usage before they have been adequately evaluated. Reid et al.46 have recently shown that the proportion of satisfactory evaluations is rising with time but is still not good. Of 34 marker-test appraisals reported in four leading general medical journals during 1990–1993, more than 50% failed to meet at least 3 of 6 methodologic standards and only 6% complied with all 6 standards.

In view of these basic scientific defects, many clinicians are surprised or appalled47,48 when “guidelines” for using the tests49–52 are issued by prominent clinical organizations. The organizations may hope to do a “preemptive strike,” offering better guidelines than what might otherwise be promulgated by governmental or corporate agencies, but a more fundamental approach would be to convince the public and policy makers that suitable guidelines cannot be constructed because the necessary fundamental research is absent. Arrangements can then be made to carry out the appropriate research.

21.9.2 Non-Diagnostic Roles

As noted earlier, many technologic tests have a critically important role in non-diagnostic decisions, such as estimating prognosis or choosing or monitoring therapy. For example, laboratory tests of an


infectious organism’s “sensitivity” are done to select appropriate antibiotics, and tests of blood (or sometimes urine) levels may be used to monitor treatment with psychotropic or other chemical agents. An MRI scan of lumbar vertebrae may be intended not to diagnose a herniated disc, but to decide therapeutically whether more than one disc must be treated. The availability of various imagings for neoplastic involvement of abdominal (and other) lymph nodes has replaced the surgical explorations that were formerly done as “staging” for choosing treatment of Hodgkin’s or other lymphomatous disease.

These often invaluable clinical contributions of technologic information are neglected in statistical indexes aimed at only diagnostic performance. Consequently, to offer satisfactory appraisal for the total merit of a technologic procedure, new statistical indexes must be developed to account for all of a test’s contributions to diverse clinical decisions, not just diagnosis alone.

21.9.3 Appraising Reassurance

An important but often neglected merit of technologic tests is the reassurance they bring to clinicians, patients, and patients’ families. For example, in most patients with a classical “cerebrovascular accident,” the CT or magnetic resonance imaging (MRI) scan of the head seldom alters the main diagnosis, the estimated prognosis, or the therapeutic plans that would have been made without the scan. Nevertheless, by demonstrating that the patient does not have a surgically remediable lesion — such as a meningioma or subdural hematoma — the imaging provides a relatively risk-free form of important reassurance. In the era before the new images, this reassurance required the horrors of pneumoencephalography or the hazards of carotid arteriography.

Many older clinicians would have given the CT scan a Nobel prize for its role merely in providing risk-free reassurance that a stroke is a stroke. Yet the immense human importance of this reassurance is not currently appraised, or even “valued” enough to be regarded as warranting appraisal.

21.9.4 Automated Observations

The automated observation of images has been successfully applied for diagnosing a single state in an electrocardiogram, and is now being used for differential leukocyte counts53 and for ocular perimetry.54 Efforts55 are now in progress to develop automated image analysis in mammography, lung cancer, cervical cytology, and fine needle aspirates of the breast.

Some of the main statistical challenges in the automated-observation process are to choose both a suitable gold standard for the validation and suitable methods of identifying the image. For example, who is the person to be used as “gold standard” for interpreting a leukocyte differential smear or a mammogram? Should the recognition process be aimed at a direct recapitulation of the image or at a transformed attribute? Thus, in differentiating leukocytes, the automated entity is a histogram of light intensities in a grid that covers each cell, not a direct visual “portrait” of the cell.

21.9.5 “DNA” Diagnostic Markers

In the era of molecular biology, many genetic, oncologic, and other diagnoses are explored with markers that use DNA probes or polymorphism analysis.56–58 Bogardus et al.59 have recently discussed the striking methodologic flaws in many of these studies, including absence of objectivity, failure to check for test reproducibility, and an unsuitable spectrum of case and/or control groups. Better methods might be demanded sooner rather than later to avoid the devastating human and scientific effects of false positive results for genetic “risk” and of erroneous directions in genetic research.

21.9.6 Neural Networks

A technique still in its infancy does multivariable diagnostic analysis with the special pattern recognition methods of a neural network, rather than with mathematical procedures such as logistic regression. An impressive set of results has been reported60 for efficacy of neural-network analysis in diagnosing myocardial infarction among adults presenting to a hospital emergency department; and further work is


now in progress. To be scientifically acceptable, however, the neural-network results will require careful validation in “external” challenge groups. Most of the work reported thus far has been validated only “internally,” in the same group from which the neural-network model was constructed. The credibility of these analytic models will depend on how well they perform when exposed to the challenge of new “unknown” groups, and how well they can identify the most cogent variables.

21.9.7 Constructing “Silver Standards”

When direct testing is not available or possible, a “gold standard” diagnosis can sometimes be obtained by noting the patient’s eventual outcome or by using an authoritative diagnosis made without the marker result. When even these methods cannot supply a “gold standard,” a new statistical strategy has been proposed61 to determine efficacy from repeated observations of the marker test. Because “gold standard” results will inevitably be absent for many test procedures, the construction of a suitable alternative “silver standard” (to substitute for the “gold”) is an intriguing challenge.

References

1. Birkelo, 1947; 2. Yerushalmy, 1947; 3. McNeil, 1975; 4. Wulff, 1981; 5. Sackett, 1991; 6. Moller-Petersen, 1985; 7. Jaeschke, 1994; 8. Finney, 1993; 9. Kempthorne, 1975; 10. Reid, 1998; 11. Feinstein, 1990a; 12. Metz, 1973; 13. Weinstein, 1980; 14. Diamond, 1981; 15. Steen, 1993; 16. Feinstein, 1996; 17. Eisenberg, 1984; 18. Simel, 1993; 19. Brenner, 1994; 20. Staniloff, 1982; 21. Ladenheim, 1987; 22. Goldman, 1991; 23. Coughlin, 1992; 24. Lachin, 1973; 25. Mann, 1983; 26. Brand, 1982; 27. Kodlin, 1971; 28. Barnett, 1979; 29. Cornbleet, 1978; 30. Knaus, 1991; 31. Lichtenstein, 1980; 32. Thompson, 1975; 33. Fetter, 1980; 34. Special Writing Group, 1993; 35. Feinstein, 1967a; 36. Ransohoff, 1978; 37. Rozanski, 1983; 38. Hlatky, 1984; 39. Lachs, 1992; 40. Feinstein, 1985; 41. Begg, 1991; 42. Knottnerus, 1992; 43. Elmore, 1994b; 44. Hull, 1983; 45. Elmore, 1992; 46. Reid, 1995; 47. Jenkins, 1991; 48. Brook, 1989; 49. Griner, 1981; 50. Sox, 1987; 51. Eddy, 1990; 52. Hospital Association of New York State, 1989; 53. Rosvoll, 1979; 54. Katz, 1988; 55. Cancer Letters, 1994; 56. Wiggs, 1988; 57. Malkin, 1990; 58. Lemna, 1990; 59. Bogardus, 1999; 60. Baxt, 1991; 61. Schulzer, 1991.

Exercises

21.1. About 4% of school-aged children in Megalopolis are believed to be physically abused by their parents. The schools in the city might be able to screen all children for evidence of abuse (e.g., scars, cuts, bruises, and burns), with the intent of follow-up by contacting the suspected parents. School and health officials must be very confident of their suspicions before approaching the parents, however, because a great deal of potential harm can be done either by letting an abused child go undetected or by erroneously suspecting an innocent parent. According to school health officials, the physical examination they use is very reliable: it gives positive results in 96% of abused children, and false positive results in only 8% of nonabused children.

a. What is the nosologic sensitivity of the physical examination?

b. What is the nosologic specificity of the physical examination?

c. If the screening program is implemented in Megalopolis schools, what will be the physical examination’s diagnostic sensitivity, i.e., positive predictive accuracy?
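The conditional-probability arithmetic that screening questions of this kind involve can be sketched as a small function. The function name and the sample values below are illustrative assumptions, not part of the chapter's text:

```python
# A minimal sketch of the Bayes ("conditional probability") arithmetic
# that converts prevalence, nosologic sensitivity, and nosologic
# specificity into positive predictive accuracy.

def positive_predictive_accuracy(prevalence, sensitivity, specificity):
    """P(disease | positive test)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Illustration: a 10% prevalence with 90% sensitivity and 90%
# specificity yields a positive predictive accuracy of only 0.5.
```

The key point the sketch makes visible is that the false-positive term is weighted by (1 − prevalence), so a rare condition produces many false positives even from an accurate test.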

21.2. Your hospital has recently acquired a new non-invasive radiologic test to detect deep venous thrombi (DVT). The manufacturer reports impressive test performance data: nosologic sensitivity, 96%; and nosologic specificity, 98%. To evaluate the accuracy of the new test in your hospital, the radiology department invites the first 100 patients suspected of having DVT and evaluated by the new test to receive additional “gold-standard” testing with a lower extremity venogram. This evaluation yields a nosologic sensitivity of 52%, and a specificity of 65%. Cite and briefly discuss at least four possible reasons for the reduction in test performance.

21.3. A new diagnostic test for omphalosis has +LR 10.0, with 95% CI 5.0–20.0. You are seeing a patient whom you suspect of having omphalosis, with a pretest probability ranging from 2 to 20%. Using the nomogram and cited probabilities in Figure E.21.3, indicate the appropriate posttest probabilities. Do you regard this test as diagnostically useful? Please give brief reasons for whatever answer you choose.
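The arithmetic that a likelihood-ratio nomogram performs graphically — posttest odds = pretest odds × likelihood ratio — can be sketched directly. The function and variable names are illustrative:

```python
# Sketch of the odds arithmetic behind a likelihood-ratio nomogram:
# convert pretest probability to odds, multiply by the likelihood
# ratio, and convert the resulting posttest odds back to a probability.

def posttest_probability(pretest_prob, likelihood_ratio):
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# With +LR = 10, a 2% pretest probability rises to about 17%,
# and a 20% pretest probability rises to about 71%.
```

A likelihood ratio of 1 leaves the probability unchanged, which is why the center of a nomogram's likelihood-ratio scale is anchored at 1.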

21.4. A new outpatient test for rapid detection of group A beta-hemolytic streptococci has a reported nosologic sensitivity of 88%. The fine print accompanying the test instructions, however, has the following statement: “95% CI 76–100%.” Given this information, how many patients do you think were initially tested by the manufacturer?
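For orientation, the Gaussian confidence-interval formula quoted earlier in the chapter, p ± 1.96√(pq/n), can be sketched as a small function. The names are illustrative, and an exact binomial interval would be preferable when p is near 0 or 1 or n is small:

```python
import math

# Gaussian (Wald) 95% confidence interval for a proportion p
# observed in n patients: p +/- z * sqrt(p * (1 - p) / n),
# clipped to the admissible range [0, 1].

def proportion_ci(p, n, z=1.96):
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# Example: proportion_ci(0.5, 100) gives roughly (0.402, 0.598).
```

Because the half-width shrinks as √n grows, a wide reported interval is itself a clue that the underlying group was small.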

21.5. One stated advantage of calculating likelihood ratios, as opposed to predictive accuracies, is that likelihood ratios do not depend on group prevalence. Explain why this advantage occurs.
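The prevalence independence can be shown numerically. The test characteristics below are assumed values, chosen only for illustration: sensitivity and specificity are each computed within a single group (diseased or nondiseased), so the positive likelihood ratio, sensitivity/(1 − specificity), contains no prevalence term, whereas predictive accuracy mixes the two groups in proportion to prevalence.

```python
# Numerical illustration: the positive likelihood ratio is fixed by
# sensitivity and specificity alone, but positive predictive accuracy
# shifts sharply with prevalence.

sens, spec = 0.90, 0.90
positive_lr = sens / (1 - spec)        # 9.0 at any prevalence

ppv_at = {}
for prev in (0.01, 0.50):
    true_pos = prev * sens             # P(diseased and test positive)
    false_pos = (1 - prev) * (1 - spec)
    ppv_at[prev] = true_pos / (true_pos + false_pos)

# positive_lr stays 9.0, while the predictive accuracy is about
# 0.083 at 1% prevalence but 0.90 at 50% prevalence.
```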

21.6. Many prominent textbooks and publications have urged that diagnostic evaluations be done with likelihood ratios and/or the conditional probability methods that use sensitivity, specificity, and prevalence to determine rates of “predictive accuracy.” Do you use this approach? If so, please cite and briefly discuss at least three advantages that you have found with it. If you do not use this approach, what are at least three disadvantages that you have noted? What alternative approach do you use or advocate?

21.7. From your clinical background or experience, give an example of a diagnostic marker (not previously cited in the text) that is used mainly as a “rule-in” test and another used as a “rule-out” test. Briefly discuss the reasons that justify the use of each test for the cited purpose.

21.8. From the literature at your disposal, select a study of a diagnostic marker test, and in one or two sentences, outline its basic arrangement. Comment on the selection of case and control groups. If you do not fully approve of the selections, what alternatives would you suggest? Make any other critical comments or architectural suggestions that occur during your review of the study.

[Figure E.21.3 appeared here as a three-scale nomogram: Pretest Probability (left scale), Likelihood Ratio (center scale), and Posttest Probability (right scale).]

FIGURE E.21.3
Nomogram showing relationship of pretest probability and likelihood ratio to form posttest probability. [Taken from Chapter Reference 57.]

21.9. Find a set of diagnostic criteria for any disease in which you are interested. If the criteria were tested for their diagnostic efficacy, comment on how well the test was done. If the criteria have not been tested, outline the procedure you would suggest for this purpose.

21.10. What ideas do you have about how to evaluate (and quantify) reassurance?
