25.4.1 Controversies and Challenges
The meta-analytic strategy is controversial, being hailed by some writers as a majestic scientific advance68,69 and regarded by others as being statistical “tricks,”70 “alchemy for the 21st century,”71 or “shmeta-analysis.”72 Even the name of the process is controversial. Disliking the potential etymologic resemblance to meta-physics, some authors prefer to use “overview,” “pooling,” or “analytic synthesis” to label the work. Because aggregation is the basis of the statistical activities, meta-analysis should be distinguished from other types of re-appraisal, called “methodologic analyses”73 or “best-evidence syntheses,”74 in which the authors review the collection of literature for a particular topic, isolate the few reports that have really high scientific quality, and then draw conclusions from only those reports.
A full discussion of meta-analysis is beyond the scope of the brief “overview” here; interested readers can find many detailed accounts elsewhere.75–77 The main problems and challenges of meta-analysis, however, do not arise from statistical sources. The prime scientific difficulties come from the imprecision produced by pooling evidence that is inevitably heterogeneous, from problems in the judgmental evaluation of acceptable quality in the original reports, and from the “publication bias” that may make the available literature an unsuitable representative of all the research done on the selected topic.
Many published reports of meta-analyses have come from the combined results of randomized clinical trials. The aggregating activities, although not unanimously applauded by clinical trialists,78,79 have been more widely accepted than the increasing appearance of meta-analyses based on epidemiologic and other observational studies that were done without randomization. The value of meta-analysis for “observational” (rather than randomized-trial) research is particularly controversial, and some prominent epidemiologists72,80 refuse to accept the procedure.
25.4.2 Statistical Problems
The main statistical challenges in meta-analysis are the choice of an appropriate index of contrast for expressing the “effect” in each study, the mechanism used for combining individual indexes that may be heterogeneous, and the construction of confidence intervals for the aggregated result.
25.4.2.1 Individual Indexes of Effect — A first step is to choose the particular outcome (or other event) that will be indexed. In some investigations, this event is the credible fact of total deaths, but many other studies use the highly flawed data (see Section 17.2.3) of “cause-specific” deaths.
In a second step, regardless of the outcome entity, its occurrence must be expressed in an index of contrast for the compared groups. In meta-analyses of data for the psychosocial sciences, from which meta-analysis originated, the customary index of contrast is the standardized increment or “effect size” that was discussed in Section 10.5.1. In medical research, however, this index is seldom used, having been supplanted by many other expressions, including the proportional increment (or “relative gain”), the relative risk (or risk ratio), the odds ratio, and the log odds ratio. The main problem with all the latter indexes is that they obliterate the basic occurrence rate to which everything is being referred. The same ratio of 4 will be obtained with the major incremental change of .3 in risk from .1 to .4, and with the relatively trivial change of .00003 from .00001 to .00004.
If the meta-analytic results are to be understood by clinicians or by patients giving “informed consent” to treatment, comprehension can be substantially improved if the results are cited in direct increments or in the number-needed-to-treat expressions that were discussed in Chapter 10.
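As a hedged illustration (the rates are the hypothetical values from the text; the function name is invented for this sketch), a few lines of Python show how ratio-type indexes can hide the underlying occurrence rates, while the direct increment and the number-needed-to-treat expose them:

```python
def contrast_indexes(p_control, p_treated):
    """Return several indexes of contrast for two event rates."""
    increment = p_treated - p_control                  # direct increment
    relative_risk = p_treated / p_control              # risk ratio
    odds_ratio = (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))
    nnt = 1 / abs(increment)                           # number needed to treat
    return increment, relative_risk, odds_ratio, nnt

# Major change: risk rises from .1 to .4 -- risk ratio 4, NNT about 3
print(contrast_indexes(0.1, 0.4))
# Trivial change: .00001 to .00004 -- same risk ratio of 4, NNT above 30,000
print(contrast_indexes(0.00001, 0.00004))
```

Both comparisons yield the same risk ratio of 4, but the number-needed-to-treat differs by four orders of magnitude.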
25.4.2.2 Aggregation of Heterogeneous Indexes — Regardless of which indexes are chosen to express the contrasted effects, the next decision is to choose a method for aggregating the individual indexes. The problem of heterogeneity becomes particularly important here. If contradictory results are indiscriminately combined, the analysts violate both statistical and scientific principles of analysis. Statistically, the combination of opposite results may lead to unsatisfactory combinations and misleading conclusions (as discussed later in Chapter 26). Scientifically, the combination of contradictions will violate the opportunity to use Bradford Hill’s well-accepted requirement of consistency81 (i.e., results of individual studies should almost all go in the same direction) to support decisions about causality.
© 2002 by Chapman & Hall/CRC
In many meta-analyses, however, the heterogeneity of individual indexes is usually ignored, and the aggregated indexes are reported merely as means, medians, “weighted summary,” or “standardized summary” statistics. In one novel approach to the problem, the investigators82 obtained the original data for five trials. In contrast to a conventional meta-analysis, the data were pooled according to previously established “common definitions” that made the results “as homogeneous as possible.” The pooled data allowed a “risk factor analysis not possible with meta-analysis techniques,” although the aggregated risk ratios were essentially the same with both methods.
25.4.2.3 Stochastic Tests and Confidence Intervals — The idea of multiple inference or “multiple comparisons” seldom receives attention in meta-analysis, although the same hypothesis was previously tested repeatedly in the multiple studies that are combined. Nevertheless, studies that have been unequivocally positive in both quantitative and stochastic distinctions (as well as in scientific quality) may seldom require meta-analysis. If the latter procedure is confined to a set of k studies with equivocal individual results, the stochastic test of their combination could be regarded as yet another comparison, for which the Bonferroni correction would use k + 1. The combined numbers may often be large enough so that the final P value remains stochastically significant even if interpreted with α′ = α /(k + 1). The correction might be considered, however, when the final P value (or confidence interval) is quite close to the uncorrected stochastic boundary.
With the multiple-comparison problem generally ignored, perhaps the most vigorous statistical creativity in meta-analysis has been devoted to the stochastic method of reporting “significance” of the aggregated summary. The methods include the following: P value, mean P value, “consensus combined” P, Z-score, mean Z-score, t score, “meta-regression” slopes,83,84 the “support” of a likelihood ratio,85 and Bayesian estimates.86
With the current emphasis on confidence intervals rather than other expressions, new attention has focused on the stochastic aggregation of summary indexes. For the odds ratio, which is a particularly popular index, the customary method of aggregation was the Mantel–Haenszel (M-H) strategy,87 which receives detailed discussion later in Chapter 26. In recent years, however, many statisticians have preferred the DerSimonian–Laird (D-L) strategy.88 The main difference in the two approaches is that M-H uses a “fixed-effects” model, whereas D-L uses a “random-effects” model. With a fixed-effects model, we assume that the available batch of strata (i.e., individual trials) is the “universe,” and the aggregated variance is calculated from individual variances in each stratum. With a random-effects model, we assume that the available strata (or trials) are a sample of a hypothetical larger universe of trials. Accordingly, variation among the strata is added to the individual variances within each stratum. (The dispute about which model to use somewhat resembles the controversy, discussed in Section 14.3.5, about calculating the chi-square test in a fourfold table with the idea of either fixed or non-fixed marginal totals.)
The net effect of the random-effects model is to enlarge the total variance, thereby making stochastic significance, i.e., rejection of the null hypothesis, more difficult to obtain. In general, however, when large numbers of studies are combined, the total group size is big enough so that most of the quantitatively impressive aggregated indexes are also stochastically significant, regardless of whether variance was calculated with a “random” or “fixed” model.
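The two pooling approaches can be sketched in Python. This is a minimal illustration of generic inverse-variance (“fixed-effects”) pooling and the DerSimonian–Laird between-study variance, not the Mantel–Haenszel weighting itself; the three log odds ratios and their variances are invented for the example:

```python
def pool_fixed(effects, variances):
    """Inverse-variance ('fixed-effects') pooling of study effects."""
    w = [1.0 / v for v in variances]
    estimate = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    return estimate, 1.0 / sum(w)              # pooled estimate, its variance

def pool_dersimonian_laird(effects, variances):
    """Random-effects pooling: add between-study variance tau^2."""
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed, _ = pool_fixed(effects, variances)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    k = len(effects)
    tau2 = max(0.0, (q - (k - 1)) / (sw - sum(wi ** 2 for wi in w) / sw))
    w_star = [1.0 / (v + tau2) for v in variances]
    estimate = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return estimate, 1.0 / sum(w_star)         # never smaller than the fixed variance

# Three invented log odds ratios with their within-study variances
log_ors = [0.40, 0.10, 0.65]
variances = [0.04, 0.05, 0.06]
fe, fe_var = pool_fixed(log_ors, variances)
re, re_var = pool_dersimonian_laird(log_ors, variances)
# re_var >= fe_var, so the random-effects confidence interval is wider
```

Because tau^2 can only enlarge the per-study variances, the random-effects variance is never smaller than the fixed-effects variance, which is the enlargement described in the text.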
25.4.3 Current Status
Having been vigorously advocated by prominent statisticians and clinical epidemiologists, meta-analysis is becoming a well-accepted analytic procedure. The Cochrane Collaboration, a special new organization, has been established at Oxford to encourage the process and to promote activities in which clinical trialists at diverse international locations contribute their data for meta-analyses that can produce “evidence-based medicine.”
Like any new procedure that becomes enthusiastically accepted, meta-analysis will doubtlessly undergo an evolutionary course that reduces the initial enthusiasm and eventually demarcates the true value of
the procedure. Some of the reduced enthusiasm has already been manifested in occasional contradictions between the results of large new trials and previous meta-analyses of the same relationship. In one striking recent example, the results of at least five published meta-analyses were contradicted by a subsequent large trial,89 which showed that calcium supplementation during pregnancy did not prevent pre-eclampsia, pregnancy-associated hypertension, or adverse perinatal outcomes. After ascribing the differences to distinctions in dosage, patient selection, and definition of outcomes, the authors concluded that “meta-analyses cannot substitute for large, well-conducted clinical trials.” In a separate analysis, Lelorier et al.90 found that the outcomes of “12 large randomized, controlled trials ... were not predicted accurately 35 percent of the time by the meta-analyses published previously on the same topics.”
Because a prime source of the problems is scientific rather than statistical, arising from the potential fallacies of heterogeneous combinations and the imprecision produced when important clinical variables are omitted from the data, the ultimate role of meta-analysis will depend more on improvements in clinical science than on better statistical methods for analyzing inadequate data.
References
1. Miller, 1966; 2. Dawber, 1980; 3. Cupples, 1984; 4. Hennekens, 1979; 5. MacMahon, 1981; 6. Feinstein, 1981; 7. Hsieh, 1986; 8. Hoover, 1980; 9. Mills, 1993; 10. Pickle, 1996; 11. Am. J. of Epidemiol., Special Issue, 1990; 12. Walter, 1993; 13. Kupper, 1976; 14. Bonferroni, 1936; 15. Duncan, 1955; 16. Dunn, 1961; 17. Dunnett, 1955; 18. Newman, 1939; 19. Keuls, 1952; 20. Scheffe, 1953; 21. Tukey, 1949; 22. Braun, 1994; 23. Holm, 1979; 24. Aickin, 1996; 25. Hochberg, 1988; 26. Simes, 1986; 27. Levin, 1996; 28. Brown, 1997; 29. Thompson, 1998; 30. Goodman, 1998; 31. Westfall, 1993; 32. Thomas, 1985; 33. Rothman, 1990; 34. Savitz, 1995; 35. Cole, 1993; 36. Feinstein, 1985; 37. Genest; 38. Feinstein, 1988a; 39. Cook, 1996; 40. Feinstein, 1998b; 41. Sanchez-Guerrero, 1995; 42. Boston Collaborative Drug Surveillance Programme, 1973; 43. Boston Collaborative Drug Surveillance Program, 1974; 44. Armstrong, 1974; 45. Heinonen, 1974; 46. Vessey, 1976; 47. Royall, 1991; 48. Freedman, 1987; 49. Wald, 1947; 50. Barnard, 1946; 51. Armitage, 1975; 52. Feely, 1982; 53. Guyatt, 1986; 54. Guyatt, 1988; 55. McPherson, 1974; 56. Freedman, 1983; 57. Pocock, 1992; 58. Haybittle, 1971; 59. Peto, 1976; 60. Pocock, 1977; 61. O’Brien, 1979; 62. O’Brien, 1988; 63. Lan, 1982; 64. Davis, 1994; 65. Jefferies, 1984; 66. Viberti, 1997; 67. Souhami, 1994; 68. Chalmers, 1993; 69. Sackett, 1995; 70. Thompson, 1991; 71. Feinstein, 1995; 72. Shapiro, 1994; 73. Gerbarg, 1988; 74. Slavin, 1995; 75. Hedges, 1985; 76. Rosenthal, 1991; 77. Spitzer, 1995; 78. Ellenberg, 1988; 79. Meinert, 1989; 80. Petiti, 1993; 81. Hill, 1965; 82. Fletcher, 1993; 83. Brand, 1992; 84. Greenland, 1992; 85. Goodman, 1989; 86. Carlin, 1992; 87. Mantel, 1959; 88. DerSimonian, 1986; 89. Levine, 1997; 90. Lelorier, 1997.
Exercises
25.1. Find a report of a study in which the authors appear to have used multiple comparisons. Briefly outline the study and what was done statistically. Did the authors acknowledge the multiple-comparison problem and do anything about it (in discussion or testing)? What do you think should have been done?
25.2. The basic argument for not making corrections for multiple comparisons is that an ordinary (uncorrected) P value or confidence interval will answer the question about numerical stability of the data. As long as the results are stable, their interpretation and intellectual coordination should be a matter of scientific rather than statistical reasoning, and should not depend on how many other comparisons were done. Do you agree with this policy? How do you justify your belief? If you disagree, what is your reasoning and proposed solution to the problem?
25.3. With automated technology for measuring constituents of blood, multiple tests (such as the “SMA-52”) are regularly reported for sodium, potassium, calcium, bilirubin, glucose, creatinine, etc. in a single specimen of serum. If 52 tests are done, and if the customary “zone of normal” is an inner-95-percentile range, the chance of finding an “abnormality” in a perfectly healthy person should be
1 − (.95)^52 = 1 − .07 = .93. Consequently, when healthy people are increasingly “screened” with these tests, a vast number of false positive results can be expected. Nevertheless, at many medical centers that do such screening, this “epidemic” of excessive false positive tests has not occurred. What is your explanation for the absence of the problem?
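The arithmetic asserted in the exercise can be checked with a two-line computation (Python used here for illustration):

```python
p_all_normal = 0.95 ** 52          # chance that all 52 tests fall in range
p_any_abnormal = 1 - p_all_normal  # chance of at least one "abnormality"
print(round(p_all_normal, 2), round(p_any_abnormal, 2))   # → 0.07 0.93
```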
25.4. Find a report of a randomized trial that seems to have been stopped prematurely, before the scheduled termination. Briefly outline what happened. Do you agree with the early-termination decision? Offer reasons for your conclusion.
25.5. Have you ever done or been tempted to do an N-of-1 trial, or have you read the results of one? If so, outline what happened. If you have had the opportunity to do one, but have not, indicate why not.
26
Stratifications, Matchings, and “Adjustments”
CONTENTS
26.1 Problems of Confounding
26.1.1 Role of Effect-Modifiers
26.1.2 Requirements for Confounding
26.1.3 Simpson’s Paradox
26.1.4 Scientific Precautions
26.2 Statistical Principles of Standardization
26.2.1 Weighted Adjustments
26.2.2 Algebraic Mechanisms
26.3 Standardized Adjustment of Rates
26.3.1 Identification of Components
26.3.2 Principles of Direct Standardization
26.3.3 Symbols for Components
26.3.4 Principles of Indirect Standardization
26.3.5 Role of “Standardization Factor”
26.3.6 Importance of Standardizing Population
26.3.7 Choice of Method
26.3.8 Clinical Applications
26.3.9 Stochastic Evaluations
26.3.10 Role of Proportionate Mortality Ratio (PMR)
26.4 Standardized Adjustment of Odds Ratios
26.4.1 Weighting for 2 × 2 Tables
26.4.2 Illustration of Problems in “Crude” Odds Ratio
26.4.3 Illustration of Mantel–Haenszel Adjustment
26.4.4 Stochastic Evaluations
26.4.5 Alternative Stochastic Approaches and Disputes
26.4.6 Caveats and Precautions
26.4.7 Extension to Cohort Data
26.5 Matching and “Matched Controls”
26.5.1 Application in Case-Control Studies
26.5.2 Choosing “Matched” Controls
26.5.3 Formation of Tables
26.5.4 Justification of Matched Odds Ratio
26.5.5 Calculation of “Matched” X2
26.5.6 Scientific Peculiarities
26.6 Additional Statistical Procedures
References
Exercises
Beyond the confusing array of rates, risks, and ratios discussed earlier in Chapter 17, epidemiologists use certain strategies that almost never appear in other branches of science. One of those strategies — the “adjustment” of data — is discussed in this chapter.
The adjustment process is seldom needed when an investigator arranges the experiments done with animals, biologic fragments, or inanimate substances. In the epidemiologic research called observational studies, however, the investigated events and data occur in groups of free-living people. Without the advantages of a planned experiment, the raw information may be inaccurate and the compared groups may be biased. The role of “adjustment” is to prevent or reduce the effects of these problems.
26.1 Problems of Confounding
A particularly prominent problem is confounding, which is easy to discuss, but difficult to define. It arises when the results of an observed relationship are distorted by an extrinsic factor. The consequence is that what you see is true, but is not what has really happened. An example of the problem appeared earlier in Section 19.6.4, when a correct statistical association was found between amyl nitrite “poppers” and AIDS. The interpretation that “poppers” were causing AIDS was wrong, however, because the relationship was confounded by an external factor: the particular sexual activity, often accompanied by use of “poppers,” that transmitted the AIDS virus.
Although confounding is commonly used as a nonspecific name for these distortions, the source of the problem can often be identified from biases arising in specific architectural locations for the baseline states, maneuvers, outcomes and other sequential events in the alleged causal pathway for the compared groups.1
26.1.1 Role of “Effect-Modifiers”
The statistical relationship between an outcome and maneuver can be altered by factors called effect modifiers. These factors are not cited (and sometimes not recognized) when the data are merely reported as outcomes for maneuvers, such as success rates for different treatments, or mortality rates for people “exposed” to living in different regions. For example, postsurgical success can be affected by the patient’s preoperative condition, concomitant anesthesia, and postoperative care, not just by the operation itself; and general-population mortality rates are affected by age, not just by region of residence. If the effect-modifiers are equally distributed in the compared groups, bias may be avoided. If the effect-modifiers have a substantially unequal distribution, however, the bias produces confounding.
The relationship between poppers and AIDS was confounded by performance (or co-maneuveral) bias, arising from the initially disregarded sexual activity that accompanied the “maneuver” of using poppers. Performance bias would also occur if results were compared for radical surgery, done with excellent anesthesia and postoperative care, versus simple surgery, done without similar excellence in the “co-therapy.” Confounding can also arise from detection bias if the outcome events are sought and identified differently in the compared groups. “Double-blind” methods are a precaution used in randomized trials to avoid this bias.
A particularly common source of confounding is the susceptibility bias produced when the compared groups are prognostically different in their baseline states, before the compared “causal” maneuvers are imposed. For example, a direct comparison of “crude” mortality rates in Connecticut and Florida would be biased because Florida has a much older population than Connecticut. An analogous susceptibility bias would occur if survival rates are compared for surgical treatment, given to patients who are mainly in a favorable “operable” condition, versus nonsurgical treatment, for patients who are “inoperable.”
26.1.2 Requirements for Confounding
The biases that produce confounding require the concurrence of two phenomena: (1) a particular effect-modifying factor — such as age, “operability,” or certain sexual activity — has a distinct impact that alters the occurrence of the outcome event, such as death, “success,” or AIDS; and (2) the factor has a substantially unequal distribution in the compared groups. If these phenomena occur together, the factor is a confounding variable or confounder. Thus, the relatively good clinical condition that constitutes
“operability” can affect survival (regardless of whether surgery is done), but the variable is not a confounder if it is equally distributed in randomized groups having similar “operability.” In most nonrandomized comparisons of surgical and nonsurgical groups, however, the survival rates are usually confounded by susceptibility bias, because the non-surgical group contains predominantly inoperable patients. Analogously, if the populations have similar age distributions, the general mortality rates would not be confounded in comparisons of Connecticut vs. Massachusetts, or Florida vs. Arizona. If the use of “poppers” were examined in persons who all engaged in the same kind of sexual activity, the activity might no longer be a confounder.
26.1.3 Simpson’s Paradox
A particularly striking form of confounding, called Simpson’s Paradox, occurs when the extremely unbalanced distribution of an effect-modifier makes the total results go in a direction opposite from that found in components of the compared groups.
For example, consider the success results for the two treatments shown in Table 26.1. The success rates for Treatment B are higher than for Treatment A by an increment of 6.8% (= 34.5% − 27.7%), which is a proportionate increase of 22% from the common proportion of 31.3%. The value of X2 is 6.50, so that 2P < .025. The results thus seem stochastically and quantitatively significant in favor of Treatment B.
In the condition being treated, however, prognostic status is an effect modifier, for which the “good risk” group is Stratum I and the “poor risk” group is Stratum II. When examined within these two strata, the results produce the surprise shown in Table 26.2. In each stratum, the results for Treatment A are significantly better than for Treatment B, both quantitatively and stochastically.
The “Simpson’s-paradox” deception in the combined results of Table 26.1 arose from the markedly unbalanced distributions of treatment in the two strata. Treatment B was used predominantly for the good-risk patients of Stratum I, and Treatment A mainly for the poor-risk patients of Stratum II. This
TABLE 26.1
Illustration of Simpson’s Paradox, as Demonstrated Later in Table 26.2

                Success   Failure   Total   Rate of Success
Treatment A       158       412      570        27.7%
Treatment B       221       419      640        34.5%
TOTAL             379       831     1210        31.3%
                X2 = 6.50; 2P < .025
TABLE 26.2
Results in Strata for the Simpson’s Paradox of Table 26.1

                Success   Failure   Total   Rate of Success
STRATUM I
Treatment A        10        10       20        50.0%
Treatment B       216       384      600        36.0%
TOTAL             226       394      620        36.5%
                X2 = 5.12; 2P < .025
STRATUM II
Treatment A       148       402      550        26.9%
Treatment B         5        35       40        12.5%
TOTAL             153       437      590        25.9%
                X2 = 4.03; 2P < .05
type of imbalance might occur if Treatment B is an arduous procedure, such as “radical” therapy, that is reserved for the “healthier” patients, whereas the less arduous Treatment A is given mainly to the “sicker” patients.
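A minimal sketch (Python, using the cell counts from Tables 26.1 and 26.2) confirms the reversal: Treatment A wins inside each stratum, yet the pooled rates favor Treatment B:

```python
# Cell counts (successes, total) from Table 26.2
strata = {
    "I (good risk)":  {"A": (10, 20),   "B": (216, 600)},
    "II (poor risk)": {"A": (148, 550), "B": (5, 40)},
}

def rate(successes, total):
    return successes / total

# Within each stratum, Treatment A has the higher success rate
for arms in strata.values():
    assert rate(*arms["A"]) > rate(*arms["B"])

# Yet the pooled ("crude") rates reverse the verdict: 27.7% A vs. 34.5% B
crude = {}
for arm in ("A", "B"):
    successes = sum(arms[arm][0] for arms in strata.values())
    totals = sum(arms[arm][1] for arms in strata.values())
    crude[arm] = successes / totals
assert crude["B"] > crude["A"]
```

The reversal is driven entirely by the unbalanced allocation: 600 of the 640 Treatment B patients sit in the good-risk stratum.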
26.1.4 Scientific Precautions
The main scientific approach to confounding is to avoid it with suitably planned research. The plans are designed to reduce or eliminate bias when raw data are collected and when groups are formed for comparison. The prophylactic precautions can easily be applied if the investigator, doing the research as an experiment (e.g., a randomized clinical trial), can assign the agents to be compared and can plan the collection of data. In nonexperimental circumstances — such as the work done at national bureaus of health statistics, or in any other form of “retrospective” research — the agents were already received and the data were already recorded for diverse groups before the investigation begins. The prophylactic precautions must then be replaced by remedial adjustments.
One type of adjustment, called matching, is discussed later in Section 26.5. Another method of adjustment is done with multivariable analytic strategies. The most common and traditional form of epidemiologic adjustment is called standardization, which is discussed throughout the next section.
26.2 Statistical Principles of Standardization
The procedure called standardization was devised by demographers to convert the “crude” rate of a group into a “standardized” rate that is adjusted for unbalanced distributions of confounders. The key elements in the adjustment are the outcome rates and constituent proportions for component strata. The group is first divided into strata of different levels for the confounding variable. For example, as an effect-modifier for general mortality, age might be partitioned into such strata as ≤ 29, 30–49, 50–64, 65–79, and ≥ 80. For each stratum, the outcome rate is the occurrence of a selected target event, such as death or success, and the constituent proportion is the fraction occupied by that stratum in the group’s total composition. Thus, in Table 26.2, the success rate is 36.5% in Stratum I, which occupies 620/1210 = .512 of the total population. For Stratum II, the corresponding values are 25.9% for the rate and 590/1210 = .488 for the proportion. For the adjustment process, the rates and proportions in the component strata are given appropriate “weights” and then combined into a single “standardized” result.
In adjusting general mortality rates, we begin with the total “crude” rate, decompose it into pertinent strata, and then weight and recombine the strata suitably. In other situations, when the results are expressed in odds ratios rather than rates, the weighted adjustments and recombinations are applied to stratum components of the ratios. The standardization process can thus be regarded as a statistical form of recombinant DNA: denominator-numerator-adjustments.
26.2.1 Weighted Adjustments
The principles used in weighted adjustment are analogous to a mechanism taught in elementary school for finding the average of two rates (or proportions). For example, if you want to get the average of two rates, .34 and .20, the wrong approach is to add them directly as .34 + .20 = .54, and then divide by 2 to get .27 as the average. This tactic is wrong because the reported rates may come from groups having unbalanced denominators. Thus, if .34 = 51/150 and .20 = 2/10, the correct average result would be the much higher value of (51 + 2) / (150 + 10) = 53/160 = .33. Conversely if .34 = 10/29 and .20 = 96/480, the correct result is the much lower (10 + 96)/(29 + 480) = 106/509 = .21. To add the rates directly and then divide by 2 to produce .27 would be appropriate only if the two denominators had equal sizes, so that .34 = 17/50 and .20 = 10/50. The average would then be (17 + 10)/(50 + 50) = 27/100 = .27.
For the correct approach here, we added the two numerators, added the two denominators, and then divided the two sums. This basic strategy is used later to prepare “adjusted” values for ratios, and is also used for decomposing and then recombining rates.
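The wrong and right approaches can be sketched as follows (Python, with the numerators and denominators used in the text; the function names are invented for the sketch):

```python
def naive_average(r1, r2):
    """Wrong unless the two denominators are equal."""
    return (r1 + r2) / 2

def pooled_rate(t1, n1, t2, n2):
    """Correct: add numerators, add denominators, then divide."""
    return (t1 + t2) / (n1 + n2)

print(naive_average(51/150, 2/10))     # .27, misleading
print(pooled_rate(51, 150, 2, 10))     # about .33 -- the big group dominates
print(pooled_rate(10, 29, 96, 480))    # about .21 when denominators reverse
print(pooled_rate(17, 50, 10, 50))     # .27 -- equal denominators only
```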
26.2.2 Algebraic Mechanisms
To show how rates are weighted, suppose one rate is constructed as r1 = t1/n1 and the other as r2 = t2/n2. With the total group sizes being N = n1 + n2 in the denominators and T = t1 + t2 in the numerators, the “correct” average result is T/N.
The algebraic weighting process recapitulates this result by using the component proportions of the total denominator. They would be p1 = n1/N in the first group and p2 = n2/N in the second. The component proportions and rates for each group are then multiplied and added to produce p1r1 and p2r2. Their sum is p1r1 + p2r2 = T/N. [To prove this point, note that p1 = n1/N and r1 = t1/n1, so their product is p1r1 = t1/N. Similarly, p2r2 = t2/N. The sum of the products is (t1 + t2)/N = T/N.]
The general formula for constructing the average “crude” rate, R, from a series of component groups (or “strata”) is
R = Σ piri = T/N
The values of pi are the “weighting factors” for the rates, ri. In the foregoing examples for two groups, the result was wrong when we ignored weighting factors and calculated R as (r1 + r2)/2, instead of p1r1 + p2r2. In the first example, where N = 150 + 10 = 160, p1 was 150/160 = .9375 and p2 = 10/160 = .0625. In the second example, p1 was 29/509 = .057 and p2 was 480/509 = .943. In the third example, p1 = 50/100 = .5 and p2 = 50/100 = .5. The immediate addition of rates gave the correct answer only in the third example, where p1 = p2 = .5 was the correct weighting factor for their sum. In any other adjustment that merely adds the ri values, the results would be wrong unless the appropriate pi values were used as the weighting factor for each stratum.
The basic arrangement of Σ piri is used for all the conventional epidemiologic adjustments that produce “standardized” mortality rates.
26.3 Standardized Adjustment of Rates
The customary standardization processes, discussed throughout this section, were developed to adjust the rates obtained in cohort or cross-sectional research. For the research structures, each denominator is either the “crude” total or a component stratum containing persons “at risk” for occurrence of the event cited in the numerator. The target rates are constructed as r = t/n for the total group and ri = ti/ni for a stratum. [The entities are cited here with lower-case symbols, because capital letters will be used later for the counterpart entities in a “standard” population.]
In retrospective case-control studies, however, rates of “risk” cannot be directly determined and compared. Instead, the “risk ratio” for contrasting two groups is estimated with an odds ratio, constructed as (a × d)/(b × c) from cross-products of cells in the 2 × 2 table that reflects (but does not show) the actual risks. The complexity of working with four component numbers, rather than two, in each stratum requires that odds ratios be adjusted with a process, discussed later in Section 26.4, different from what is used for rates in the subsections that follow.
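As a sketch of the cross-product calculation (the four cell counts below are invented for illustration):

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio (a*d)/(b*c) for a 2 x 2 table:
    a = exposed cases, b = exposed controls,
    c = unexposed cases, d = unexposed controls."""
    return (a * d) / (b * c)

# Invented counts: 40 exposed cases, 10 exposed controls,
# 60 unexposed cases, 90 unexposed controls
print(odds_ratio(40, 10, 60, 90))   # → 6.0
```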
26.3.1 Identification of Components
Every “target” rate in a group of people can be “decomposed” into stratum proportions and stratum-specific target rates for the components. In a particular group of people whose success rate for Treatment X was 60/200 = 30%, the “decomposition” for old and young persons forms the stratification shown in Table 26.3. In this table, the “crude rate” is the total success rate of 30%, or .30, before stratification. The stratum-specific rates of .40 and .27 are shown directly, but the stratum proportions are not immediately evident. They are determined
TABLE 26.3
Hypothetical Success Rates for Young and Old Patients

Strata    Success Rates
Young     20/50 (40%)
Old       40/150 (27%)
TOTAL     60/200 (30%)
when the denominator, ni, of each stratum is divided by n, the total size of the combined strata. In this instance, n1 = 50, n2 = 150, and n = 200, so that p1 = 50/200 = .25 and p2 = 150/200 = .75.
In general symbols, the data for a group can be partitioned into strata identified with the subscript i, as 1, 2, 3, …, i, …, k, where k is the total number of strata. Each stratum contains a denominator of members, ni, and a corresponding numerator, ti, of focal or target events in those members. The rate of the target event in each stratum will be ri = ti/ni. In Table 26.3, r1 = t1/n1 = 20/50 = .40 and r2 = t2/n2 = 40/150 = .27.
The total number of people in the group will be Σ ni = n; the total number of target events will be Σ ti = t; and the crude rate for the target event will be t/n = Σ ti/Σ ni. In Table 26.3, n = Σ ni = 50 + 150 = 200; t = Σ ti = 20 + 40 = 60, and r (the crude rate) = t/n = 60/200 =.30.
To avoid the sometimes ambiguous symbol, p, the stratum proportions can be symbolized with w (for “weights”) and calculated as wi = ni/n. In Table 26.3, they are w1 = 50/200 = .25 and w2 = 150/200 = .75. Each stratum thus has a proportion, wi, which is ni/n. Each stratum also has a specific rate, ri, calculated as ti/ni, for the target event. As shown earlier in Section 26.2.2, the crude rate is
r = Σ wi ri
Applying this formula to the data in Table 26.3 produces (.25)(.40) + (.75)(.27) = .10 + .20 = .30, which is the observed crude rate.
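The decomposition can be verified computationally (Python, with the counts of Table 26.3):

```python
# (t_i, n_i) for the young and old strata of Table 26.3
strata = [(20, 50), (40, 150)]
n = sum(ni for _, ni in strata)           # 200
crude = sum(ti for ti, _ in strata) / n   # 60/200 = .30
# r = sum of w_i * r_i, with w_i = n_i/n and r_i = t_i/n_i
weighted = sum((ni / n) * (ti / ni) for ti, ni in strata)
assert abs(crude - weighted) < 1e-12      # (.25)(.40) + (.75)(.27) ≈ .30
```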
26.3.2 Principles of Direct Standardization
The wi proportions are the main source of problems when crude rates are compared. If the compared groups have substantially different wi values for the stratum proportions and if the corresponding strata have different target rates, the comparison of crude rates will be distorted by confounding. The imbalance in stratum proportions for baseline susceptibility was the source of the biased comparisons discussed earlier for the geographic regions of Connecticut vs. Florida, for surgical vs. nonsurgical treatment, and for Simpson’s Paradox in Table 26.1.
Unless your cerebrum has been overwhelmed by the mathematics, you may have already thought of a solution for the problem. Because the stratum-specific target rates, ri , are the main results and the wi component values produce the distortion, the imbalance can be “fixed” if the two compared groups are given the same set of wi values. An “adjusted” rate can then be calculated with these similar wi values for the appropriate strata in each group. This solution is exactly what happens in the process called direct standardization. The next main decision is to choose appropriate values of wi.
26.3.2.1 Choice of Standardizing Population — We could get a fair comparison by choosing the “standardizing” wi values from either of the compared groups. Thus, we could standardize the mortality results in Florida by using wi weights from the population of Connecticut, or standardize the Connecticut results with wi values from Florida. To avoid an invidious choice between the two sources, however, the standardizing population is customarily selected in a neutral manner. The usual choice of the reference or standard population (in epidemiologic research) is the composition and associated death rates for an entire national population. The results of the standardizing population can be shown with capital letters of W, R, T, and N, corresponding to the lower-case values of w, r, t, and n in the observed groups.
For example, if the selected standard population contains 10% young people and 90% old people, the standardizing Wi values are W1 = .10 and W2 = .90. If these Wi values are applied to the ri values in Table 26.3, the adjusted rate would be Σ Wiri = (.10)(.40) + (.90)(.27) = .04 + .24 = .28. With the adjustment, the previous crude level of 30% for the success rate would fall to a standardized value of 28%.
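A sketch of the direct standardization just described (Python, using the Table 26.3 stratum rates and the hypothetical standard weights of .10 and .90):

```python
rates = {"young": 20 / 50, "old": 40 / 150}    # observed stratum rates r_i
standard_w = {"young": 0.10, "old": 0.90}      # standard-population weights W_i
# Directly standardized rate: sum of W_i * r_i
adjusted = sum(standard_w[s] * rates[s] for s in rates)
print(round(adjusted, 2))   # → 0.28: crude 30% falls to a standardized 28%
```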
The choice of a standard population is tricky. In clinical situations (where standardization is seldom used), national data are almost never available for such important confounding factors as severity of disease. Accordingly, a standard “clinical” population is usually chosen ad hoc to be the total of the observed groups, irrespective of therapy or other group distinctions. Thus, for the data in Table 26.2, the standardizing weights (as noted in Section 26.2) would be W1 = .512 for Stratum I, and W2 =