Ординатура / Офтальмология / Английские материалы / Using and Understanding Medical Statistics_Matthews, Farewell_2007
.pdf
Tables 21.1 and 21.2 illustrate a complex application of the proportional hazards regression model in the analysis of cohort data. By discussing this example, we have tried to illustrate some of the possibilities for analysis which use of this model facilitates. There are a number of issues which arise in the analysis of cohort data via relative risk regression models which are beyond the scope of this book. In our view, however, these models represent a natural choice for the analysis of epidemiological cohort studies. We hope it is also clear that careful thought must be given to the form of a regression model that is used to analyze epidemiological data, and to the interpretation of the results.
21.4. Odds Ratio Models
In this section we consider the case-control study reported by Weiss et al. [67]. This study identified and interviewed 322 cases of endometrial cancer occurring among white women in western Washington between January 1975 and April 1976. A random sample of 288 white women in the same age range were interviewed as controls. The interviews were used to obtain information on prior hormone use, particularly postmenopausal estrogen use, and on known risk factors for endometrial cancer.
Let Y be a binary variable which distinguishes cases (Y = 1) from controls (Y = 0). Then Y corresponds to the event of endometrial cancer incidence during the study period. If we also define a binary explanatory variable X which is equal to 0 unless a woman has used post-menopausal estrogens for more than one year (X = 1), then we require a statistical model which relates Y to X. A natural choice in this instance is the binary logistic regression model which was introduced in chapter 11. In terms of Y and X, the defining equation for this model is
|
|
|
|
|
|
Pr(Y = 1| x) |
|
|
|
|
|
=a +bx. |
(21.4) |
|
log |
|
|
||
|
||||
|
|
|
|
|
1 Pr(Y = 1| x) |
|
|
||
|
|
|
|
|
As we indicated in §11.3, eb is the odds ratio which compares the odds of being a case for a woman using estrogen to the same odds for a non-user. The probability of cancer incidence for a non-user is equal to exp(a)/{1 + exp(a)}.
In §21.2 we noted that, because the proportions of cases and controls are fixed by the study design, it is impossible to estimate the probability of cancer incidence, which depends on the parameter a, from a case-control study. Nevertheless, if case-control data are analyzed using a model like equation (21.4), then although the estimate of a has no practical value, the estimation of odds ratio parameters such as b can proceed in the usual fashion and provides valid
21 Epidemiological Applications |
268 |
Table 21.3. The results of an odds ratio regression analysis of endometrial cancer incidence in relation to exposure to exogenous estrogens and other factors; the analysis stratifies on baseline age
Risk factor |
Regression |
Definition |
ˆ |
SE |
a |
Significance |
|
b |
|
||||||
|
|
variable |
|
|
|
|
levelb |
Estrogen use |
X1 |
1 if duration of use between |
1.37 |
0.24 |
<0.0001 |
||
|
|
|
1 and 8 years; 0 otherwise |
|
|
|
|
|
|
X2 |
1 if duration of use 8 years |
2.60 |
0.25 |
<0.0001 |
|
|
|
|
or greater; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
||
Obesity |
X3 |
1 if weight greater than |
0.50 |
0.25 |
0.04 |
||
|
|
|
160 lbs; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
||
Hypertension |
X4 |
1 if history of high blood |
0.42 |
0.21 |
0.05 |
||
|
|
|
pressure; 0 otherwise |
|
|
|
|
Parity |
|
X5 |
1 if number of children 2 |
0.81 |
0.21 |
0.0001 |
|
|
|
|
or greater; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
|
|
a |
|
|
ˆ |
|
|
|
|
|
Estimated standard error of b . |
|
|
|
|
||
b Estimated significance level for testing the hypothesis b = 0.
estimates of odds ratios in the population under study. The justification for this claim is beyond the scope of this book, but if readers are willing to accept it at face value, we can proceed with an illustration of how logistic regression models can be used to analyze case-control studies.
In §21.2 we mentioned the importance of age and the possible matching of cases and controls on age in case-control studies. Chapter 5 shows, in the context of 2 ! 2 tables, that matching is a special case of stratification. In particular, pair matching, i.e., selecting one specific control for each case, corresponds to the use of strata of size two. Section 11.3 introduced the stratified version of a logistic regression model in the special case of two strata. A more general version of this stratified model is specified by the equation
|
Pr(Y =1| x) |
|
k |
|
|
|
|
= |
+ |
bi xi , j =1, , J |
(21.5) |
log |
|
|
aj |
|
|
|
|
i =1 |
|
|
|
1 Pr(Y =1| x) |
|
|
|||
|
|
|
|
|
|
where j indexes the strata and the bi’s are the logarithms of odds ratios associated with the covariates in X= {X1, X2, ..., Xk}. As was the case in §11.3, the bi’s are assumed to be the same˜for each stratum.
Table 21.3 presents an analysis of the data reported by Weiss et al. [67] when cases and controls are grouped in strata defined by one-year age inter-
Odds Ratio Models |
269 |
vals. The design of a case-control study usually includes some degree of matching on age; this ensures moderate balance between cases and controls in agedefined strata. Stratification on one-year age intervals is not always possible, but most studies would likely support strata based on five-year age intervals. With these one-year age intervals and women aged 50–74 years, problems would arise if we tried to estimate all 24 of the aj’s. The possibility of problems like this, and a method for avoiding them, was alluded to in §11.3. Thus, table 21.3 is based on a conditional analysis of model (21.5). This method of analysis adjusts for the stratification, but provides estimates of the bi’s without having to estimate the aj’s. We will not explain how a conditional analysis is carried out, but simply assure the reader that it does not alter the interpretation of the estimated bi’s which was presented in chapter 11.
The regression vector, X, for each subject in the case-control study of endometrial cancer includes a˜binary variable, X1, which is equal to zero unless the subject had a duration of use of exogenous estrogen of between one and eight years (X1 = 1), a secondary variable, X2 , which is equal to zero unless the subject had a duration of estrogen use in excess of eight years (X2 = 1), and additional indicator variables for obesity, hypertension and parity. On the basis of calculations which were outlined in §11.3, we conclude that estrogen use of between one and eight years is associated with an estimated odds ratio for endometrial cancer of exp(1.37) = 3.94; the associated 95% confidence interval for the odds ratio is (2.46, 6.30). The corresponding estimate of the odds ratio and confidence interval associated with eight years or more of estrogen use are exp(2.60) = 13.46 and (8.25, 21.98), respectively.
21.5. Confounding and Effect Modification
In the epidemiological literature, the terms confounding factor and effect modifying factor receive considerable attention. An epidemiological study is frequently designed to investigate a particular risk factor for a disease, and it is common practice to refer to this risk factor as an exposure variable, i.e., exposure to some additional risk. A confounding factor is commonly considered to be a variable which is related to both disease and exposure. Variables which have this property are discussed in §5.2 and will tend to bias any estimation of the relationship between disease and exposure. An effect modifying factor is a variable which may change the strength of the relationship between disease and exposure. The odds ratio, which associates exposure and disease, would vary with the level of an effect modifying variable. Confounding variables, and to a lesser extent effect modifying variables, are treated somewhat differently than exposure variables of direct interest because the former tend to be factors
21 Epidemiological Applications |
270 |
which are known to be related to disease. Therefore, it is necessary to adjust for confounding and effect modifying factors in any discussion of new risk factors. Although in formal statistical procedures this distinction between variables is not made, it can be an important practical question. Consequently, we propose to indicate how these concepts can be viewed in the context of regression models for case-control studies.
The analysis of epidemiological data using regression models like (21.3) and (21.5) will only identify something we choose to call ‘potential’ confounding variables. This is because the interrelation of the covariates is irrelevant in such an approach. Thus, the relationship between exposure and a potential confounding variable is never explored. Provided that the regression model is an adequate description of the data, its use will prevent a variable which is confounding in the common epidemiological sense from biasing the estimation of the odds ratio.
Effect-modification in a logistic regression model corresponds to something called an interaction effect in the statistical literature. Sections 15.3 and 15.4 contain an extensive discussion of the notion of interaction in an analysis of variance setting. Additional occurrences of interaction terms in regression models that we fit to data may be found in §§11.5 and 12.3. Although the use of the term effect modifier is a historical fact, and is unlikely to disappear quickly from the epidemiological literature, Breslow and Day [68] indicate that ‘the term is not a particularly happy one however’.
It is wise to regard most regression models as convenient, empirical descriptions of particular data sets. An interaction effect is a concept which only has meaning within the context of a particular model. If an interaction is identified, the aim of subsequent analyses, which may involve alternative regression models, should be to understand the nature of the data and of the biological process which has resulted in the empirical description obtained.
For example, table 21.4 adds two additional variables to the model summarized in table 21.3. The variable X1X4 is an interaction term which is the product of X1 and X4; X1X4 is one only if a woman is hypertensive and used estrogen for one to eight years, and is zero otherwise. A second interaction term, X2X4, identifies women who are hypertensive and who have used estrogen for more than eight years (X2X4 = 1).
According to the analysis summarized in table 21.4, the coefficient associated with X1X4 is significantly different from zero. Thus, estrogen use of one to eight years would have an estimated odds ratio of exp(1.78) = 5.93 for nonhypertensive women and exp(1.78 – 1.04) = 2.10 for hypertensive women. This indicates that the effect of moderate estrogen exposure is modified by hypertensive status. Alternatively, if we adopt the approach described in §13.3, we can interpret this significant interaction as indicating that exp(1.78 + 0.60 –
Confounding and Effect Modification |
271 |
Table 21.4. The results of an odds ratio regression analysis of endometrial cancer incidence incorporating interaction terms for hypertension and estrogen use (cf. table 21.3); the analysis stratifies on baseline age
Risk factor |
Regression |
Definition |
ˆ |
SE |
a |
Significance |
|
b |
|
||||||
|
|
variable |
|
|
|
|
levelb |
Estrogen use |
X1 |
1 if duration of use between |
1.78 |
0.37 |
<0.0001 |
||
|
|
|
1 and 8 years; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
X2 |
1 if duration of use 8 years or |
2.40 |
0.35 |
<0.0001 |
|
|
|
|
greater; 0 otherwise |
|
|
|
|
Obesity |
X3 |
1 if weight greater than 160 |
0.75 |
0.33 |
0.02 |
||
|
|
|
lbs; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
||
Hypertension |
X4 |
1 if history of high blood |
0.60 |
0.31 |
0.05 |
||
|
|
|
pressure; 0 otherwise |
|
|
|
|
Parity |
|
X5 |
1 if number of children 2 or |
0.58 |
0.31 |
0.06 |
|
|
|
|
greater; 0 otherwise |
|
|
|
|
Interaction |
X1X4 |
1 if both X1 and X4 equal 1; |
–1.04 |
0.50 |
0.04 |
||
terms |
|
|
0 otherwise |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
X2X4 |
1 if both X2 and X4 equal 1; |
0.53 |
0.53 |
0.32 |
|
|
|
|
0 otherwise |
|
|
|
|
|
|
|
|
|
|||
Maximized log-likelihood = –283.08. |
|
|
|
|
|||
a |
|
|
ˆ |
|
|
|
|
|
Estimated standard error of b. |
|
|
|
|
||
b Estimated significance level for testing the hypothesis b = 0.
1.04), the individual odds ratio associated with being hypertensive and a moderate duration estrogen user, is less than the product of the odds ratios associated with hypertension, i.e., exp(0.60), and estrogen use of one to eight years, i.e., exp(1.78), separately. In any event, it appears that estrogen use of one to eight years duration does not further increase a hypertensive woman’s risk of endometrial cancer.
The arbitrary grouping of the duration of estrogen use which appears in tables 21.3 and 21.4 may not provide the simplest empirical model. Table 21.5 presents a model where estrogen use is defined to be the logarithm of (estrogen duration use + 1); the constant value 1 is added to prevent infinite values of the covariate. This model fits the data of Weiss et al. [67] as well as the model summarized in table 21.4. That this is the case can be determined by comparing maximized log-likelihoods, a technique which was mentioned briefly at the end of §16.6. The formal details of the assessment are a bit more complicated
21 Epidemiological Applications |
272 |
Table 21.5. The results of an odds ratio regression analysis of endometrial cancer incidence which models estrogen use with a continuous covariate (cf. table 21.4); the analysis stratifies on baseline age
Risk factor |
Regression |
Definition |
ˆ |
SE |
a |
Significance |
|
b |
|
||||||
|
|
variable |
|
|
|
|
levelb |
Estrogen use |
X1 |
logarithm of |
0.98 |
0.13 |
<0.0001 |
||
|
|
|
(duration of use + 1) |
|
|
|
|
Obesity |
X3 |
1 if weight greater than |
0.67 |
0.33 |
0.04 |
||
|
|
|
160 lbs; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
||
Hypertension |
X4 |
1 if history of high blood |
0.45 |
0.31 |
0.15 |
||
|
|
|
pressure; 0 otherwise |
|
|
|
|
Parity |
|
X5 |
1 if number of children 2 or |
0.59 |
0.31 |
0.06 |
|
|
|
|
greater; 0 otherwise |
|
|
|
|
|
|
|
|
|
|
||
Interaction |
X1X4 |
X1 if X4 equals 1; |
–0.02 |
0.18 |
0.91 |
||
|
|
|
0 otherwise |
|
|
|
|
|
|
|
|
|
|||
Maximized log-likelihood = –286.543. |
|
|
|
|
|||
a |
|
|
ˆ |
|
|
|
|
|
Estimated standard error of b. |
|
|
|
|
||
bEstimated significance level for testing the hypothesis b = 0.
here. However, it can be seen that the analysis presented in table 21.5 requires only one variable to model the effect of estrogen use, and the interaction between estrogen use and hypertension is no longer significant. Thus, an interaction which was present in one empirical description of the data need not appear in another.
21.6. Mantel-Haenszel Methodology
Historically, and for convenience, many epidemiological studies have concentrated on a binary exposure variable and examined its relationship to disease after adjustment for potential confounding variables. Therefore, the analysis of such studies has often been based on stratified 2 ! 2 tables of the type discussed in chapter 5 and shown in table 21.6.
In chapter 11 and in this chapter, we have indicated how logistic regression models can be used to estimate odds ratios from data of this type. However, the use of logistic regression models does require a reasonably sophisticated computer package, and simpler methods of estimation have a long history of
Mantel-Haenszel Methodology |
273 |
Table 21.6. A 2 ! 2 table summarizing the binary data for level i of a confounding factor
|
Confounding factor level i |
Total |
|
|
success |
failure |
|
|
|
|
|
Group 1 |
ai |
Ai–ai |
Ai |
Group 2 |
bi |
Bi–bi |
Bi |
Total |
ri |
Ni–ri |
Ni |
|
|
|
|
use. The most widely used estimate of , the common odds ratio, based on stratified 2 ! 2 tables is the Mantel-Haenszel estimate, for which the defining equation is
|
k |
|
|
|
ai (Bi bi )/Ni |
|
(21.6) |
ORMH = |
i =1 |
, |
|
k |
|
||
|
(Ai ai )bi/Ni |
|
|
i =1
where there are k distinct 2 ! 2 tables of the type illustrated in table 21.6. Although it may not be apparent from equation (21.6), this estimate is a weighted average of the individual odds ratios {ai(Bi – bi)}/{(Ai – ai)bi} in the k 2 ! 2 tables. For completeness, the weight for the ith table is (Ai – ai)bi/Ni, a quantity which approximates the inverse of the variance of the individual estimate of the odds ratio in the ith table when is near 1. Note also that OˆRMH is easy to compute and is not affected by zeros in the tables. Research has shown that the statistical properties of this estimate compare very favourably with the corresponding properties of estimates which are based on logistic regression models.
In chapter 5, we introduced the summary 2 statistic for testing the independence of exposure and disease. In the epidemiological literature, this same statistic is often called the Mantel-Haenszel 2 statistic for testing the null hypothesis that the common odds ratio is equal to 1. There is an extensive statistical literature concerning variance estimates for OˆRMH. Perhaps the most useful estimate is one proposed by Hauck [69] for use in situations when the number of tables would not increase if the study was larger. Its formula, which is not all that simple, is
|
|
k |
2 |
|
|
|
|
|
|
|
|
|
|
|
|
|
V =ORMH Si /Wi |
|
|
|
|
|
|
|
|
|
|
|
|
|
|||
|
|
i =1 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
(Ai ai )bi |
|
|
|
1 |
|
1 |
|
|
1 |
|
1 |
|
1 |
|
where Si |
= |
|
|
and Wi |
= |
|
|
+ |
|
+ |
|
|
+ |
|
|
. |
|
|
|
|
|
|
|
|
|
||||||||
|
|
|
Ni |
|
|
|
+1/2 |
|
Ai ai +1/2 |
|
bi |
+1/2 |
|
Bi bi |
+1/2 |
|
|
|
|
|
ai |
|
|
|
|
||||||||
21 Epidemiological Applications |
274 |
Table 21.7. The Doll-Hill [71] cohort study of smoking and coronary deaths in British male doctors
Age range |
Smokers |
|
Non-smokers |
|
||
|
|
deaths |
person-years |
|
deaths |
person-years |
|
|
|
|
|
||
35–44 |
32 |
52,407 |
2 |
18,790 |
||
45–54 |
104 |
43,248 |
12 |
10,673 |
||
55–64 |
206 |
28,612 |
28 |
5,710 |
||
65–74 |
186 |
12,663 |
28 |
2,585 |
||
75–84 |
102 |
5,317 |
31 |
1,462 |
||
|
|
|
|
|
|
|
The fraction ½ is added to each of the denominators in Wi to avoid division by zero. Robins et al. [70] discuss other variance estimates that can be used in a wide variety of situations and also are not too difficult to compute.
Because of its good statistical properties, the Mantel-Haenszel estimator of a common odds ratio, , can be recommended for use by researchers with few resources for the complicated calculations of logistic regression. However, readers should remember that an analysis which is based on a logistic regression model is equivalent to a Mantel-Haenszel analysis, and offers the additional generality and advantages of regression modelling.
21.7. Poisson Regression Models
The Doll-Hill [71] cohort study of smoking and coronary deaths among British male doctors was a landmark epidemiological investigation. Table 21.7 provides data from this study as recorded by Breslow and Day [23] and also used by McNeil [72]. The data consist of the number of deaths tabulated by smoking status and age group. In addition, for each smoking status and age group category, the number of years that smoking and non-smoking doctors in the study were observed while belonging to the various age groups are listed. This information is called person-years of follow-up. For example, a smoking doctor who entered the study at age 42 and died or was lost to follow-up at age 57 would belong to the age group 35–44 years for the first two years of followup, then to the age group 45–54 for ten years, and finally to the age group 55– 64 for three years. This doctor would contribute 2, 10, 3, 0 and 0 person-years to the five successive entries tabulated in column three of table 21.7. Observed deaths in the study are recorded in columns two and four of the table in the row corresponding to the doctor’s age at death.
Poisson Regression Models |
275 |
The outcome or response variable for this study, which we will denote by Y, is the number of deaths due to coronary heart disease. The explanatory variables of interest are smoking status, which is used to define a binary variable (0 identifies non-smokers, and 1 denotes smokers), and the ten-year age groups. Since this age group information is a categorical variable with five levels, four binary variables identify the four oldest age ranges and compare each group with the youngest age range (35–44), which would correspond to all four binary variables being coded 0. Readers may recall that we previously encountered the use of k – 1 indicator variables to encode a categorical explanatory variable with k levels in §16.4 and also in chapter 15.
Since the response variable is a count, Poisson regression is a natural method to use in analyzing these data. In chapter 12, we identified the model equation for Poisson regression as
log = a + ki = 1 biXi ,
where is the expected or mean count from a Poisson distribution. This regression equation allows the expected counts to depend on the explanatory variables. However, in table 21.7 the counts in columns 2 and 4 depend not only on the explanatory variables but also on the number of person-years of observation that correspond to the observed counts. Therefore, we need to modify the model equation for Poisson regression so that the expected counts also reflect this dependence on person-years.
The necessary modification is equivalent to modelling the event rate, which corresponds in the Doll-Hill study to the death rate rather than the expected counts. If PY denotes the number of person-years corresponding to a particular death count, Y, then the death rate, i.e., number of deaths per per- son-year of follow-up, is Y/PY. The resulting modified model equation is
log( PY) = a + ik = 1 biXi .
With this modification, using Poisson regression will be sensible and the results of any analyses will have the same form as we previously described in chapter 12.
Table 21.8 presents the results of fitting two Poisson regression models of this modified type to the data summarized in table 21.7. The model labelled a includes only smoking status as an explanatory variable, and demonstrates the strong relationship between smoking status and the physician death rates due to coronary heart disease. However, in this study, as in many cohort studies, age is an important determinant of mortality. Therefore, the analysis labelled b summarizes the results of a Poisson regression in which both smoking status and physician age are represented in the model equation by the five binary explanatory variables X1 and X2 through X5, respectively. The estimated regres-
21 Epidemiological Applications |
276 |
Table 21.8. The results of two Poisson regression analyses of the relationship between coronary deaths and smoking status in British male doctors
Model |
Explanatory |
Estimated |
Estimated |
Significance |
|
variable |
regression |
standard |
level |
|
|
coefficient |
error |
|
|
|
|
|
|
aWithout adjusting for age
a |
–5.96 |
0.10 |
|
Smoker |
0.54 |
0.11 |
<0.001 |
bAdjusting for age
a |
–7.92 |
0.19 |
|
Smoker |
0.36 |
0.11 |
0.0005 |
Age 45–54 |
1.48 |
0.20 |
<0.0011 |
55–64 |
2.63 |
0.18 |
|
65–74 |
3.35 |
0.19 |
|
75–84 |
3.70 |
0.19 |
|
1 Likelihood ratio test.
sion coefficients associated with membership in one of the ten-year age groups in the study are significantly different from zero, highlighting the important role that age plays in coronary deaths. Although the regression coefficient associated with smoking status in the latter analysis, i.e., b, is slightly smaller than the corresponding estimate obtained without adjusting for age, there is still strong evidence against the hypothesis of no relationship between smoking status and coronary deaths. By including the age group information in the regression model, we know that the importance of smoking status cannot be attributed to any confounding between smoking and age, since the information about age is included in the model when we test the hypothesis that the regression coefficient associated with smoking status is equal to zero.
In the context of this study, the relative rates of events which are specified by a Poisson regression model can be interpreted as relative risk. The event of interest in the Doll-Hill study is death due to coronary heart disease, and the mean of the assumed Poisson distribution can be interpreted as the expected death rate. Therefore, if bj is the regression coefficient associated with a particular binary explanatory variable, such as smoking status, exp(bj) represents a ratio of death rates, one for smokers and one for non-smokers, i.e., the relative risk of death associated with smoking. Thus, the key feature of the analysis that we can extract from the age-adjusted results in table 21.8 is the estimated relative risk of coronary death associated with smoking, adjusted for age, which
ˆ
is exp(0.36) = 1.43. And if we use the estimated standard error for b1, which is
Poisson Regression Models |
277 |
- #
- #
- #28.03.202681.2 Mб0Ultrasonography of the Eye and Orbit 2nd edition_Coleman, Silverman, Lizzi_2006.pdb
- #
- #
- #
- #28.03.202621.35 Mб0Uveitis Fundamentals and Clinical Practice 4th edition_Nussenblatt, Whitcup_2010.chm
- #
- #
- #28.03.202627.87 Mб0Vaughan & Asbury's General Ophthalmology 17th edition_Riordan-Eva, Whitcher_2007.chm
- #
