Using and Understanding Medical Statistics, Matthews and Farewell, 2007
for the death rate represent patient characteristics which are defined at the start of the follow-up period, and this initial classification does not change throughout the analysis. Covariates of this type are said to be fixed with respect to time, and commonly arise in clinical studies. However, in some situations it may be desirable, and appropriate, to examine the influence on a hazard rate of patient characteristics which change over time. In the remainder of this section, we describe such a situation and illustrate the ease with which time-dependent covariates can be incorporated into proportional hazards regression. In our view, it is a very attractive feature of this regression approach to the analysis of survival data.
Following bone marrow transplantation for the treatment of acute leukemia, an important outcome event is the recurrence of the disease. The rate of leukemic relapse can be modelled using a proportional hazards regression of the time from transplantation to leukemic relapse. Another serious complication which may arise in the immediate post-transplant period is acute graft-versus-host disease (GVHD), which is thought to be an immunologic reaction of the new marrow graft against the patient. The interrelationship of these two adverse outcomes is of particular interest.
Prentice et al. [26] examine this interrelationship by incorporating information on the occurrence of GVHD in a proportional hazards regression model for leukemic relapse following bone marrow transplantation. However, the development of GVHD in a patient is not a predictable phenomenon. Therefore, it would be quite inappropriate to model the effect of GVHD on the relapse rate by using a covariate which ignores this fact, i.e., by using a fixed covariate which classifies an individual as having GVHD throughout the posttransplant period. The simplest possible way to incorporate this temporal dependence would involve the use of a binary covariate which is equal to zero at times prior to a diagnosis of GVHD, but takes the value one at all times thereafter. If more comprehensive data are available, perhaps indicating the severity of GVHD, this information could also be incorporated into the regression model through the use of suitably defined time-dependent covariates.
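The binary time-dependent covariate described above can be sketched in a few lines of Python. This is only an illustration: the function names, the choice of days as the time unit, and the convention that the covariate switches to one at the moment of diagnosis are assumptions, not details taken from Prentice et al. [26].

```python
from typing import List, Optional, Tuple


def gvhd_indicator(t: float, gvhd_time: Optional[float]) -> int:
    """Value of the binary covariate x(t) at time t after transplantation.

    gvhd_time is the (hypothetical) time of GVHD diagnosis, or None if the
    patient never developed GVHD during follow-up.
    """
    if gvhd_time is None:
        return 0
    # Assumption: x(t) switches from 0 to 1 at the time of diagnosis.
    return 1 if t >= gvhd_time else 0


def split_follow_up(end_time: float,
                    gvhd_time: Optional[float]) -> List[Tuple[float, float, int]]:
    """Split one patient's follow-up into (start, stop, x) intervals
    on which the covariate is constant."""
    if gvhd_time is None or gvhd_time >= end_time:
        return [(0.0, end_time, 0)]
    if gvhd_time <= 0:
        return [(0.0, end_time, 1)]
    return [(0.0, gvhd_time, 0), (gvhd_time, end_time, 1)]


# A hypothetical patient followed for 100 days, with GVHD diagnosed on day 30.
print(gvhd_indicator(10, 30), gvhd_indicator(45, 30))
print(split_follow_up(100.0, 30.0))
```

A fixed covariate would instead assign the value one to this patient for the whole of follow-up, including the first 30 days before GVHD had occurred, which is the misclassification the text warns against.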
In §13.2, we used the notation, d(t), to represent the dependence of a death rate on time. Here we will denote a relapse rate by r(t) and, in a similar fashion, use X(t) to show the dependence of a covariate on time. Let $\underset{\sim}{X}(t) = \{X_1(t), \ldots, X_k(t)\}$ represent the set of covariates in the regression model. Then for an individual with observed values of the covariates $\underset{\sim}{x}(t) = \{x_1(t), \ldots, x_k(t)\}$, a regression equation for the leukemic relapse rate, which parallels equation (13.1), is

$$\log\{r(t; \underset{\sim}{x}(t))\} = \log\{r_0(t)\} + \sum_{i=1}^{k} b_i x_i(t).$$
13 Proportional Hazards Regression |
Table 13.4. The results of a proportional hazards regression analysis of leukemia relapse data based on 135 patients treated for acute leukemia by means of bone marrow transplantation

| Covariate | Estimated regression coefficient | Estimated standard error | Test statistic |
|---|---|---|---|
| GVHD | –0.76 | 0.37 | 2.05 (p = 0.04) |
| Transplant type | 0.05 | 0.34 | 0.15 (p = 0.88) |
| Age | 0.13 | 0.10 | 1.30 (p = 0.19) |
Adapted from Prentice et al. [26] with permission from the publisher.
The addition of time-dependent covariates to the model is a very natural extension of equation (13.1), which already included a dependence on time. Although this refinement of the proportional hazards model appears to be simply a change in notation, it represents a major advance in biostatistical technique. There are many subtleties associated with its use and interpretation which we are not able to discuss adequately in these few pages. Readers are strongly urged to consult a statistician from the very beginning of any proposed study that may eventually involve the use of time-dependent covariates in a regression model.
Table 13.4, which is taken from Prentice et al. [26], presents the results of an analysis of data on leukemic relapse in 135 patients. Since the sample includes 31 syngeneic (identical twin) bone marrow transplants with no risk of GVHD, the regression model includes a binary covariate indicating the type of transplant (0 = syngeneic, 1 = allogeneic) and a continuous covariate representing patient age (years/10). As the results of this analysis indicate, neither of these covariates is significantly associated with leukemic relapse. However, their inclusion in the model adjusts the estimation of the GVHD effect for the influence of transplant type and patient age. Even after adjusting for the effect of these variables, the regression coefficient for GVHD is significant at the 0.05 level. The rate of leukemic relapse for patients who develop GVHD is estimated to be exp(–0.76) = 0.47 times the relapse rate for patients who do not have GVHD at the same time post-transplant. The corresponding 95% confidence interval for this relative risk is (exp(–1.46), exp(–0.03)) = (0.23, 0.97).
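The relative risk and its confidence interval can be checked from the tabulated coefficient and standard error. The sketch below assumes the usual large-sample interval, coefficient ± 1.96 standard errors on the log scale; the function name is illustrative. Because it starts from the rounded entries in Table 13.4, the lower log-limit comes out near –1.49 rather than the quoted –1.46, presumably because the published interval was computed from unrounded estimates. The limits agree after rounding to two decimals.

```python
import math


def ratio_with_interval(coef: float, se: float, z_crit: float = 1.96):
    """Return exp(coef) and the limits exp(coef -/+ z_crit * se)."""
    ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return ratio, lower, upper


# GVHD row of Table 13.4: coefficient -0.76, standard error 0.37.
rr, lower, upper = ratio_with_interval(-0.76, 0.37)
print(f"relative risk = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```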
The results of this analysis suggest that the occurrence of GVHD is protective with respect to leukemic relapse. This may indicate that GVHD serves to eradicate residual or new leukemic cells. The clinical implications of this finding are, of course, subtle and will not be discussed here. However, it suggests that although severe GVHD is clearly undesirable, a limited graft-versus-host reaction could help to control leukemic relapse.
The Use of Time-Dependent Covariates |
14 The Analysis of Longitudinal Data
14.1. Introduction
In nearly all the examples that we have discussed in previous chapters, each study subject has yielded one value for the outcome variable of interest. However, many medical studies involve long-term monitoring of participants; therefore, the repeated measurement of an outcome variable is both feasible and likely. Sometimes, it may be reasonable to focus on one particular value in a series of measurements. More often, the full set of outcome variables measured will be of interest.
A variety of statistical methods have been developed for analyzing data of this type, i.e., longitudinal data. The essential difference between the various methods of analysis that we have discussed in previous chapters and the approach required for longitudinal data is that the model must account for the correlation between repeated observations on the same subject. That is, two observations on the same individual will tend to be more similar than two individual measurements taken on distinct subjects.
The classic example of a statistical method for analyzing studies involving more than one measurement on each subject is known as repeated measures analysis of variance. This topic is introduced in chapter 15, where we also provide additional details concerning the analysis of variance. However, that material is more technical than most subjects that we address in this book, so, in this chapter, we will avoid any discussion of analysis of variance. Instead, we will discuss three examples of longitudinal studies that allow us to illustrate some recently developed, quite general methods of analyzing longitudinal data.
14.2. Liang-Zeger Regression Models
14.2.1. The Study
The first study to be discussed is one concerning the relationship between the use of recreational drugs during sexual activity and high-risk sexual behavior in a cohort of 249 homosexual and bisexual men during a five-year period. The cohort was monitored approximately every three months and, at each follow-up visit, participants were interviewed in private by a trained interviewer.
For the purposes of illustration, we will describe only a simplified analysis of this study; readers who are interested in a more comprehensive discussion should consult Calzavara et al. [27]. Based on the interview conducted at a follow-up visit, each study subject was assigned a summary sexual activity risk score which we shall denote as RS. This score was designed to summarize both the risk level of the sexual activities in which the subject had participated during the previous three months and the number of partners with whom these activities had been performed. The average RS value across all subjects declined from a high of 152.2 on the first follow-up visit to a low of 60.0 on the 17th time the cohort was monitored. A logarithmic transformation of the RS observations was used to make the distribution of this outcome variable at each monitoring occasion similar to a normal distribution. In the remainder of this section, we will denote the logarithm of the RS measurement by the variable Y.
Although various explanatory variables were investigated in this study, the analysis that we describe below will use only three. The first, X1, is an ordinal variable denoting the sequence number of the follow-up visit; the values of X1 range from 1 to 20. The second, X2, is a binary variable indicating that the study participant had used recreational drugs in conjunction with a sexual encounter in the previous three months (X2 = 1); otherwise, X2 = 0. The final explanatory variable, X3, identifies whether the subject was HIV-1 seropositive at the preceding follow-up visit (X3 = 1) or HIV-1 seronegative (X3 = 0).
14.2.2. The Regression Model
It is natural to adopt a regression model to study the relationship between Y, the logarithm of RS, and the three explanatory variables. Symbolically, the equation
Y = a + b1X1 + b2X2 + b3X3
for this regression model is similar to the one we used in chapter 10 to describe the relationship between brain weight, body weight and litter size in preweaning mouse pups. However, the analysis in chapter 10 was based on the assumption that each value of Y measured was independent of all other values. Since the brain weight measurements used in chapter 10 were obtained from different litters, this independence assumption seems appropriate. In the present study, the same assumption of independence is unreasonable since a single subject may contribute up to 20 values of Y.
In recent years, Liang and Zeger [28, 29] have developed a method of analyzing longitudinal data using regression models. Their approach is based on an assumption about the correlation between observations on the same subject. A discussion of the range of possibilities that might be adopted in analyzing longitudinal data is beyond the scope of this book. The simplest assumption, and the one that we will adopt in analyzing the present study, is that the correlation between pairs of observations from the same subject does not vary between pairs; we will denote the unknown value of this common correlation by the Greek letter ρ. However, pairs of Y values that were obtained from different individuals are still assumed to be independent, which means their correlation is zero. In order to estimate the coefficients a, b1, b2, and b3 in the regression model, it is necessary to incorporate the additional parameter ρ in the estimation procedure.
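To make the common-correlation assumption concrete, the following sketch builds the correlation matrix it implies for the repeated measurements on one subject: ones on the diagonal and the same value (called rho in the code) everywhere else. The function name is illustrative, and the value 0.544 is the estimate reported beneath Table 14.1, used here purely as an example.

```python
from typing import List


def common_correlation_matrix(n_visits: int, rho: float) -> List[List[float]]:
    """Correlation matrix for n_visits observations on one subject when
    every pair of observations shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_visits)]
            for i in range(n_visits)]


# Four visits by the same subject, with the common correlation set to 0.544.
for row in common_correlation_matrix(4, 0.544):
    print(row)
```

Observations from different subjects do not appear in this matrix at all, since their correlation is assumed to be zero.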
The Liang-Zeger approach to analyzing data from longitudinal studies involves two notable advantages. First, the method can be used with many different types of regression models. For example, if the response or outcome variable is binary and we wish to use a logistic regression model, the method of analyzing the data is essentially unchanged. The second advantage derives from the method of estimation, which is called generalized estimating equations (GEE); this methodology is often referred to as GEE regression models. According to statistical theory, the estimated regression coefficients are valid even if the correlation assumptions on which the analysis is based are not precisely correct. In most situations involving longitudinal data, the critical component in a sensible analysis is to incorporate some assumption about correlation so that the unreasonable premise that repeated measurements on the same subject are independent can be avoided. The Liang-Zeger approach confers the additional advantage that the ‘robust’ method of estimation even accommodates uncertainty about the most appropriate assumption concerning correlation.
One consequence of this ‘robust’ estimation procedure is that while the value of ρ is estimated, there is usually no corresponding estimated standard error. However, the importance of the correlation parameter is, in most instances, secondary, since the analysis primarily concerns the relationship between the outcome variable, Y, and the associated explanatory variables.
Table 14.1. The results of a Liang-Zeger regression analysis of longitudinal RS data obtained from a cohort of 249 homosexual and bisexual men

| Covariate | Estimated regression coefficient | Estimated standard error | Test statistic | Significance level (p-value) |
|---|---|---|---|---|
| a | 3.736 | 0.120 | – | – |
| Visit number | –0.054 | 0.007 | 7.71 | <0.001 |
| Drug use | 0.550 | 0.085 | 6.47 | <0.001 |
| Seropositive | 0.443 | 0.254 | 1.74 | 0.081 |

ρ̂ = 0.544.
14.2.3. Illustrative Results
Table 14.1 summarizes the results of fitting the regression model outlined in the previous section to the RS data.
The tabulated values are interpreted in the same way that estimated regression coefficients were understood in previous chapters involving regression models. A regression coefficient that is significantly different from zero represents an association between the outcome variable and the corresponding covariate, after adjusting for all other covariates included in the model. A statistical test of the hypothesis that the regression coefficient is zero can be based on the magnitude of the ratio of the estimated coefficient to its standard error; this ratio is compared to critical values from the modulus of a normal distribution with mean zero and variance one (cf. table 8.1).
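The test just described can be reproduced from the entries in Table 14.1. The sketch below takes the absolute value of each coefficient divided by its standard error and converts it to a two-sided p-value from the standard normal distribution; the function and variable names are illustrative.

```python
import math


def ratio_test(coef: float, se: float):
    """Return |coef / se| and the two-sided p-value from a standard normal."""
    z = abs(coef) / se
    # erfc(z / sqrt(2)) equals 2 * (1 - Phi(z)) for z >= 0.
    p = math.erfc(z / math.sqrt(2.0))
    return z, p


# (coefficient, standard error) pairs copied from Table 14.1.
table_14_1 = {
    "Visit number": (-0.054, 0.007),
    "Drug use": (0.550, 0.085),
    "Seropositive": (0.443, 0.254),
}

results = {name: ratio_test(coef, se) for name, (coef, se) in table_14_1.items()}
for name, (z, p) in results.items():
    print(f"{name}: test statistic = {z:.2f}, p = {p:.3f}")
```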
According to the results presented in table 14.1, there is a demonstrable relationship between the visit number and the RS measurement; the negative sign of the estimated regression coefficient indicates that the mean of Y tended to decline as the study progressed. Since the regression coefficient associated with X2 is positive and significantly different from zero, the use of recreational drugs during sexual encounters is associated with an increase in the mean value of Y. Finally, although there is an estimated increase in the mean response associated with HIV-1 seropositive status, these data provide no evidence that the increase is significantly different from zero. Since Y denotes the logarithm of the RS value, the corresponding conclusions with respect to the original RS measurement are that the mean value declined substantially as the study progressed; however, whenever recreational drugs were used during sexual encounters, the mean RS value at the succeeding interview tended to be higher.
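As an illustration of how the fitted equation is read, the sketch below evaluates the estimated mean of Y from the coefficients in Table 14.1 for a few covariate patterns. The function name and the chosen patterns are illustrative. The results are left on the logarithmic scale because the text does not state the base of the logarithm.

```python
# Estimated coefficients copied from Table 14.1.
COEF = {"a": 3.736, "visit": -0.054, "drug": 0.550, "seropositive": 0.443}


def fitted_mean_y(visit: int, drug_use: int, seropositive: int) -> float:
    """Estimated mean of Y (log RS) for the given covariate values."""
    return (COEF["a"]
            + COEF["visit"] * visit
            + COEF["drug"] * drug_use
            + COEF["seropositive"] * seropositive)


# Seronegative subject without drug use, first versus 17th visit.
print(round(fitted_mean_y(1, 0, 0), 3), round(fitted_mean_y(17, 0, 0), 3))
# Same subject at visit 17, now with recreational drug use reported.
print(round(fitted_mean_y(17, 1, 0), 3))
```

The difference between the last two values is the drug-use coefficient, 0.550, whatever the visit number, which is what adjusting for the other covariates means in this additive model.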
Based on the results of this simplified analysis, the findings of this longitudinal study would appear to be that, on average, high-risk sexual activity in the cohort has declined over time. Nonetheless, high-risk activities, when they occur, tend to involve recreational drug use with a sexual encounter.
14.3. Random Effects Models
14.3.1. The Study
In a clinical investigation of ovulation, diabetic and healthy women were followed for various periods of time during which each ovulatory cycle that a subject experienced was classified as abnormal or normal. The hypothesis of interest was whether diabetic women had a higher frequency of anovulatory cycles than non-diabetic women.
The study involved 23 diabetic women and 58 who were not diabetic. The number of cycles classified for each woman varied from 1 to 12. Thus, this investigation represents an example of a data set consisting of relatively short sequences of binary data observed on a moderately large number of women. Since the character of cycles in the same woman should be similar, each observed cycle cannot be regarded as an independent observation. Therefore, as we indicated in our discussion of the previous example in this chapter, our analysis of the study data should take account of the correlation between ovulatory cycles of the same woman.
14.3.2. The Regression Model
Liang-Zeger regression models are often characterized as marginal regression models because the regression model itself looks just like one that might be used if only a single response measurement was available for each subject. Therefore, in some sense, it represents a model for any randomly selected observation from the population. In addition to being referred to as a marginal regression model, a Liang-Zeger regression model is also sometimes called a population-averaged model.
For binary data, the Liang-Zeger approach would use a logistic regression model. To analyze the study of ovulatory cycles in diabetic and healthy women, such a model could take the form
$$\Pr(Y = 1 \mid x) = \frac{\exp(a + bx)}{1 + \exp(a + bx)} \qquad (14.1)$$
that we encountered in chapter 11, where Y = 1 denotes an abnormal cycle and Y = 0 a normal cycle. The explanatory variable X can be used to denote diabetic status with X = 1 corresponding to a diabetic woman and X = 0 otherwise. Then the associated regression coefficient, b, is the logarithmic odds ratio of an abnormal ovulatory cycle, and exp(b) is the corresponding odds ratio, reflecting the effect that being diabetic has on the probability of experiencing an abnormal cycle.
An alternative to the Liang-Zeger approach to dealing with two or more correlated observations from a woman in the study is to use a regression model that is similar to the stratified logistic regression model that we introduced in §11.3; see equation (11.2). This alternative regression model is represented by the equation
$$\Pr(Y = 1 \mid x) = \frac{\exp(a_i + bx)}{1 + \exp(a_i + bx)} \qquad (14.2)$$
where the subscript i indexes all the women in the study. By adopting a distinct value, ai, of the intercept for each woman, this model specifies that the overall rate of an abnormal ovulatory cycle can vary arbitrarily among women. The model also assumes that this variation in the values of ai can account for the correlation between observations that were obtained from the same woman. Note, however, that the effect of being diabetic on the probability of having an abnormal ovulatory cycle, which is measured by b, is assumed to be the same for all women in the study.
With many women and small numbers of observations from some subjects, it is not possible to estimate the large number of subject-specific parameters, i.e., the 81 values of ai. However, if we also assume that the various ai values all come from a common probability distribution, such as a normal distribution with a population mean and standard deviation denoted by a and σ, respectively, then we can estimate these latter two values.
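The structure of this model can be sketched by simulation. In the code below, subject-specific intercepts are drawn from a normal distribution and the probability in equation (14.2) is evaluated for each woman with x = 0 and x = 1. The numerical values of the mean, standard deviation and b are hypothetical, chosen only for illustration, and the function names are not from the text.

```python
import math
import random


def prob_abnormal(a_i: float, b: float, x: int) -> float:
    """Equation (14.2): probability of an abnormal cycle for a woman
    with intercept a_i and diabetic status x."""
    eta = a_i + b * x
    return math.exp(eta) / (1.0 + math.exp(eta))


def odds(p: float) -> float:
    return p / (1.0 - p)


# Hypothetical values for the population mean, standard deviation and b.
a_mean, sigma, b = -1.0, 1.0, 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
intercepts = [rng.gauss(a_mean, sigma) for _ in range(5)]

baseline_probs = [prob_abnormal(a_i, b, 0) for a_i in intercepts]
odds_ratios = [odds(prob_abnormal(a_i, b, 1)) / odds(prob_abnormal(a_i, b, 0))
               for a_i in intercepts]

for p0, ratio in zip(baseline_probs, odds_ratios):
    print(f"baseline probability = {p0:.3f}, odds ratio = {ratio:.3f}")
```

The baseline probabilities differ from woman to woman, while the odds ratio equals exp(b) for every value of the intercept, which is the sense in which the effect of diabetic status is assumed to be the same for all women.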
In general, fitting such a random effects model can be quite a complex task, one that we choose not to discuss here. However, the simplest output from appropriate software will be similar to that which we have encountered for other regression models. Also, as we noted in the case of Liang-Zeger models, the general structure of such a random effects assumption can be incorporated into many different types of regression models, including those we have considered in the preceding four chapters.
The assumption that the subject-specific parameters ai belong to a common probability distribution is why such a model is called a ‘random effects’ model. Notice, however, that the regression coefficient associated with the explanatory variable of interest, which denotes diabetic status in this example, measures how the odds in favour of an abnormal ovulatory cycle for any particular woman, with her specific value of ai, would change if the woman was diabetic compared to the corresponding odds if she was not diabetic. Thus, in contrast to Liang-Zeger marginal models, this type of random effects model is sometimes called a subject-specific regression model. A full discussion of the various distinctions between these models is beyond the scope of this book, but readers may encounter this terminology in other settings, and we hope our brief discussion has provided some useful background.

Table 14.2. The results of two logistic regression analyses of longitudinal data collected from 81 women concerning diabetic status and abnormal ovulatory cycles

| Regression coefficient | Estimate | Estimated standard error | Significance level |
|---|---|---|---|
| Ordinary logistic regression | | | |
| a | –0.72 | 0.15 | – |
| b | 0.55 | 0.26 | 0.032 |
| Random effects logistic regression | | | |
| a¹ | –0.89 | 0.21 | – |
| b | 0.67 | 0.38 | 0.079 |

¹ Population mean.
14.3.3. Illustrative Results
Among the 23 diabetic women, 43 of 106 ovulatory cycles (41%) were observed to be abnormal, while in the 58 healthy women, 51 of 181 cycles (28%) were abnormal. Table 14.2 summarizes the results of fitting two different logistic regression models to these data. The first analysis uses a simple logistic model that corresponds to equation (14.1), and the second is based on the random effects model specified in equation (14.2).
If we compare the results of the two different fitted models summarized in table 14.2, we see that in the ordinary logistic regression analysis, in which each of the 287 ovulatory cycles is treated as an independent observation, the estimated regression coefficient associated with diabetic status is found to be significantly different from zero. However, although the corresponding estimated regression coefficient in the random effects model is roughly the same size, and has the same sign as its counterpart in the other analysis, the estimated standard error is larger in the random effects model and hence we would conclude that the data do not represent evidence to contradict the hypothesis b = 0. This outcome reflects a typical pattern that unfolds in such cases, namely that an analysis which fails to account for the correlation in longitudinal data appropriately is more likely to identify significant explanatory variable effects than one which does make some allowance for the correlation.
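The ordinary logistic regression result for b can be recovered directly from the cycle counts quoted at the start of this section. When a logistic model with a single binary covariate is fitted to observations treated as independent, the estimate of b is the logarithm of the sample odds ratio, and its large-sample standard error is the square root of the sum of the reciprocals of the four cell counts. The sketch below applies these standard formulas; the variable names are illustrative.

```python
import math

# Cycle counts quoted in the text.
abnormal_diabetic, total_diabetic = 43, 106
abnormal_healthy, total_healthy = 51, 181

normal_diabetic = total_diabetic - abnormal_diabetic   # 63 normal cycles
normal_healthy = total_healthy - abnormal_healthy      # 130 normal cycles

# Log odds ratio, treating all 287 cycles as independent observations.
b_hat = math.log((abnormal_diabetic / normal_diabetic)
                 / (abnormal_healthy / normal_healthy))

# Large-sample standard error of the log odds ratio.
se_b = math.sqrt(1 / abnormal_diabetic + 1 / normal_diabetic
                 + 1 / abnormal_healthy + 1 / normal_healthy)

z = b_hat / se_b
p_value = math.erfc(z / math.sqrt(2.0))  # two-sided normal p-value

print(f"b = {b_hat:.2f}, SE = {se_b:.2f}, p = {p_value:.3f}")
```

These values match the ordinary logistic regression row for b in table 14.2. The calculation uses no information about which cycles came from the same woman, which is precisely the limitation that the random effects analysis addresses.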
In table 14.2 the estimated mean of the subject-specific random effects is similar to the single estimated intercept, â = –0.72, in the ordinary logistic regression. However, the two estimated values do not have the same interpretation. In addition, the estimated value of σ, the standard deviation of the common distribution of the subject-specific intercepts, is 1.02; this value is not the same as the tabulated standard error associated with the estimated mean of the distribution, i.e., 0.21.
If we base our conclusions concerning this study on the more appropriate logistic regression model that involves a random, subject-specific intercept, we can estimate that the odds of an abnormal ovulatory cycle are exp(0.67) = 1.95 times greater for a diabetic woman than for a healthy woman. The corresponding 95% confidence interval for this odds ratio is exp{0.67 ± 1.96(0.38)} = (0.93, 4.12), which includes 1. Therefore, in this limited data set the statistical evidence for an effect that being diabetic has on a woman’s ovulatory cycles is marginal. Moreover, the estimated value of σ suggests that there is considerable variation from woman to woman with respect to the probability of experiencing an abnormal ovulatory cycle.
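For readers who wish to follow the arithmetic, the interval for the odds ratio can be worked through step by step from the tabulated estimate and standard error:

$$\exp\{0.67 \pm 1.96(0.38)\} = \exp(0.67 \pm 0.745) = \left(e^{-0.075},\ e^{1.415}\right) \approx (0.93,\ 4.12).$$

Because the lower limit on the logarithmic scale is negative, the interval for the odds ratio extends below 1.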
14.3.4. Comments
In the preceding two sections, we have provided a rather brief introduction to two relatively new statistical methods that use regression models to analyze data from longitudinal studies. We hope that the examples we have discussed in §§14.2 and 14.3 will provide a basis for understanding the presentation of such analyses in the medical literature. Readers who are interested in using either Liang-Zeger or random effects regression models to analyze a particular study should consult a statistician.
Dependence between observations is both a central feature of longitudinal studies and a critical assumption in the use of regression models. We trust that the preceding discussion will enable readers to identify the possibility for dependence in a study design, and thereby furnish a starting point for choosing a suitable method of analysis.
14.4. Multi-State Models
14.4.1. The Study
Another example of a study that resulted in longitudinal data is an investigation of disease progression in patients with psoriatic arthritis reported by Gladman et al. [30]. The subjects enrolled in the study were followed prospectively over a period of 14 years. During this time, study participants were treated at a single clinic and, at each clinic visit, standardized assessments of clinical and laboratory variables were obtained.