Figure 10.2 shows separate plots of brain weight versus litter size and body weight, and litter size versus body weight. The plots indicate that when brain weight is high or low, body weight tends to be correspondingly high or low, whereas litter size tends to be the opposite. The relationship between litter size and body weight is similar to the relationship between brain weight and litter size. A numerical measure of the observed association between two variables is the correlation coefficient r. For completeness, we give the formula for r, which is
$$
r \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}\,,
$$
where n is the number of paired observations on X and Y.
The correlation coefficient is a number between –1 and 1; the value r = 0 indicates there is no linear relationship between the two variables. A statistical procedure has been developed to test the hypothesis of no association between X and Y, based on the observed correlation coefficient, but we do not intend to discuss it here. If r is negative, then X and Y are said to be negatively correlated, implying that low values of X tend to occur with high values of Y and vice versa. This kind of association is illustrated by the relationship between brain weight and litter size, which have a correlation coefficient of –0.62. If r is positive, then X and Y are said to be positively correlated and have a relationship like that observed between brain weight and body weight. The correlation coefficient for these variables is 0.75.
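As a quick illustration of the computation, here is a minimal sketch using NumPy and hypothetical measurement pairs (the rat data are not reproduced here):

```python
import numpy as np

# Hypothetical paired measurements; substitute the real (x, y) data.
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3])
y = np.array([2.0, 2.9, 3.9, 5.1, 4.8, 6.6])

# Pearson correlation coefficient r, following the formula above.
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
r = num / den
print(r)
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in estimate, for comparison
```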
Figure 10.3 contains eight plots which illustrate the degree of linear relationship between the values of two variables, X and Y, for various positive and negative values of the correlation coefficient. The plots are based on 50 (X, Y) pairs and have been artificially constructed so that the 50 X values are identical in every plot.
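Scatterplots of this kind are easy to simulate; the following is a minimal sketch assuming a bivariate normal model, which is one convenient way to fix the population correlation (unlike figure 10.3, this sketch draws fresh X values each time rather than holding them fixed across plots):

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.71  # target population correlation

# Draw 50 pairs from a bivariate normal with unit variances and correlation rho.
cov = [[1.0, rho], [rho, 1.0]]
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=50)
x, y = xy[:, 0], xy[:, 1]

print(np.corrcoef(x, y)[0, 1])  # sample r; varies around 0.71
```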
The correlation coefficient can be used in the initial examination of a data set to identify relationships which deserve further study. However, it is generally more useful to link two variables via a regression equation. From a regression analysis we can see directly how changes in the explanatory variable are associated with changes in the outcome variable. The regressions of litter size on brain weight and body weight are summarized by the equations
LITŜIZ = 47.41 – 95.76(W) and LITŜIZ = 23.51 – 2.07(BODY).
Thus, for a specified value of brain weight or body weight, we could ‘predict’ a value for litter size.
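For instance, the fitted equations can be evaluated directly; a minimal sketch with hypothetical input values, since the raw measurements are not reproduced here:

```python
# Fitted regression equations from the text; the input values below are hypothetical.
def litsiz_from_brain(w):
    """Predicted litter size from brain weight W (units as in the source data)."""
    return 47.41 - 95.76 * w

def litsiz_from_body(body):
    """Predicted litter size from body weight BODY."""
    return 23.51 - 2.07 * body

print(litsiz_from_brain(0.40))  # ~9.1 for a hypothetical brain weight of 0.40
print(litsiz_from_body(9.0))    # ~4.9 for a hypothetical body weight of 9.0
```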
Fig. 10.3. Scatterplots of 50 artificial (X, Y) measurement pairs illustrating various values of the estimated correlation coefficient r. a r = 0.71, b r = 0.99, c r = 0.53, d r = –0.01, e r = –0.27, f r = –0.51, g r = –0.75, h r = –0.92.
Another reason for generally preferring regression analysis is the fact that correlation analysis is applicable only to situations when both X and Y are random. In many studies, the selection of subjects will depend on the values of certain variables for these subjects. Such selection invalidates correlation analysis. For example, Armitage et al. [19] indicate that if, from a large population of individuals, selection restricts the values of one variable to a limited range, the absolute value of the correlation coefficient will decrease.
Two additional points deserve brief mention. The first concerns correlation measures which do not depend on the assumption that X and Y have normal distributions. Two of these measures are Spearman's rank correlation coefficient and Kendall's τ (tau). These are useful for analyzing non-normal data, but the general reservations concerning correlation analysis which we have already mentioned apply equally to any measure of correlation.
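Both rank-based measures are available in scipy.stats; a minimal sketch with hypothetical paired observations:

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations; only their ranks enter these measures.
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3])
y = np.array([2.0, 2.9, 3.9, 5.1, 4.8, 6.6])

rho, p_rho = stats.spearmanr(x, y)   # Spearman's rank correlation coefficient
tau, p_tau = stats.kendalltau(x, y)  # Kendall's tau
print(rho, tau)
```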
The second issue is the following. Although we advocate the use of regression models, it is important to realize that the estimated regression line of Y on X is not the same as that for X on Y. This can be seen by comparing the regression lines for brain weight on litter size and litter size on brain weight. This asymmetry may seem strange, but it is linked to the use of the differences (Y – Ŷ) – or (X – X̂) if X is the response variable – as an estimation criterion. If one variable is random, say Y, and the other is selectively sampled, then the regression of Y on X will be sensible. Otherwise, the choice of a suitable regression model may be determined by some preference for predicting one variable on the basis of the other.
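The asymmetry is easy to demonstrate numerically; a minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=50)
y = 0.6 * x + rng.normal(scale=0.8, size=50)

# Regression of y on x minimizes the vertical deviations (y - yhat) ...
b_yx = np.polyfit(x, y, 1)[0]
# ... while the regression of x on y minimizes (x - xhat); re-expressed as a
# slope in the (x, y) plane, the two fitted lines differ.
b_xy = 1.0 / np.polyfit(y, x, 1)[0]
print(b_yx, b_xy)  # not equal: the two regression lines are different
```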
10.5. The Analysis of Variance
As we mentioned in §10.3, a multiple regression analysis is frequently summarized in an analysis of variance (ANOVA) table. This section, which is needed only for the understanding of chapter 15, provides a brief introduction to ANOVA tables. On a first reading, we advise readers to skip this section and the related chapter 15 and to proceed directly to chapter 11.
In the example we have been discussing, we have observed values of the response variable, W, and predicted values, Ŵ, which are derived from the regression equation. Let W̄ represent the average of all the observed brain weights. In chapter 1, we learned that measures of spread, such as the variance, usually involve the quantity (W – W̄), the difference between an observed value of W and the average. Strictly speaking, we should refer to the squared difference (W – W̄)² rather than the simple difference (W – W̄). We will shortly introduce the squared difference; however, it is simpler, for the moment, to begin our explanation by discussing (W – W̄). In fact, the analysis of variance is based on the relation

(W – W̄) = (W – Ŵ) + (Ŵ – W̄),

which decomposes (W – W̄) into two separate components. If we regard (W – W̄) as the 'total variation' in W, it is natural to ask what components of variation (W – Ŵ) and (Ŵ – W̄) represent.

In the absence of a regression model for the mean value of W, the dependent variable, W̄ represents a natural estimate. Thus, (W – W̄) can be interpreted as the variation of W around this natural estimate. Once we have specified a regression model for the mean value of W, we have a different estimate, Ŵ, which is based on the model. Moreover, (W – Ŵ) will generally be smaller than (W – W̄). Therefore, of the total variation represented by (W – W̄), the component (Ŵ – W̄) has been 'explained' or accounted for by the fitted regression model for W. That is, since Ŵ is now the estimated mean value of W, based on the values of LITSIZ and BODY, we expect W to differ from W̄ by (Ŵ – W̄). The remaining component, (W – Ŵ), is called the 'residual variation', i.e., the variation in W which is not explained by the regression equation.
Now that we have informally described the decomposition of the variance in W, we introduce squared differences, which are the correct representation.
Let W1, W2, ..., Wn be n observed values of W, and let $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$ be their average. The total variation in these data can be represented by

$$
\sum_{i=1}^{n} (W_i - \bar{W})^2 = (W_1 - \bar{W})^2 + (W_2 - \bar{W})^2 + \dots + (W_n - \bar{W})^2 ,
$$
which is (n–1) times the sample variance for W. Although it is not obvious, it can be shown that
$$
\sum_{i=1}^{n} (W_i - \bar{W})^2 \;=\; \sum_{i=1}^{n} (W_i - \hat{W}_i)^2 \;+\; \sum_{i=1}^{n} (\hat{W}_i - \bar{W})^2 .
$$
Thus, the total variation in W decomposes into two sums of squared differences. In view of the preceding discussion involving simple differences, it should not be difficult to accept that the second sum represents the component of total variation in W which is accounted for by the fitted regression model, while the first sum represents the residual variation. If we use SS to represent a sum of squares, then we can rewrite the above equation, symbolically, as
SSTotal = SSResidual + SSModel;
the reasons for the subscripts should be quite obvious. Each sum of squares of the form
$$
\sum_{i=1}^{n} \bigl( W_i - [\text{estimated value of } W_i] \bigr)^2
$$
has associated with it a quantity called its degrees of freedom (DF). This quantity is equal to the number of terms in the sum minus the number of values which must be calculated from the Wi's in order to specify all the estimated values. For example, in SSTotal the estimated value for each Wi is the same, namely W̄. Thus, we need one calculated value, and the degrees of freedom for SSTotal are (n – 1). For SSResidual, the estimated value of each Wi is equal to
$$
\hat{W}_i = \hat{a} + \sum_{j=1}^{k} \hat{b}_j x_j ,
$$

where x1, …, xk are the values of the covariates corresponding to Wi. These n estimates require (k + 1) calculated values, $\hat{a}, \hat{b}_1, \ldots, \hat{b}_k$; therefore, the degrees of freedom for SSResidual are (n – k – 1).
Although we shall not attempt to justify the following result, it can be shown that
DFTotal = DFResidual + DFModel.
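Both decompositions are easy to verify numerically; a minimal sketch with simulated data and NumPy's least-squares routine:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 20, 2                       # n observations, k covariates
X = rng.normal(size=(n, k))
w = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=0.1, size=n)

# Fit the linear model by least squares (column of ones for the intercept).
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, w, rcond=None)
w_hat = A @ coef

ss_total = np.sum((w - w.mean()) ** 2)
ss_resid = np.sum((w - w_hat) ** 2)
ss_model = np.sum((w_hat - w.mean()) ** 2)
print(np.isclose(ss_total, ss_resid + ss_model))  # True: SS identity holds
print(n - 1, (n - k - 1) + k)                     # DF identity: 19 = 17 + 2
```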
Table 10.4. An ANOVA table for the regression analysis of brain weight
Term        SS          DF    MS          F
Model       0.004521     2    0.002260    15.8
Residual    0.002429    17    0.000143
Total       0.006950    19

R2 = 0.65.
Therefore, since DFTotal equals (n – 1) and DFResidual equals (n – k – 1), it follows that the degrees of freedom for SSModel are equal to k.
The only remaining unknown quantities which appear in an ANOVA table are called mean squares (MS); these are defined to be the ratio of a sum of squares to its degrees of freedom, i.e., MS = SS/DF. The variance estimates which we discussed in chapter 9 can be recognized as mean squares, and this is partly the reason that statisticians call the method of analysis which we are presently describing ‘the analysis of variance’.
The typical format for an ANOVA table is shown in table 10.4, which gives the appropriate entries for the regression of W on BODY and LITSIZ. From this table, two quantities are usually calculated. The first of these is a ratio called R2 (R-squared), which is equal to
R2 = SSModel/SSTotal.
This ratio indicates the fraction of the total variation in W which is accounted for by the fitted regression model. There are no formal statistical tests associated with R2; it is used primarily as a descriptive summary. In table 10.4, the value of R2 is 0.65, indicating that 65% of the total variation in W is explained by the regression model.
The second quantity which is calculated from an ANOVA table such as table 10.4 is the ratio of MSModel to MSResidual. This number is frequently called an F-ratio because, if the null hypothesis that all the regression coefficients b1, …, bk are equal to zero is true, the ratio MSModel/MSResidual should have an F-distribution with DFModel and DFResidual degrees of freedom. Recall that the F-distribution was introduced in §9.5.
The observed value of this F-ratio is usually compared with the critical values for the appropriate F-distribution (see table 9.7), and if the observed value exceeds the critical value, the regression is said to be significant, i.e., the data contradict the null hypothesis that all the regression coefficients are zero. In table 10.4, the observed F-ratio is 0.00226/0.000143 = 15.8. Since the 5% critical value for an F-distribution with 2 and 17 degrees of freedom is 3.59, we conclude that the regression is significant at the 5% level; the data contradict the hypothesis that both BODY and LITSIZ have no effect on brain weight. This result is consistent with the conclusion which we reached in §10.3 (cf. table 10.3) that the regression coefficients for BODY and LITSIZ are significantly different from zero.

Table 10.5. An expanded ANOVA table for the regression analysis of brain weight

Term           SS          DF    MS          F
Model          0.004521     2    0.002260    15.8
  BODY         0.003869     1    0.003869    27.1
  LITSIZ|BODY  0.000652     1    0.000652     4.6
Residual       0.002429    17    0.000143
Total          0.006950    19
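The 5% critical value quoted above is easy to reproduce, and the same distribution yields an exact p-value for the observed ratio; a minimal sketch using scipy.stats:

```python
from scipy import stats

f_obs, df_model, df_resid = 15.8, 2, 17

crit = stats.f.ppf(0.95, df_model, df_resid)  # 5% critical value, ~3.59
p = stats.f.sf(f_obs, df_model, df_resid)     # p-value for the observed F-ratio
print(crit, p)
```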
Table 10.3 summarizes the results of tests concerning the effect of individual covariates when the other covariate was included in the regression model. For example, we determined whether LITSIZ had any influence on brain weight if BODY was already included in the model. By comparison, the F-test which we have discussed above is a joint test of the hypothesis that all the regression coefficients are zero. However, the ANOVA table which we have described can be generalized to address the question of the individual effect of each covariate.
If we had calculated an ANOVA table for the regression of W on BODY, then the SSModel would have been 386.9 × 10⁻⁵, and DFModel would have been one. This number, 386.9 × 10⁻⁵, represents the component of the SSModel in table 10.4 which is due to BODY alone. If we next add LITSIZ to the regression equation, so that we are fitting the model described by table 10.4, the SSModel increases by 65.2 × 10⁻⁵, from 386.9 × 10⁻⁵ to 452.1 × 10⁻⁵. Thus, 65.2 × 10⁻⁵ is the component of SSModel which is accounted for by LITSIZ when the model already includes BODY. Notice, also, that we could have performed these calculations in the reverse order, i.e., LITSIZ first, followed by BODY.

In table 10.5, the SSModel is divided into two parts. One part is labelled BODY and represents the component of SSModel which is due to BODY; the other part is labelled LITSIZ|BODY and represents the component which is due to LITSIZ in addition to BODY. The degrees of freedom for each component in the sum of squares are one, since DFModel equals two and the component of the SSModel which is due to BODY has one degree of freedom. Therefore, two F-ratios can be calculated, and these appear in the last column of table 10.5; one F-ratio corresponds to BODY alone, and the other represents LITSIZ|BODY. If the null hypothesis that BODY has no influence on W is true, the F-ratio for BODY should have an F-distribution on 1 and 17 degrees of freedom. Quite separately, if the null hypothesis is true that LITSIZ has no influence on W which is additional to the effect of BODY, then the F-ratio for LITSIZ|BODY should also have an F-distribution on 1 and 17 degrees of freedom. Since the 5% critical value for this particular F-distribution is 4.45, we conclude that each of the terms in the regression model – BODY and LITSIZ|BODY – is necessary, since a test of the respective null hypothesis has a significance level of less than 5%. This conclusion coincides with the analysis which we discussed in §10.3 (cf. table 10.3).
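These sequential sums of squares ('BODY first, then LITSIZ given BODY') can be reproduced by fitting nested models; a minimal sketch with hypothetical data, since the original measurements are not reproduced here:

```python
import numpy as np

def ss_model(w, columns):
    """SS explained by a least-squares fit of w on the given covariate columns."""
    A = np.column_stack([np.ones(len(w))] + columns)
    w_hat = A @ np.linalg.lstsq(A, w, rcond=None)[0]
    return np.sum((w_hat - w.mean()) ** 2)

rng = np.random.default_rng(4)
body = rng.normal(10, 2, size=20)                 # hypothetical body weights
litsiz = rng.integers(3, 12, size=20).astype(float)
w = 0.3 + 0.01 * body - 0.004 * litsiz + rng.normal(scale=0.01, size=20)

ss_body = ss_model(w, [body])           # SS due to BODY alone
ss_full = ss_model(w, [body, litsiz])   # SS for the full model
print(ss_body, ss_full - ss_body)       # second value: the LITSIZ|BODY component
```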
Table 10.6. An ANOVA table for the regression of brain weight on litter size

Term           SS          DF    MS          F
Model          0.002684     1    0.002684    11.3
Residual       0.004266    18    0.000237
  Lack of fit  0.001290     8    0.000161    0.54
  Pure error   0.002976    10    0.000298
Total          0.006950    19
To illustrate one additional type of calculation, table 10.6 presents an ANOVA table for the simple linear regression of brain weight on litter size. We assume no additional variables are available. In an analysis of variance, the MSResidual is often used as an estimate of variance. The MSResidual represents the variation which cannot be accounted for by the regression equation, and it is often assumed that this unexplained variation is the natural variance of the observations about the estimated values which are determined by the regression equation. Since we have two independent observations on each of the ten litter sizes, we can calculate a separate, independent estimate of the natural variation in the model. Let W1 and W2 represent the two observations for a single litter size. The natural estimate of the mean brain weight for this litter size is W̄ = (W1 + W2)/2, and an estimate of the residual variation, based on these weights alone, would be a SS equal to (W1 – W̄)² + (W2 – W̄)². Since we have two terms in the sum and one calculated value, W̄, the degrees of freedom for this sum would be one. If we repeat this calculation for all ten litter sizes, then the total of the ten individual sums of squares is 297.6 × 10⁻⁵, and this total sum of squares has 10 × 1 = 10 degrees of freedom.
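A minimal sketch of this pure-error computation, using hypothetical replicate pairs keyed by litter size:

```python
import numpy as np

# Hypothetical duplicate brain weights for each of ten litter sizes.
rng = np.random.default_rng(5)
litter_sizes = np.arange(3, 13)
pairs = {ls: rng.normal(0.42 - 0.002 * ls, 0.015, size=2) for ls in litter_sizes}

ss_pure, df_pure = 0.0, 0
for w1, w2 in pairs.values():
    wbar = (w1 + w2) / 2
    ss_pure += (w1 - wbar) ** 2 + (w2 - wbar) ** 2  # one DF per replicate pair
    df_pure += 1

ms_pure = ss_pure / df_pure  # the 'pure error' mean square
print(ss_pure, df_pure, ms_pure)
```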
This type of calculation, which leads to a MS of 297.6 × 10⁻⁵/10 = 29.8 × 10⁻⁵ in our example, is often called a calculation of 'pure error'. This is because,
regardless of the information we have concerning an individual litter size, the variation of independent observations from litters of the same size can never be accounted for by the regression equation. Therefore, MSPure Error is a better
estimate of the natural variance of the observations than MSResidual. Recall that we are assuming we do not have any information apart from brain weight and
litter size. To calculate a SSPure Error for the regression summarized in table 10.5 would require independent observations on mice which have the same body
weight and were reared in litters of the same size. Such observations are not available, and therefore the pure error calculations cannot be carried out in that particular case.
The SSPure Error is one component of SSResidual, and the remainder is usually called the Lack of Fit component, since it constitutes a part of the SSTotal which cannot be accounted for either by the regression model or by pure error. The SSLack of Fit measures the potential for improvement in the model for W if additional covariate information is used, or possibly if an alternative form of the regression equation is specified instead of the current version. A test that the lack of fit in the current model is significant can be based on the F-ratio MSLack of Fit/MSPure Error. If the null hypothesis that there is no lack of fit in the current model is true, this ratio should have an F-distribution with degrees of freedom DFLack of Fit = DFResidual – DFPure Error and DFPure Error. We will not discuss the question of lack of fit in detail; however, if there is a significant lack of fit, it is customary to plot the values of W – Ŵ for all the observations to see if they appear to be normally distributed. If so, then the lack of fit is usually attributed to unavailable information rather than to the existence of better alternative models which involve the same variates. In table 10.6, the test for lack of fit is not significant.
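Using the entries of table 10.6, the lack-of-fit test can be reproduced directly; a minimal sketch with scipy.stats:

```python
from scipy import stats

ms_lack, ms_pure = 0.000161, 0.000298  # mean squares from table 10.6
df_lack, df_pure = 8, 10               # DF: residual (18) minus pure error (10)

f_ratio = ms_lack / ms_pure            # 0.54, as reported in table 10.6
p = stats.f.sf(f_ratio, df_lack, df_pure)
print(f_ratio, p)                      # large p: no evidence of lack of fit
```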
Despite its obvious link to the methods of linear regression, many investigators regard ANOVA as a specialized statistical method. This is particularly the case when ANOVA is used to evaluate the results of carefully designed experiments in which the researcher fixes or controls the operational settings of certain explanatory variables, usually called factors. Although ANOVA, and the corresponding ANOVA tables that we have described in this section, do not play a role in the methods of analysis used with other types of regression models, such as those involving binary or count responses, ANOVA appears with increasing frequency in the medical literature. We will therefore return to the topic of ANOVA in chapter 15, but judge it best to first introduce the use of regression methods for other kinds of response measurements in chapters 11–14.
11
Binary Logistic Regression
11.1. Introduction
One reason that linear regression is not as widely used in medical statistics as in other fields is that the outcome variable in a medical study frequently cannot be assumed to have a normal distribution. A common type of response or outcome variable in medical research is a binary variable. These response variables take one of two values, and were discussed extensively in chapters 2 through 5. In this chapter, we describe a particular regression model for a binary response variable.
To illustrate the methodology, we shall consider an example concerning bone marrow transplantation for the treatment of aplastic anemia. One response of major interest is graft rejection. A binary variable, Y, can be defined so that Y = 1 corresponds to graft rejection and Y = 0 represents graft acceptance.
Table 11.1 is taken from an article by Storb et al. [3]. It presents a binary logistic regression analysis of graft rejection, relating the response variable, Y,
Table 11.1. Maximum likelihood fit of a binary logistic regression model to marrow-graft rejection data on 68 patients with aplastic anemia [3]

Factor               Logistic      Standard    Ratio
                     coefficient   error
Marrow cell dose     –1.005        0.344       –2.92 (p < 0.01)
Age                  –0.457        0.275       –1.66 (p < 0.10)
Blood units           1.112        0.672        1.65 (p < 0.10)
Transplant year       0.735        0.319        2.30 (p < 0.05)
Androgen treatment    1.417        0.805        1.76 (p < 0.10)

Reprinted from Storb et al. [3] with the kind permission of the publisher.
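A model of this kind is straightforward to fit in standard software; a minimal sketch using statsmodels with simulated data (hypothetical covariates and coefficients, not the Storb et al. data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 68
x = rng.normal(size=(n, 2))                  # two hypothetical covariates
eta = -0.5 + x @ np.array([-1.0, 0.7])       # linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # binary response (1 = rejection)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(model.params)              # logistic coefficients
print(model.bse)                 # standard errors
print(model.params / model.bse)  # ratios, as in the last column of table 11.1
```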