Using and Understanding Medical Statistics (Matthews, Farewell, 2007)

ture of the practical problem makes this ad hoc approach inappropriate or inadequate, a statistician should be consulted.

In chapter 10, we will also be discussing a method for analyzing normally distributed data. In fact, the techniques which we have described in chapter 9 can be regarded as a special case of the methodology which is presented in chapter 10. We do not intend to elaborate further on this connection, since the purposes of these two chapters are quite different. Chapter 9 constitutes a brief introduction to material which, although important, is not a primary focus of this book. However, in chapter 10 we introduce a class of statistical techniques which will feature prominently in the remaining chapters. We therefore direct our attention, at this point, to the important topic of regression models.

9 Analyzing Normally Distributed Data

10 Linear Regression Models for Medical Data

10.1. Introduction

Many medical studies investigate the association of a number of different factors with an outcome of interest. In some of the previous chapters, we discussed methods of studying the role of a single factor, perhaps with stratification according to other factors to account for heterogeneity in the study population. When there is simultaneous interest in more than one factor, these techniques have limited application.

Regression models are frequently used in such multi-factor situations. These models take a variety of forms, but their common aim is to study the joint effect of a number of different factors on an outcome variable. The quantitative nature of the outcome variable often determines the particular choice of regression model. One type of model will be discussed in this chapter. The five succeeding chapters will introduce other regression models. In all six chapters, we intend to emphasize the general nature of these models and the types of questions which they are designed to answer. We hope that the usefulness of regression models will become apparent as our discussion of them proceeds.

Before we launch into a description of linear regression models and how they are used in medical statistics, it is probably important to make a few brief comments about the concept of a statistical model. A relationship such as Einstein’s famous equation linking energy and mass, E = mc², is an exact and true description of the nature of things. Statistical models, however, use equations quite differently. The equation for a statistical model is not expected to be exactly true; instead, it represents a useful framework within which the statistician is able to study relationships which are of interest, such as the association between survival and age, aggressiveness of disease and the treatment which a patient receives. There is frequently no particular biological support for statistical models. In general, they should be regarded simply as an attempt to provide an empirical summary of observed data. Finally, no model can be routinely used without checking that it does indeed provide a reasonable description of the available data.

[Figure 10.1: scatterplot of average son’s height (in) against average father’s height (in), both axes running from 55 to 80, with the fitted line drawn.]

Fig. 10.1. A scatterplot of the data collected by Pearson and Lee [17] concerning the heights of father-son pairs. The equation of the fitted regression line is Ŷ = 33.73 + 0.516X.

10.2. A Historical Note

The term ‘regression’ arose in the context of studying the heights of members of family groups. Pearson and Lee [17] collected data on the heights of 1078 father-son pairs in order to study Galton’s ‘law of universal regression’ which states that ‘Each peculiarity in a man is shared by his kinsmen, but on the average in a less degree’. Figure 10.1 shows a plot of the average height of the sons in each of 17 groups which were defined by first classifying the fathers into 17 groups, using one-inch intervals of height between 58.5 and 75.5 inches.

If we represent the height of a son by the random variable Y, and the height of a father by the random variable X, then the straight line drawn in figure 10.1 is simply the equation

Son’s height = 33.73 + 0.516 × Father’s height

or

Y = 33.73 + 0.516X.   (10.1)


We shall use numbers in parentheses to label equations to which we later wish to refer. Equation (10.1) is called a regression equation. The variable Y is called the dependent, outcome or response variable, and X is called the independent or explanatory variable. Since the adjective independent is somewhat misleading, although widely used, we shall use the adjective explanatory. Another common term which we shall also use is covariate.

Obviously, not all the father-son pairs of heights lie along the drawn straight line. Equation (10.1) represents the best-fitting straight line of the general form

Y = a + bX,   (10.2)

where best-fitting is defined as the unique line which minimizes the average of the squared distances from each observed son’s height to the chosen line. More specifically, we write the equation of the best-fitting line as

Ŷ = â + b̂X,

where â and b̂ are best choices or estimates for a and b, and Ŷ is the value of Y predicted by this model for a specified value of X. In general, statisticians use the ˆ notation to indicate that something is an estimate. The line in figure 10.1 minimizes the average of the values (Y – Ŷ)². It is called a linear regression because Y is a straight-line function of the unknown parameters a and b.
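The estimates have simple closed forms: b̂ is the sum of cross-products of deviations of X and Y divided by the sum of squared deviations of X, and â = Ȳ – b̂X̄. A minimal sketch of the computation follows; the Pearson–Lee data are not reproduced here, so the heights below are invented purely for illustration.

```python
def fit_simple_regression(x, y):
    """Return (a_hat, b_hat) minimizing the average of (y - a - b*x)**2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b_hat = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    return a_hat, b_hat

# Hypothetical father and son heights (in), for illustration only:
fathers = [60.0, 64.0, 66.0, 68.0, 72.0]
sons = [63.5, 66.0, 67.0, 68.5, 70.5]
a_hat, b_hat = fit_simple_regression(fathers, sons)
print(f"Y-hat = {a_hat:.2f} + {b_hat:.3f} X")
```

Applied to the full Pearson–Lee data set, the same computation yields the line quoted in equation (10.1).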

Notice that equation (10.1) is an empirical relation, and that it does not imply a causal connection between X and Y.

In this example, the estimated line shows that there is a regression of sons’ heights towards the average. This is indicated by the fact that b̂, the multiplier of the fathers’ heights, is much less than one. This historical terminology has persisted, so that an equation like (10.2) is still called a regression equation, and b is called a regression coefficient, whatever its value.

10.3. Multiple Linear Regression

Table 10.1 presents data from an experiment carried out by Wainwright et al. [18] to study nutritional effects on preweaning mouse pups. The level of nutrient availability was manipulated by rearing the pups in litter sizes ranging from three to twelve mice. On day 32, body weight and brain weight were measured. The values in the table are the average body weight (BODY) and brain weight (W) for each of 20 litters, two of each litter size (LITSIZ). For the purpose of illustration, we regard these averages as single observations. In this study, the effect of nutrition on brain weight is of particular interest.


Table 10.1. Average weight measurements from 20 litters of mice

Litter    Body        Brain        Litter    Body        Brain
size      weight, g   weight, g    size      weight, g   weight, g
3         9.447       0.444         8        7.040       0.414
3         9.780       0.436         8        7.253       0.409
4         9.155       0.417         9        6.600       0.387
4         9.613       0.429         9        7.260       0.433
5         8.850       0.425        10        6.305       0.410
5         9.610       0.434        10        6.655       0.405
6         8.298       0.404        11        7.183       0.435
6         8.543       0.439        11        6.133       0.407
7         7.400       0.409        12        5.450       0.368
7         8.335       0.429        12        6.050       0.401

Figure 10.2 consists of separate scatterplots of the average brain weight of the pups in a litter versus the corresponding average body weight and the litter size, as well as a third scatterplot of litter size versus average body weight. Notice that both body weight and litter size appear to be associated with the average brain weight for the mouse pups.

A simple linear regression model relates a response or outcome measurement such as brain weight to a single explanatory variable. Equation (10.1) is an example of a simple linear regression model. By comparison, a multiple linear regression model attempts to relate a measurement like brain weight to more than one explanatory variable. If we represent brain weight by W, then a multiple linear regression equation relating W to body weight and litter size would be

W = a + b1(BODY) + b2(LITSIZ).

At this point, it is helpful to introduce a little notation. If Y denotes a response variable and X1, X2, …, Xk denote explanatory variables, then a regression equation for Y in terms of X1, X2, …, Xk is

Y = a + b1X1 + b2X2 + … + bkXk

or

Y = a + ∑ᵢ₌₁ᵏ bᵢXᵢ.

Remember, also, that if we wish to refer to specific values of the variables X1, X2, …, Xk, it is customary to use the lower case letters x1, x2, …, xk.


[Figure 10.2: three scatterplots, panels a–c, with average brain weight (g) plotted on axes from 0.30 to 0.50, litter size from 0 to 12, and average body weight (g) from 5 to 11, as described in the caption below.]

Fig. 10.2. Scatterplots of average weight measurements from 20 litters of mice. a Average brain weight vs. litter size. b Average brain weight vs. average body weight. c Litter size vs. average body weight.

 

 

A multiple linear regression analysis finds estimates â, b̂1, …, b̂k of a, b1, …, bk which minimize the average value of

(Y – Ŷ)² = (Y – â – ∑ᵢ₌₁ᵏ b̂ᵢXᵢ)².

The assumption which underlies this analysis is that, for specified covariate values X1 = x1, X2 = x2, …, Xk = xk, the distribution of Y is normal with mean or expected value

a + ∑ᵢ₌₁ᵏ bᵢxᵢ.
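These least-squares estimates solve the so-called normal equations, a small system of linear equations in â, b̂1, …, b̂k. The following pure-Python sketch fits the model to the table 10.1 data; in practice one would use a statistics package rather than hand-rolled elimination, and the printed coefficients should agree, up to rounding, with the fitted equation quoted later in the text.

```python
# Multiple linear regression via the normal equations, applied to the
# litter data of table 10.1.  Illustrative sketch only; numpy.linalg.lstsq
# or a statistics package would be used in practice.

def fit_linear_model(rows, y):
    """Least-squares estimates [a, b1, ..., bk] for y = a + b1*x1 + ... + bk*xk."""
    X = [[1.0, *r] for r in rows]        # design matrix with intercept column
    n, p = len(X), len(X[0])
    # Normal equations: (X'X) beta = X'y
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         for r in range(p)]
    v = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    # Solve by Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):
        beta[r] = (v[r] - sum(A[r][c] * beta[c] for c in range(r + 1, p))) / A[r][r]
    return beta

# (LITSIZ, BODY, W) triples from table 10.1:
litters = [(3, 9.447, 0.444), (3, 9.780, 0.436), (4, 9.155, 0.417),
           (4, 9.613, 0.429), (5, 8.850, 0.425), (5, 9.610, 0.434),
           (6, 8.298, 0.404), (6, 8.543, 0.439), (7, 7.400, 0.409),
           (7, 8.335, 0.429), (8, 7.040, 0.414), (8, 7.253, 0.409),
           (9, 6.600, 0.387), (9, 7.260, 0.433), (10, 6.305, 0.410),
           (10, 6.655, 0.405), (11, 7.183, 0.435), (11, 6.133, 0.407),
           (12, 5.450, 0.368), (12, 6.050, 0.401)]
a, b_body, b_litsiz = fit_linear_model(
    [(body, litsiz) for litsiz, body, _ in litters],
    [w for _, _, w in litters])
print(f"W-hat = {a:.3f} + {b_body:.3f}(BODY) + {b_litsiz:.3f}(LITSIZ)")
```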


Table 10.2. Estimated regression coefficients and corresponding standard errors for simple linear regressions of brain weight on body weight and litter size

Covariate   Estimated regression   Estimated        Test        Significance
            coefficient            standard error   statistic   level (p-value)
BODY         0.010                 0.002            5.00        <0.0001
LITSIZ      –0.004                 0.001            4.00        <0.001

The variances of the different normal distributions corresponding to different sets of covariate values are assumed to be the same. Notice that a regression analysis makes no assumptions about the distribution of the explanatory variables X1, X2, …, Xk. In many designed experiments, values of the explanatory variables are, in fact, selected by the investigator in order to study the relationship between Y and X1, X2, …, Xk systematically.

In our example, two separate, simple linear regressions of brain weight on body weight and litter size can be performed. These lead to the two equations

Ŵ = 0.336 + 0.010(BODY) and Ŵ = 0.447 – 0.004(LITSIZ).

If the multiplier, or regression coefficient, for any variable was zero, then that variable would have no influence on the response variable W. It is very unlikely that the best estimate of a regression coefficient would be exactly zero. A more reasonable question to ask is whether the data provide evidence to contradict the hypothesis that the regression coefficient could be zero, thereby confirming a relationship between the explanatory and response variables.

Table 10.2 lists the estimated regression coefficients and their corresponding standard errors for the two simple linear regressions of brain weight on body weight and litter size. If there is no relationship between the explanatory variable and brain weight, then the ratio b̂/est. standard error (b̂) has a Student’s t distribution with the same number of degrees of freedom as the estimated standard error. The results of chapter 9, therefore, lead to the test statistic

T = |b̂| / est. standard error (b̂)

to test the hypothesis that the regression coefficient b equals zero. To calculate the significance level of the test, the observed value of T is compared with the critical values of the modulus of a Student’s t distribution with the appropriate degrees of freedom (cf. table 9.4). Table 10.2 also includes the observed values of T and the associated significance levels for each regression coefficient.
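As a sketch of this calculation, the following computes b̂, its estimated standard error and T for a simple linear regression; the three-point data set is invented purely to illustrate the arithmetic. In practice the observed T would then be referred to a Student’s t table with n – 2 degrees of freedom.

```python
import math

def regression_t_statistic(x, y):
    """b-hat, its estimated standard error, and T = |b-hat| / se(b-hat)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b_hat = sxy / sxx
    a_hat = y_bar - b_hat * x_bar
    # Residual sum of squares, on n - 2 degrees of freedom
    rss = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b_hat, se_b, abs(b_hat) / se_b

b, se, t = regression_t_statistic([0.0, 1.0, 2.0], [0.0, 1.0, 3.0])
print(f"b-hat = {b:.3f}, se = {se:.3f}, T = {t:.2f}")
```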


Table 10.3. Estimated regression coefficients and corresponding standard errors for a multiple linear regression of brain weight on body weight and litter size

Covariate   Estimated regression   Estimated        Test        Significance
            coefficient            standard error   statistic   level (p-value)
BODY         0.024                 0.007            3.43        0.003
LITSIZ       0.007                 0.003            2.33        0.032

The significance levels summarized in table 10.2 indicate that the data provide strong statistical evidence of a relationship between brain weight and each of the explanatory variables. However, these relationships are not particularly strong physical effects. In fact, the average brain weight of a mouse pup increases by only 0.010 g per one-gram increase in body weight. Similarly, the average brain weight of a mouse pup decreases by 0.004 g for each unit increase in size of the litter into which the pup is born.

The analyses corresponding to the results presented in table 10.2 examine the separate effects of body weight and litter size on brain weight. The joint effects of the two variables are investigated via the multiple linear regression equation

W = a + b1(BODY) + b2(LITSIZ),   (10.3)

which is estimated by

Ŵ = 0.178 + 0.024(BODY) + 0.007(LITSIZ).

Table 10.3 gives the estimated regression coefficients b̂1 and b̂2, their estimated standard errors and the ratios used to test for a non-zero coefficient, based on the model specified by equation (10.3).

Clearly, table 10.3 is quite different from table 10.2. The coefficients change in size and, for litter size, in sign. The difference arises because the test for a relationship between LITSIZ, for example, and brain weight in table 10.3 is performed when the other variable, BODY, is included in the model. Thus, although LITSIZ is inversely related to brain weight when examined singly (see table 10.2), the model results summarized in table 10.3 tell us that LITSIZ is positively related to brain weight after adjusting for the available information on body weight. On the other hand, body weight is positively related to brain weight, even after litter size is taken into account. Both covariates have coefficients which are significantly different from zero; therefore, each provides information concerning brain weight which is additional to that provided by the other variable. Thus, both covariates should be included in a model describing brain weight.


Table 10.3 is consistent with the biological concept of brain sparing, whereby the nutritional deprivation represented by litter size has a proportionately smaller effect on brain weight than on body weight. In a simple linear regression, litter size is negatively related to brain weight because the larger litters tend to consist of the smaller mice. In the multiple linear regression, the effect of body size is taken into account and the positive regression coefficient associated with litter size indicates that mice from large litters will have larger brain weights than mice of comparable size from smaller litters.
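The brain-sparing interpretation can be checked numerically with the fitted equation. The sketch below compares predicted brain weights for two hypothetical pups of the same body weight (7 g, a value chosen here only for illustration) reared in litters of 4 and 11.

```python
def predicted_brain_weight(body, litsiz):
    # Fitted multiple regression from section 10.3:
    # W-hat = 0.178 + 0.024(BODY) + 0.007(LITSIZ)
    return 0.178 + 0.024 * body + 0.007 * litsiz

small_litter = predicted_brain_weight(7.0, 4)   # same body weight, litter of 4
large_litter = predicted_brain_weight(7.0, 11)  # same body weight, litter of 11
print(f"litter of 4: {small_litter:.3f} g, litter of 11: {large_litter:.3f} g")
```

At a fixed body weight, the pup from the larger litter is predicted to have the larger brain weight, as the positive coefficient for LITSIZ implies.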

The analysis which we have described in this section is approximate and can be improved upon. The results of a linear regression analysis are frequently presented, in summary form, in an analysis of variance (ANOVA) table. Since this type of presentation does not extend easily to other regression models (see chapters 11–14), it is not of primary importance to the aims of this book. Our presentation is consistent with the usual analysis of other regression models which are widely used in medical research, and therefore serves as a useful introduction to the general topic of regression models.

Before we consider other types of regression models, we propose to briefly discuss correlation analysis, a historical antecedent to regression analysis. Also, the final section of this chapter describes ANOVA tables. This material is not needed to understand chapters 11–14, but is included for completeness. The amount of detail which is necessary to explain ANOVA tables far exceeds the complexity of most other explanations appearing in this book. We therefore suggest that the reader bypass §10.5 on a first reading.

10.4. Correlation

Regression models presuppose there is an outcome or response variable of some importance, and that interest in other variables derives from their potential influence on the outcome variable. Historically, the development of regression analysis was preceded by another approach known as correlation analysis.

Consider the case of two variables, Y and X. In a regression analysis, we assume Y has a normal distribution, but X may take any value and no distributional assumptions about X are required. Thus, X may be determined along with Y; X values may be fixed by the experimenter, e.g., when X represents the treatment received in a randomized clinical trial; or any number of factors might influence the X values observed in the data. Correlation analysis is restricted to the situation when X and Y are both random variables, and commonly assumes that both variables have a normal distribution.
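The quantity at the heart of a correlation analysis is the sample (Pearson) correlation coefficient r, which measures the strength of the linear association between the two random variables. A minimal sketch of its computation follows; the paired measurements are invented for illustration.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient for paired observations (x, y)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Invented paired measurements for illustration:
x = [1.0, 2.0, 4.0, 5.0, 7.0]
y = [2.1, 2.9, 4.2, 5.1, 6.8]
print(f"r = {pearson_r(x, y):.3f}")
```

By construction r lies between –1 and 1, with the extremes attained only when the pairs fall exactly on a straight line.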
