CHAPTER 8 Regression

When the model violates the normality assumption, you typically attempt a transformation of the response variable. You can use the powerTransform() function in the car package to generate a maximum-likelihood estimation of the power λ most likely to normalize the variable X^λ. In the next listing, this is applied to the states data.

Listing 8.10 Box–Cox transformation to normality

> library(car)
> summary(powerTransform(states$Murder))
bcPower Transformation to Normality

              Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
states$Murder       0.6     0.26            0.088              1.1

Likelihood ratio tests about transformation parameters
                      LRT df  pval
LR test, lambda=(0)   5.7  1 0.017
LR test, lambda=(1)   2.1  1 0.145

The results suggest that you can normalize the variable Murder by replacing it with Murder^0.6. Because 0.6 is close to 0.5, you could try a square root transformation to improve the model’s fit to normality. In this case, however, the hypothesis that λ=1 can’t be rejected (p = 0.145), so there’s no strong evidence that a transformation is actually needed. This is consistent with the results of the Q-Q plot in figure 8.9.
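If you did decide to pursue the square root option, a quick way to see its effect is to refit the model with a transformed response and re-examine the output. This is a minimal sketch rather than one of the book’s listings; it assumes the states data frame defined earlier in the chapter:

# Hedged sketch: refit with a square-root-transformed response
fit_sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost,
               data=states)
summary(fit_sqrt)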

When the assumption of linearity is violated, a transformation of the predictor variables can often help. The boxTidwell() function in the car package can be used to generate maximum-likelihood estimates of predictor powers that can improve linearity. An example of applying the Box–Tidwell transformations to a model that predicts state murder rates from their population and illiteracy rates follows:

> library(car)
> boxTidwell(Murder ~ Population + Illiteracy, data=states)
           Score Statistic p-value MLE of lambda
Population           -0.32    0.75          0.87
Illiteracy            0.62    0.54          1.36

The results suggest trying the transformations Population^0.87 and Illiteracy^1.36 to achieve greater linearity. But the score tests for Population (p = .75) and Illiteracy (p = .54) suggest that neither variable needs to be transformed. Again, these results are consistent with the component plus residual plots in figure 8.11.

Finally, transformations of the response variable can help in situations of heteroscedasticity (nonconstant error variance). You saw in listing 8.7 that the spreadLevelPlot() function in the car package offers a power transformation for improving homoscedasticity. Again, in the case of the states example, the constant error variance assumption is met and no transformation is necessary.
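As a brief reminder of what that looks like in code (a sketch that mirrors listing 8.7 rather than presenting a new result), you can ask spreadLevelPlot() for its suggested power transformation directly:

library(car)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
spreadLevelPlot(fit)   # prints a suggested power transformation of the response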


A caution concerning transformations

There’s an old joke in statistics: If you can’t prove A, prove B and pretend it was A. (For statisticians, that’s pretty funny.) The relevance here is that if you transform your variables, your interpretations must be based on the transformed variables, not the original variables. If the transformation makes sense, such as the log of income or the inverse of distance, the interpretation is easier. But how do you interpret the relationship between the frequency of suicidal ideation and the cube root of depression? If a transformation doesn’t make sense, you should avoid it.

8.5.3 Adding or deleting variables

Changing the variables in a model will impact its fit. Sometimes, adding an important variable will correct many of the problems that we’ve discussed. Deleting a troublesome variable can do the same thing.

Deleting variables is a particularly important approach for dealing with multicollinearity. If your only goal is to make predictions, then multicollinearity isn’t a problem. But if you want to make interpretations about individual predictor variables, then you must deal with it. The most common approach is to delete one of the variables involved in the multicollinearity (that is, one of the variables with √vif > 2). An alternative is to use ridge regression, a variant of multiple regression designed to deal with multicollinearity situations.
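A minimal sketch of how you might identify the offending variables before deleting one, using the vif() function from the car package introduced earlier in the chapter:

library(car)
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
vif(fit)              # variance inflation factors
sqrt(vif(fit)) > 2    # TRUE flags candidates for deletion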

8.5.4 Trying a different approach

As you’ve just seen, one approach to dealing with multicollinearity is to fit a different type of model (ridge regression in this case). If there are outliers and/or influential observations, you could fit a robust regression model rather than an OLS regression. If you’ve violated the normality assumption, you can fit a nonparametric regression model. If there’s significant nonlinearity, you can try a nonlinear regression model. If you’ve violated the assumptions of independence of errors, you can fit a model that specifically takes the error structure into account, such as time-series models or multilevel regression models. Finally, you can turn to generalized linear models to fit a wide range of models in situations where the assumptions of OLS regression don’t hold.
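As one illustration (a hedged sketch, not a model this chapter develops), a robust regression for the states data could be fit with rlm() from the MASS package:

library(MASS)
fit_robust <- rlm(Murder ~ Population + Illiteracy + Income + Frost,
                  data=states)
summary(fit_robust)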

We’ll discuss some of these alternative approaches in chapter 13. The decision regarding when to try to improve the fit of an OLS regression model and when to try a different approach is a complex one. It’s typically based on knowledge of the subject matter and an assessment of which approach will provide the best result.

Speaking of best results, let’s turn now to the problem of deciding which predictor variables to include in our regression model.

8.6 Selecting the “best” regression model

When developing a regression equation, you’re implicitly faced with a selection of many possible models. Should you include all the variables under study, or drop ones that don’t make a significant contribution to prediction? Should you add polynomial and/or interaction terms to improve the fit? The selection of a final regression model always involves a compromise between predictive accuracy (a model that fits the data as well as possible) and parsimony (a simple and replicable model). All things being equal, if you have two models with approximately equal predictive accuracy, you favor the simpler one. This section describes methods for choosing among competing models. The word “best” is in quotation marks, because there’s no single criterion you can use to make the decision. The final decision requires judgment on the part of the investigator. (Think of it as job security.)

8.6.1 Comparing models

You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model. In our states multiple regression model, we found that the regression coefficients for Income and Frost were nonsignificant. You can test whether a model without these two variables predicts as well as one that includes them (see the following listing).

Listing 8.11 Comparing nested models using the anova() function

> fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> anova(fit2, fit1)
Analysis of Variance Table

Model 1: Murder ~ Population + Illiteracy
Model 2: Murder ~ Population + Illiteracy + Income + Frost
  Res.Df     RSS Df Sum of Sq      F Pr(>F)
1     47 289.246
2     45 289.167  2     0.079 0.0061  0.994

Here, model 1 is nested within model 2. The anova() function provides a simultaneous test that Income and Frost add to linear prediction above and beyond Population and Illiteracy. Because the test is nonsignificant (p = .994), we conclude that they don’t add to the linear prediction and we’re justified in dropping them from our model.

The Akaike Information Criterion (AIC) provides another method for comparing models. The index takes into account a model’s statistical fit and the number of parameters needed to achieve this fit. Models with smaller AIC values—indicating adequate fit with fewer parameters—are preferred. The criterion is provided by the AIC() function (see the following listing).

Listing 8.12 Comparing models with the AIC

> fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> AIC(fit1, fit2)
     df      AIC
fit1  6 241.6429
fit2  4 237.6565

The AIC values suggest that the model without Income and Frost is the better model. Note that although the ANOVA approach requires nested models, the AIC approach doesn’t.

Comparing two models is relatively straightforward, but what do you do when there are four, or ten, or a hundred possible models to consider? That’s the topic of the next section.

8.6.2 Variable selection

Two popular approaches to selecting a final set of predictor variables from a larger pool of candidate variables are stepwise methods and all-subsets regression.

STEPWISE REGRESSION

In stepwise selection, variables are added to or deleted from a model one at a time, until some stopping criterion is reached. For example, in forward stepwise regression you add predictor variables to the model one at a time, stopping when the addition of variables would no longer improve the model. In backward stepwise regression, you start with a model that includes all predictor variables, and then delete them one at a time until removing variables would degrade the quality of the model. In stepwise stepwise regression (usually called stepwise to avoid sounding silly), you combine the forward and backward stepwise approaches. Variables are entered one at a time, but at each step, the variables in the model are reevaluated, and those that don’t contribute to the model are deleted. A predictor variable may be added to, and deleted from, a model several times before a final solution is reached.

The implementation of stepwise regression methods varies by the criteria used to enter or remove variables. The stepAIC() function in the MASS package performs stepwise model selection (forward, backward, or stepwise) using an exact AIC criterion. In the next listing, we apply backward stepwise regression to the multiple regression problem.

Listing 8.13 Backward stepwise selection

> library(MASS)
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> stepAIC(fit, direction="backward")

Start: AIC=97.75
Murder ~ Population + Illiteracy + Income + Frost

               Df Sum of Sq    RSS    AIC
- Frost         1      0.02 289.19  95.75
- Income        1      0.06 289.22  95.76
<none>                      289.17  97.75
- Population    1     39.24 328.41 102.11
- Illiteracy    1    144.26 433.43 115.99

Step: AIC=95.75
Murder ~ Population + Illiteracy + Income

               Df Sum of Sq    RSS    AIC
- Income        1      0.06 289.25  93.76
<none>                      289.19  95.75
- Population    1     43.66 332.85 100.78
- Illiteracy    1    236.20 525.38 123.61

Step: AIC=93.76
Murder ~ Population + Illiteracy

               Df Sum of Sq    RSS    AIC
<none>                      289.25  93.76
- Population    1     48.52 337.76  99.52
- Illiteracy    1    299.65 588.89 127.31

Call:
lm(formula = Murder ~ Population + Illiteracy, data = states)

Coefficients:
(Intercept)   Population   Illiteracy
  1.6515497    0.0002242    4.0807366

You start with all four predictors in the model. For each step, the AIC column provides the model AIC resulting from the deletion of the variable listed in that row. The AIC value for <none> is the model AIC if no variables are removed. In the first step, Frost is removed, decreasing the AIC from 97.75 to 95.75. In the second step, Income is removed, decreasing the AIC to 93.76. Deleting any more variables would increase the AIC, so the process stops.
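For reference, the same function handles the other directions as well. The following sketch (not part of listing 8.13) shows forward and combined selection; the scope argument is one reasonable way to bound the search, assuming the fit object from the listing above:

fit_null <- lm(Murder ~ 1, data=states)
stepAIC(fit_null,
        scope=list(lower = ~1,
                   upper = ~Population + Illiteracy + Income + Frost),
        direction="forward")
stepAIC(fit, direction="both")   # fit is the full model from listing 8.13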

Stepwise regression is controversial. Although it may find a good model, there’s no guarantee that it will find the best model. This is because not every possible model is evaluated. An approach that attempts to overcome this limitation is all subsets regression.

ALL SUBSETS REGRESSION

In all subsets regression, every possible model is inspected. The analyst can choose to have all possible results displayed, or ask for the nbest models of each subset size (one predictor, two predictors, etc.). For example, if nbest=2, the two best one-predictor models are displayed, followed by the two best two-predictor models, followed by the two best three-predictor models, up to a model with all predictors.

All subsets regression is performed using the regsubsets() function from the leaps package. You can choose R-squared, Adjusted R-squared, or Mallows Cp statistic as your criterion for reporting “best” models.

As you’ve seen, R-squared is the amount of variance in the response variable accounted for by the predictor variables. Adjusted R-squared is similar, but takes into account the number of parameters in the model. R-squared always increases with the addition of predictors. When the number of predictors is large compared to the sample size, this can lead to significant overfitting. The Adjusted R-squared is an attempt to provide a more honest estimate of the population R-squared, one that’s less likely to take advantage of chance variation in the data. The Mallows Cp statistic is also used as a stopping rule in stepwise regression. It has been widely suggested that a good model is one in which the Cp statistic is close to the number of model parameters (including the intercept).
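If you want these quantities for a single fitted model rather than from the all subsets output, a quick sketch (not one of the book’s listings) is to pull them from the model summary:

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
summary(fit)$r.squared        # ordinary R-squared
summary(fit)$adj.r.squared    # adjusted R-squared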

In listing 8.14, we’ll apply all subsets regression to the states data. The results can be plotted with either the plot() function in the leaps package or the subsets() function in the car package. An example of the former is provided in figure 8.17, and an example of the latter is given in figure 8.18.

Listing 8.14 All subsets regression

library(leaps)
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost,
                    data=states, nbest=4)
plot(leaps, scale="adjr2")

library(car)
subsets(leaps, statistic="cp",
        main="Cp Plot for All Subsets Regression")
abline(1, 1, lty=2, col="red")

 

Figure 8.17 Best four models for each subset size based on Adjusted R-square


Figure 8.17 can be confusing to read. Looking at the first row (starting at the bottom), you can see that a model with the intercept and Income has an adjusted R-square of 0.33. A model with the intercept and Population has an adjusted R-square of 0.1. Jumping to the 12th row, you see that a model with the intercept, Population, Illiteracy, and Income has an adjusted R-square of 0.54, whereas one with the intercept, Population, and Illiteracy alone has an adjusted R-square of 0.55. Here you see that a model with fewer predictors has a larger adjusted R-square (something that can’t happen with an unadjusted R-square). The graph suggests that the two-predictor model (Population and Illiteracy) is the best.
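If the plot is hard to read, the numbers behind it can be pulled out of the regsubsets object directly. This short sketch (not shown in the book) assumes the leaps object created in listing 8.14:

ss <- summary(leaps)
ss$which    # which predictors appear in each model
ss$adjr2    # adjusted R-square for each model
ss$cp       # Mallows Cp for each model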

In figure 8.18, you see the best four models for each subset size based on the Mallows Cp statistic. Better models will fall close to a line with intercept 1 and slope 1. The plot suggests that you consider a two-predictor model with Population and Illiteracy; a three-predictor model with Population, Illiteracy, and Frost, or Population, Illiteracy, and Income (they overlap on the graph and are hard to read); or a four-predictor model with Population, Illiteracy, Income, and Frost. You can reject the other possible models.

Figure 8.18 Best four models for each subset size based on the Mallows Cp statistic

In most instances, all subsets regression is preferable to stepwise regression, because more models are considered. However, when the number of predictors is large, the procedure can require significant computing time. In general, automated variable selection methods should be seen as an aid rather than a directing force in model selection. A well-fitting model that doesn’t make sense doesn’t help you. Ultimately, it’s your knowledge of the subject matter that should guide you.

8.7 Taking the analysis further

We’ll end our discussion of regression by considering methods for assessing model generalizability and predictor relative importance.

8.7.1 Cross-validation

In the previous section, we examined methods for selecting the variables to include in a regression equation. When description is your primary goal, the selection and interpretation of a regression model signals the end of your labor. But when your goal is prediction, you can justifiably ask, “How well will this equation perform in the real world?”

By definition, regression techniques obtain model parameters that are optimal for a given set of data. In OLS regression, the model parameters are selected to minimize the sum of squared errors of prediction (residuals) and, equivalently, to maximize the amount of variance accounted for in the response variable (R-squared). Because the equation has been optimized for the given set of data, it won’t perform as well with a new set of data.

We began this chapter with an example involving a research physiologist who wanted to predict the number of calories an individual will burn from the duration and intensity of their exercise, age, gender, and BMI. If you fit an OLS regression equation to this data, you’ll obtain model parameters that uniquely maximize the R-squared for this particular set of observations. But our researcher wants to use this equation to predict the calories burned by individuals in general, not only those in the original study. You know that the equation won’t perform as well with a new sample of observations, but how much will you lose? Cross-validation is a useful method for evaluating the generalizability of a regression equation.

In cross-validation, a portion of the data is selected as the training sample and a portion is selected as the hold-out sample. A regression equation is developed on the training sample, and then applied to the hold-out sample. Because the hold-out sample wasn’t involved in the selection of the model parameters, the performance on this sample is a more accurate estimate of the operating characteristics of the model with new data.

In k-fold cross-validation, the sample is divided into k subsamples. Each of the k subsamples serves as a hold-out group, and the combined observations from the remaining k-1 subsamples serve as the training group. The performance of the k prediction equations applied to the k hold-out samples is recorded and then averaged. (When k equals n, the total number of observations, this approach is called jackknifing.)


You can perform k-fold cross-validation using the crossval() function in the bootstrap package. The following listing provides a function (called shrinkage()) for cross-validating a model’s R-square statistic using k-fold cross-validation.

Listing 8.15 Function for k-fold cross-validated R-square

shrinkage <- function(fit, k=10){
  require(bootstrap)

  theta.fit <- function(x, y){lsfit(x, y)}
  theta.predict <- function(fit, x){cbind(1, x) %*% fit$coef}

  x <- fit$model[, 2:ncol(fit$model)]
  y <- fit$model[, 1]

  results <- crossval(x, y, theta.fit, theta.predict, ngroup=k)
  r2 <- cor(y, fit$fitted.values)^2
  r2cv <- cor(y, results$cv.fit)^2
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2 - r2cv, "\n")
}

Using this listing you define your functions, create a matrix of predictor and predicted values, get the raw R-squared, and get the cross-validated R-squared. (Chapter 12 covers bootstrapping in detail.)

The shrinkage() function is then used to perform a 10-fold cross-validation with the states data, using a model with all four predictor variables:

> fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states)
> shrinkage(fit)
Original R-square = 0.567
10 Fold Cross-Validated R-square = 0.4481
Change = 0.1188

You can see that the R-square based on our sample (0.567) is overly optimistic. A better estimate of the amount of variance in murder rates our model will account for with new data is the cross-validated R-square (0.448). (Note that observations are assigned to the k groups randomly, so you will get a slightly different result each time you execute the shrinkage() function.)
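If you want reproducible results from one run to the next, you can fix the random number seed before calling the function. This is a small sketch rather than part of the book’s listing, and the seed value is arbitrary:

set.seed(1234)   # arbitrary seed, chosen only for reproducibility
shrinkage(fit)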

You could use cross-validation in variable selection by choosing a model that demonstrates better generalizability. For example, a model with two predictors (Population and Illiteracy) shows less R-square shrinkage (.03 versus .12) than the full model:

> fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
> shrinkage(fit2)
Original R-square = 0.5668327
10 Fold Cross-Validated R-square = 0.5346871
Change = 0.03214554


This may make the two-predictor model a more attractive alternative.
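As an aside, the jackknifing mentioned earlier falls out of the same function if you set k to the sample size. A one-line sketch, assuming the 50-row states data frame:

shrinkage(fit2, k=nrow(states))   # leave-one-out (jackknife) cross-validation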

All other things being equal, a regression equation that’s based on a larger training sample and one that’s more representative of the population of interest will cross-validate better. You’ll get less R-squared shrinkage and make more accurate predictions.

8.7.2 Relative importance

Up to this point in the chapter, we’ve been asking, “Which variables are useful for predicting the outcome?” But often your real interest is in the question, “Which variables are most important in predicting the outcome?” You implicitly want to rank-order the predictors in terms of relative importance. There may be practical grounds for asking the second question. For example, if you could rank-order leadership practices by their relative importance for organizational success, you could help managers focus on the behaviors they most need to develop.

If predictor variables were uncorrelated, this would be a simple task. You would rank-order the predictor variables by their correlation with the response variable. In most cases, though, the predictors are correlated with each other, and this complicates the task significantly.

There have been many attempts to develop a means for assessing the relative importance of predictors. The simplest has been to compare standardized regression coefficients. Standardized regression coefficients describe the expected change in the response variable (expressed in standard deviation units) for a standard deviation change in a predictor variable, holding the other predictor variables constant. You can obtain the standardized regression coefficients in R by standardizing each of the variables in your dataset to a mean of 0 and standard deviation of 1 using the scale() function, before submitting the dataset to a regression analysis. (Note that because the scale() function returns a matrix and the lm() function requires a data frame, you convert between the two in an intermediate step.) The code and results for our multiple regression problem are shown here:

> zstates <- as.data.frame(scale(states))
> zfit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=zstates)
> coef(zfit)
 (Intercept)   Population       Income   Illiteracy        Frost
  -9.406e-17    2.705e-01    1.072e-02    6.840e-01    8.185e-03

Here you see that a one standard deviation increase in illiteracy rate yields a 0.68 standard deviation increase in murder rate, when controlling for population, income, and temperature. Using standardized regression coefficients as our guide, Illiteracy is the most important predictor and Frost is the least.

There have been many other attempts at quantifying relative importance. Relative importance can be thought of as the contribution each predictor makes to R-square, both alone and in combination with other predictors. Several possible approaches to relative importance are captured in the relaimpo package written by Ulrike Grömping (http://prof.beuth-hochschule.de/groemping/relaimpo/).
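The chapter doesn’t walk through the relaimpo code itself, but a hedged sketch of a typical call might look like the following; the lmg metric averages each predictor’s contribution to R-square over orderings of the predictors:

library(relaimpo)
fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states)
calc.relimp(fit, type="lmg", rela=TRUE)   # relative importances scaled to sum to 1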
