OLS regression

> residuals(fit)
     1      2      3      4      5      6      7      8      9     10     11
  2.42   0.97   0.52   0.07  -0.38  -0.83  -1.28  -1.73  -1.18  -1.63  -1.08
    12     13     14     15
 -0.53   0.02   1.57   3.12

> plot(women$height, women$weight,
    xlab="Height (in inches)",
    ylab="Weight (in pounds)")
> abline(fit)

From the output, you see that the prediction equation is

Weight-hat = −87.52 + 3.45 × Height

Because a height of 0 is impossible, you wouldn't try to give a physical interpretation to the intercept; it merely becomes an adjustment constant. From the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly different from zero (p < 0.001), indicating that there's an expected increase of 3.45 pounds of weight for every 1-inch increase in height. The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights. The multiple R-squared is also the squared correlation between the actual and predicted values (that is, R² = r²(Ŷ, Y)). The residual standard error (1.53 pounds) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels. Because there's only one predictor variable in simple regression, in this example the F test is equivalent to the t-test for the regression coefficient for height.
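If you'd rather pull these quantities out of the fitted model than read them off the printed summary, the standard extractor functions will do it. A minimal sketch, assuming the fit object from listing 8.1 is still in your workspace:

coef(fit)                  # intercept (-87.52) and slope (3.45)
confint(fit)               # 95% confidence intervals for the coefficients
summary(fit)$r.squared     # multiple R-squared (0.991)
summary(fit)$sigma         # residual standard error (1.53)
summary(fit)$fstatistic    # F statistic and its degrees of freedom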

For demonstration purposes, we’ve printed out the actual, predicted, and residual values. Evidently, the largest residuals occur for low and high heights, which can also be seen in the plot (figure 8.1).
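If you want to rebuild that comparison yourself, one simple way (a sketch; it again assumes fit is the model from listing 8.1) is to bind the observed, fitted, and residual values together:

# Observed weights, model predictions, and residuals for all 15 women
cbind(actual    = women$weight,
      predicted = fitted(fit),
      residual  = residuals(fit))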

The plot suggests that you might be able to improve on the prediction by using a line with one bend. For example, a model of the form Yi = β0 + β1X + β2X² may provide a better fit to the data. Polynomial regression allows you to predict a response variable from an explanatory variable, where the form of the relationship is an nth-degree polynomial.

8.2.3 Polynomial regression

The plot in figure 8.1 suggests that you might be able to improve your prediction using a regression with a quadratic term (that is, X²). You can fit a quadratic equation using the statement

fit2 <- lm(weight ~ height + I(height^2), data=women)

The new term I(height^2) requires explanation. height^2 adds a height-squared term to the prediction equation. The I() function treats the contents within the parentheses as a regular R expression (that is, as ordinary arithmetic). You need this because the ^ operator has a special meaning in formulas that you don't want to invoke here (see table 8.2).
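To see why the wrapper matters, compare the two calls below. Without I(), the ^ is handled by the formula machinery, where crossing a single term with itself collapses back to that term, so no squared predictor ever enters the model. A small sketch:

# Without I(): ^ is a formula operator, so height^2 reduces to height;
# this fits the same model as weight ~ height
lm(weight ~ height^2, data=women)

# With I(): height^2 is evaluated as ordinary arithmetic, adding a
# genuine height-squared predictor
lm(weight ~ height + I(height^2), data=women)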


The following listing shows the results of fitting the quadratic equation.

Listing 8.2 Polynomial regression

> fit2 <- lm(weight ~ height + I(height^2), data=women)
> summary(fit2)

Call:
lm(formula = weight ~ height + I(height^2), data = women)

Residuals:
    Min      1Q  Median      3Q     Max
-0.5094 -0.2961 -0.0094  0.2862  0.5971

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 261.87818   25.19677   10.39  2.4e-07 ***
height       -7.34832    0.77769   -9.45  6.6e-07 ***
I(height^2)   0.08306    0.00598   13.89  9.3e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.384 on 12 degrees of freedom
Multiple R-squared: 0.999,  Adjusted R-squared: 0.999
F-statistic: 1.14e+04 on 2 and 12 DF,  p-value: <2e-16

> plot(women$height, women$weight,
    xlab="Height (in inches)",
    ylab="Weight (in lbs)")
> lines(women$height, fitted(fit2))

From this new analysis, the prediction equation is

Weight-hat = 261.88 − 7.35 × Height + 0.083 × Height²

and both regression coefficients are significant at the p < 0.0001 level. The amount of variance accounted for has increased to 99.9%. The significance of the squared term (t = 13.89, p < .001) suggests that inclusion of the quadratic term improves the model fit. If you look at the plot of fit2 (figure 8.2), you can see that the curve does indeed provide a better fit.
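If you'd like a formal test to back up the visual impression, one option (a sketch; it assumes both fit from listing 8.1 and fit2 from listing 8.2 are still available) is to compare the two nested models directly:

# Does adding I(height^2) significantly reduce the residual sum of squares?
anova(fit, fit2)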

Figure 8.2 Quadratic regression for weight predicted by height

 



Linear vs. nonlinear models

Note that this polynomial equation still fits under the rubric of linear regression. It’s linear because the equation involves a weighted sum of predictor variables (height and height-squared in this case). Even a model such as

Yi = β0 + β1 × log(X1) + β2 × sin(X2)

would be considered a linear model (linear in terms of the parameters) and fit with the formula

Y ~ log(X1) + sin(X2)

In contrast, here’s an example of a truly nonlinear model:

Yi = β0 + β1 × e^(X/β2)

Nonlinear models of this form can be fit with the nls() function.
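To make that concrete, here is roughly what fitting such a model with nls() could look like. The data, parameter names, and starting values below are invented purely for illustration:

# Simulate data from Y = b0 + b1 * exp(X/b2) with some noise
set.seed(1234)
X <- seq(1, 10, length.out=50)
Y <- 2 + 3 * exp(X/4) + rnorm(50, sd=1)

# nls() requires explicit starting values for the parameters
nfit <- nls(Y ~ b0 + b1 * exp(X/b2),
            start=list(b0=1, b1=1, b2=5))
summary(nfit)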

In general, an nth-degree polynomial produces a curve with n-1 bends. To fit a cubic polynomial, you’d use

fit3 <- lm(weight ~ height + I(height^2) + I(height^3), data=women)

Although higher polynomials are possible, I’ve rarely found that terms higher than cubic are necessary.
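If spelling out each I() term feels clumsy, an equivalent formulation (not used in the text) is the poly() function with raw=TRUE, which generates the same raw polynomial terms:

# Same cubic fit as fit3; raw=TRUE requests plain powers of height
# rather than orthogonal polynomials
fit3b <- lm(weight ~ poly(height, 3, raw=TRUE), data=women)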

Before we move on, I should mention that the scatterplot() function in the car package provides a simple and convenient method of plotting a bivariate relationship. The code

library(car)
scatterplot(weight ~ height, data=women,
    spread=FALSE, smoother.args=list(lty=2), pch=19,
    main="Women Age 30-39",
    xlab="Height (inches)",
    ylab="Weight (lbs.)")

produces the graph in figure 8.3.

Figure 8.3 Scatter plot of height by weight, with linear and smoothed fits, and marginal box plots

 



This enhanced plot provides the scatter plot of weight with height, box plots for each variable in their respective margins, the linear line of best fit, and a smoothed (loess) fit line. The spread=FALSE option suppresses spread and asymmetry information. The smoother.args=list(lty=2) option specifies that the loess fit be rendered as a dashed line. The pch=19 option displays points as filled circles (the default is open circles). You can tell at a glance that the two variables are roughly symmetrical and that a curved line will fit the data points better than a straight line.

8.2.4 Multiple linear regression

When there's more than one predictor variable, simple linear regression becomes multiple linear regression, and the analysis grows more involved. Technically, polynomial regression is a special case of multiple regression. Quadratic regression has two predictors (X and X²), and cubic regression has three predictors (X, X², and X³). Let's look at a more general example.

We’ll use the state.x77 dataset in the base package for this example. Suppose you want to explore the relationship between a state’s murder rate and other characteristics of the state, including population, illiteracy rate, average income, and frost levels (mean number of days below freezing).

Because the lm() function requires a data frame (and the state.x77 dataset is contained in a matrix), you can simplify your life with the following code:

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

This code creates a data frame called states, containing the variables you’re interested in. You’ll use this new data frame for the remainder of the chapter.
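A quick sanity check that the conversion behaved as expected never hurts; for example:

str(states)        # a data frame with 50 rows and five numeric columns
head(states, 3)    # the first few states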

A good first step in multiple regression is to examine the relationships among the variables two at a time. The bivariate correlations are provided by the cor() function, and scatter plots are generated from the scatterplotMatrix() function in the car package (see the following listing and figure 8.4).

Listing 8.3 Examining bivariate relationships

> states <- as.data.frame(state.x77[,c("Murder", "Population",
                  "Illiteracy", "Income", "Frost")])
> cor(states)
            Murder Population Illiteracy Income Frost
Murder        1.00       0.34       0.70  -0.23 -0.54
Population    0.34       1.00       0.11   0.21 -0.33
Illiteracy    0.70       0.11       1.00  -0.44 -0.67
Income       -0.23       0.21      -0.44   1.00  0.23
Frost        -0.54      -0.33      -0.67   0.23  1.00

> library(car)
> scatterplotMatrix(states, spread=FALSE, smoother.args=list(lty=2),
      main="Scatter Plot Matrix")



Figure 8.4 Scatter plot matrix of dependent and independent variables for the states data, including linear and smoothed fits, and marginal distributions (kernel-density plots and rug plots)

By default, the scatterplotMatrix() function provides scatter plots of the variables with each other in the off-diagonals and superimposes smoothed (loess) and linear fit lines on these plots. The principal diagonal contains density and rug plots for each variable.

You can see that murder rate may be bimodal and that each of the predictor variables is skewed to some extent. Murder rates rise with population and illiteracy, and they fall with higher income levels and frost. At the same time, colder states have lower illiteracy rates and population and higher incomes.

Now let’s fit the multiple regression model with the lm() function (see the following listing).

Listing 8.4 Multiple linear regression

> states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)

> summary(fit)
