

You can see from this graph that as the weight of the car increases, the relationship between horsepower and miles per gallon weakens. For wt=4.2, the line is almost horizontal, indicating that as hp increases, mpg doesn’t change.
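You can confirm this numerically with predict(). Here's a minimal sketch, assuming the interaction model from the previous section (fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)):

fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars)       # interaction model from section 8.2
newdata <- data.frame(hp=c(100, 200, 300), wt=4.2)  # hold wt fixed at 4.2
predict(fit, newdata)                               # predicted mpg barely changes with hp

Repeating the exercise with wt=2.2 should show predicted mpg falling steeply as hp increases.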

Unfortunately, fitting the model is only the first step in the analysis. Once you fit a regression model, you need to evaluate whether you’ve met the statistical assumptions underlying your approach before you can have confidence in the inferences you draw. This is the topic of the next section.

[Figure 8.5 Interaction plot for hp*wt. This plot displays the relationship between mpg and hp at three values of wt (2.2, 3.2, and 4.2); hp runs from 50 to 300 and mpg from 15 to 25.]

8.3 Regression diagnostics

In the previous section, you used the lm() function to fit an OLS regression model and the summary() function to obtain the model parameters and summary statistics. Unfortunately, nothing in this printout tells you whether the model you’ve fit is appropriate. Your confidence in inferences about regression parameters depends on the degree to which you’ve met the statistical assumptions of the OLS model. Although the summary() function in listing 8.4 describes the model, it provides no information concerning the degree to which you’ve satisfied the statistical assumptions underlying the model.

Why is this important? Irregularities in the data or misspecifications of the relationships between the predictors and the response variable can lead you to settle on a model that’s wildly inaccurate. On the one hand, you may conclude that a predictor and a response variable are unrelated when, in fact, they are. On the other hand, you may conclude that a predictor and a response variable are related when, in fact, they aren’t! You may also end up with a model that makes poor predictions when applied in real-world settings, with significant and unnecessary error.

Let’s look at the output from the confint() function applied to the states multiple regression problem in section 8.2.4:

> states <- as.data.frame(state.x77[,c("Murder", "Population",
                                       "Illiteracy", "Income", "Frost")])
> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost,
            data=states)
> confint(fit)

                 2.5 %    97.5 %
(Intercept) -6.55e+00  9.021318
Population   4.14e-05  0.000406
Illiteracy   2.38e+00  5.903874
Income      -1.31e-03  0.001441
Frost       -1.97e-02  0.020830
The results suggest that you can be 95% confident that the interval [2.38, 5.90] contains the true change in murder rate for a 1% change in illiteracy rate. Additionally, because the confidence interval for Frost contains 0, you can conclude that a change in temperature is unrelated to murder rate, holding the other variables constant. But your faith in these results is only as strong as the evidence you have that your data satisfies the statistical assumptions underlying the model.
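As an aside, these intervals are just the point estimates plus or minus a t quantile times the standard error. A minimal sketch reproducing the Illiteracy interval from the coefficient table:

est <- coef(summary(fit))["Illiteracy", "Estimate"]    # point estimate
se  <- coef(summary(fit))["Illiteracy", "Std. Error"]  # standard error
tcrit <- qt(0.975, df=df.residual(fit))                # critical t value for 95%
est + c(-1, 1) * tcrit * se                            # should match confint(fit)["Illiteracy", ]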

A set of techniques called regression diagnostics provides the necessary tools for evaluating the appropriateness of the regression model and can help you to uncover and correct problems. We’ll start with a standard approach that uses functions that come with R’s base installation. Then we’ll look at newer, improved methods available through the car package.

8.3.1 A typical approach

R’s base installation provides numerous methods for evaluating the statistical assumptions in a regression analysis. The most common approach is to apply the plot() function to the object returned by lm(). Doing so produces four graphs that are useful for evaluating the model fit. Applying this approach to the simple linear regression example

fit <- lm(weight ~ height, data=women)
par(mfrow=c(2,2))
plot(fit)

produces the graphs shown in figure 8.6. The par(mfrow=c(2,2)) statement is used to combine the four plots produced by the plot() function into one large 2 × 2 graph. The par() function is described in chapter 3.

To understand these graphs, consider the assumptions of OLS regression:

Normality—If the dependent variable is normally distributed for a fixed set of predictor values, then the residual values should be normally distributed with a mean of 0. The Normal Q-Q plot (upper right) is a probability plot of the standardized residuals against the values that would be expected under normality. If you’ve met the normality assumption, the points on this graph should fall on the straight 45-degree line. Because they don’t, you’ve clearly violated the normality assumption.

Independence—You can’t tell if the dependent variable values are independent from these plots. You have to use your understanding of how the data was collected. There’s no a priori reason to believe that one woman’s weight influences another woman’s weight. If you found out that the data were sampled from families, you might have to adjust your assumption of independence.


[Figure 8.6 Diagnostic plots for the regression of weight on height. Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

Linearity—If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (that is, fitted) values. In other words, the model should capture all the systematic variance present in the data, leaving nothing but random noise. In the Residuals vs. Fitted graph (upper left), you see clear evidence of a curved relationship, which suggests that you may want to add a quadratic term to the regression.

Homoscedasticity—If you’ve met the constant variance assumption, the points in the Scale-Location graph (bottom left) should be a random band around a horizontal line. You seem to meet this assumption.
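If you'd like a numeric complement to these visual checks, base R offers a simple test of the normality assumption. A minimal sketch, assuming fit is the weight-on-height model above (the car package functions covered later in this chapter are generally preferable):

res <- residuals(fit)   # raw residuals from the height-weight fit
shapiro.test(res)       # formal test of residual normality

A small p-value here would indicate a departure from normality, consistent with what the Q-Q plot shows.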

Finally, the Residuals vs. Leverage graph (bottom right) provides information about individual observations that you may wish to attend to. The graph identifies outliers, high-leverage points, and influential observations. Specifically:

An outlier is an observation that isn’t predicted well by the fitted regression model (that is, has a large positive or negative residual).


An observation with a high leverage value has an unusual combination of predictor values. That is, it’s an outlier in the predictor space. The dependent variable value isn’t used to calculate an observation’s leverage.

An influential observation is an observation that has a disproportionate impact on the determination of the model parameters. Influential observations are identified using a statistic called Cook’s distance, or Cook’s D.

To be honest, I find the Residuals vs. Leverage plot difficult to read and not useful. You’ll see better representations of this information in later sections.
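In the meantime, the same quantities can be extracted numerically rather than read off the plot. A minimal sketch using base R accessors on the fitted model:

rstandard(fit)       # standardized residuals; large absolute values suggest outliers
hatvalues(fit)       # leverage values; compare against the average value p/n
cooks.distance(fit)  # Cook's D; large values flag influential observations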

To complete this section, let’s look at the diagnostic plots for the quadratic fit. The necessary code is

fit2 <- lm(weight ~ height + I(height^2), data=women)
par(mfrow=c(2,2))
plot(fit2)

and the resulting graph is provided in figure 8.7.

[Figure 8.7 Diagnostic plots for the regression of weight on height and height-squared. Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.]

This second set of plots suggests that the polynomial regression provides a better fit with regard to the linearity assumption, normality of residuals (except for observation 13), and homoscedasticity (constant residual variance). Observation 15 appears to be influential (based on a large Cook’s D value), and deleting it has an impact on the parameter estimates. In fact, dropping both observations 13 and 15 produces a better model fit. To see this, try

newfit <- lm(weight ~ height + I(height^2), data=women[-c(13,15),])

for yourself. But you need to be careful when deleting data. Your models should fit your data, not the other way around!
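One way to see the impact of the deletion is to compare the coefficients before and after. A quick sketch, using fit2 and newfit as defined above:

round(coef(fit2), 3)    # quadratic fit to all 15 observations
round(coef(newfit), 3)  # same model with observations 13 and 15 removed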

Finally, let’s apply the basic approach to the states multiple regression problem:

states <- as.data.frame(state.x77[,c("Murder", "Population",
                                     "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
par(mfrow=c(2,2))
plot(fit)

The results are displayed in figure 8.8. As you can see from the graph, the model assumptions appear to be well satisfied, with the exception that Nevada is an outlier.
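You can confirm the Nevada finding numerically as well; a minimal sketch:

sres <- rstandard(fit)      # standardized residuals for the states model
sres[which.max(abs(sres))]  # the most extreme residual; Nevada should top the list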

Although these standard diagnostic plots are helpful, better tools are now available in R, and I recommend their use over the plot(fit) approach.

[Figure 8.8 Diagnostic plots for the regression of murder rate on state characteristics. Four panels: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage; Nevada, Alaska, Rhode Island, Massachusetts, and Hawaii are labeled as notable points.]
