R in Action, Second Edition.pdf

CHAPTER 8 Regression

From a theoretical point of view, the analysis will help answer such questions as these:

What’s the relationship between exercise duration and calories burned? Is it linear or curvilinear? For example, does exercise have less impact on the number of calories burned after a certain point?

How does effort (the percentage of time at the target heart rate, the average walking speed) factor in?

Are these relationships the same for young and old, male and female, heavy and slim?

From a practical point of view, the analysis will help answer such questions as the following:

How many calories can a 30-year-old man with a BMI of 28.7 expect to burn if he walks for 45 minutes at an average speed of 4 miles per hour and stays within his target heart rate 80% of the time?

What’s the minimum number of variables you need to collect in order to accurately predict the number of calories a person will burn when walking?

How accurate will your prediction tend to be?
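The practical questions above can be sketched in R. The block below is only an illustration under loud assumptions: the data are simulated, and the data frame `walk`, its column names (`minutes`, `speed`, `pct_thr`, and so on), and the coefficients used to generate `calories` are all hypothetical rather than taken from any study. It fits a model with `lm()` and then asks `predict()` for a point estimate and a 95% prediction interval for the 30-year-old walker described above.

```r
# Simulated data: every name and number here is made up for illustration.
set.seed(1234)
n <- 200
walk <- data.frame(
  age     = round(runif(n, 20, 70)),
  bmi     = round(runif(n, 18, 35), 1),
  minutes = round(runif(n, 10, 90)),      # exercise duration
  speed   = round(runif(n, 2, 5), 1),     # average speed, miles per hour
  pct_thr = round(runif(n, 20, 100))      # % of time at target heart rate
)
# Hypothetical "true" relationship plus random noise
walk$calories <- with(walk,
  20 + 4.5 * minutes + 15 * speed + 2 * bmi + 0.5 * pct_thr - 0.3 * age +
    rnorm(n, sd = 25))

# Fit an OLS model with all five predictors
fit <- lm(calories ~ age + bmi + minutes + speed + pct_thr, data = walk)

# The new case from the text: age 30, BMI 28.7, 45 minutes, 4 mph, 80% in range
new_walker <- data.frame(age = 30, bmi = 28.7, minutes = 45,
                         speed = 4, pct_thr = 80)

# Point prediction (fit) with a 95% prediction interval (lwr, upr)
pred <- predict(fit, newdata = new_walker,
                interval = "prediction", level = 0.95)
print(pred)

# Residual standard error: a rough guide to typical prediction error
print(summary(fit)$sigma)
```

The width of the interval, together with the residual standard error, gives one answer to how accurate a prediction tends to be. With real data, the interval is only trustworthy if the model's assumptions hold.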

Because regression analysis plays such a central role in modern statistics, we’ll cover it in some depth in this chapter. First, we’ll look at how to fit and interpret regression models. Next, we’ll review a set of techniques for identifying potential problems with these models and how to deal with them. Third, we’ll explore the issue of variable selection. Of all the potential predictor variables available, how do you decide which ones to include in your final model? Fourth, we’ll address the question of generalizability. How well will your model work when you apply it in the real world? Finally, we’ll consider relative importance. Of all the predictors in your model, which one is the most important, the second most important, and the least important?

As you can see, we’re covering a lot of ground. Effective regression analysis is an interactive, holistic process with many steps, and it involves more than a little skill. Rather than break it up into multiple chapters, I’ve opted to present this topic in a single chapter in order to capture this flavor. As a result, this will be the longest and most involved chapter in the book. Stick with it to the end, and you’ll have all the tools you need to tackle a wide variety of research questions. Promise!

8.1 The many faces of regression

The term regression can be confusing because there are so many specialized varieties (see table 8.1). In addition, R has powerful and comprehensive features for fitting regression models, and the abundance of options can be confusing as well. For example, in 2005, Vito Ricci created a list of more than 205 functions in R that are used to generate regression analyses (http://mng.bz/NJhu).

 

Table 8.1 Varieties of regression analysis

Type of regression          Typical use
Simple linear               Predicting a quantitative response variable from a quantitative explanatory variable.
Polynomial                  Predicting a quantitative response variable from a quantitative explanatory variable, where the relationship is modeled as an nth order polynomial.
Multiple linear             Predicting a quantitative response variable from two or more explanatory variables.
Multilevel                  Predicting a response variable from data that have a hierarchical structure (for example, students within classrooms within schools). Also called hierarchical, nested, or mixed models.
Multivariate                Predicting more than one response variable from one or more explanatory variables.
Logistic                    Predicting a categorical response variable from one or more explanatory variables.
Poisson                     Predicting a response variable representing counts from one or more explanatory variables.
Cox proportional hazards    Predicting time to an event (death, failure, relapse) from one or more explanatory variables.
Time-series                 Modeling time-series data with correlated errors.
Nonlinear                   Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is nonlinear.
Nonparametric               Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is derived from the data and not specified a priori.
Robust                      Predicting a quantitative response variable from one or more explanatory variables using an approach that’s resistant to the effect of influential observations.

In this chapter, we’ll focus on regression methods that fall under the rubric of ordinary least squares (OLS) regression, including simple linear regression, polynomial regression, and multiple linear regression. OLS regression is the most common variety of statistical analysis today. Other types of regression models (including logistic regression and Poisson regression) will be covered in chapter 13.
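As a minimal sketch of how the three OLS varieties named above differ in practice, the following block fits each with `lm()`. The choice of the built-in `women` and `state.x77` datasets and the object names `fit_simple`, `fit_poly`, and `fit_multi` are assumptions made for illustration; the only thing that changes between the models is the formula.

```r
# Simple linear regression: one quantitative predictor
fit_simple <- lm(weight ~ height, data = women)

# Polynomial regression: add a squared term.
# I() makes ^ mean arithmetic exponentiation inside a formula.
fit_poly <- lm(weight ~ height + I(height^2), data = women)

# Multiple linear regression: two or more predictors.
# state.x77 is a matrix, so convert the needed columns to a data frame first.
states <- as.data.frame(
  state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit_multi <- lm(Murder ~ Population + Illiteracy + Income + Frost,
                data = states)

# Each model has one weight (coefficient) per term, plus an intercept
print(coef(fit_simple))
print(coef(fit_poly))
print(coef(fit_multi))
```

All three are linear in their coefficients, which is why a polynomial model still counts as OLS regression.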

8.1.1 Scenarios for using OLS regression

In OLS regression, a quantitative dependent variable is predicted from a weighted sum of predictor variables, where the weights are parameters estimated from the data. Let’s take a look at a concrete example (no pun intended), loosely adapted from Fwa (2006).

An engineer wants to identify the most important factors related to bridge deterioration (such as age, traffic volume, bridge design, construction materials and methods, construction quality, and weather conditions) and determine the mathematical form of these relationships. She collects data on each of these variables from a representative sample of bridges and models the data using OLS regression.

The approach is highly interactive. She fits a series of models, checks their compliance with underlying statistical assumptions, explores any unexpected or aberrant findings, and finally chooses the “best” model from among many possible models. If successful, the results will help her to do the following:

Focus on important variables, by determining which of the many collected variables are useful in predicting bridge deterioration, along with their relative importance.

Look for bridges that are likely to be in trouble, by providing an equation that can be used to predict bridge deterioration for new cases (where the values of the predictor variables are known, but the degree of bridge deterioration isn’t).

Take advantage of serendipity, by identifying unusual bridges. If she finds that some bridges deteriorate much faster or slower than predicted by the model, a study of these outliers may yield important findings that could help her to understand the mechanisms involved in bridge deterioration.
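The three uses listed above map onto standard R calls. Because no bridge data accompany this text, the sketch below substitutes the built-in `state.x77` data (an assumption made purely for illustration), with the murder rate standing in for deterioration. The new case `new_state` is hypothetical, and the cutoff of 2 for studentized residuals is a common rule of thumb, not a value taken from the source.

```r
states <- as.data.frame(
  state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# 1. Focus on important variables: estimates, standard errors, t and p values.
#    (These show which predictors are useful; they do not by themselves
#    rank relative importance.)
coef_table <- summary(fit)$coefficients
print(round(coef_table, 4))

# 2. Predict a new case whose outcome is unknown (hypothetical values)
new_state <- data.frame(Population = 5000, Illiteracy = 1.5,
                        Income = 4500, Frost = 100)
pred <- predict(fit, newdata = new_state)

# The prediction equation is a weighted sum: intercept + sum(weight * value)
predictors <- names(coef(fit))[-1]
by_hand <- sum(coef(fit) * c(1, unlist(new_state[1, predictors])))
print(c(predict = unname(pred), by_hand = by_hand))

# 3. Look for unusual cases: large studentized residuals
rs <- rstudent(fit)
unusual <- names(rs)[abs(rs) > 2]   # rule-of-thumb cutoff
print(unusual)
```

The `by_hand` value matches `predict()` because an OLS prediction is exactly the intercept plus each estimated weight multiplied by the corresponding predictor value.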

Bridges may hold no interest for you. I’m a clinical psychologist and statistician, and I know next to nothing about civil engineering. But the general principles apply to an amazingly wide selection of problems in the physical, biological, and social sciences. Each of the following questions could also be addressed using an OLS approach:

What’s the relationship between surface stream salinity and paved road surface area (Montgomery, 2007)?

What aspects of a user’s experience contribute to the overuse of massively multiplayer online role-playing games (MMORPGs) (Hsu, Wen, & Wu, 2009)?

Which qualities of an educational environment are most strongly related to higher student achievement scores?

What’s the form of the relationship between blood pressure, salt intake, and age? Is it the same for men and women?

What’s the impact of stadiums and professional sports on metropolitan area development (Baade & Dye, 1990)?

What factors account for interstate differences in the price of beer (Culbertson & Bradford, 1991)? (That one got your attention!)

Our primary limitation is our ability to formulate an interesting question, devise a useful response variable to measure, and gather appropriate data.

8.1.2 What you need to know

For the remainder of this chapter, I’ll describe how to use R functions to fit OLS regression models, evaluate the fit, test assumptions, and select among competing