
t.test(y1, y2, paired=TRUE)

where y1 and y2 are the numeric vectors for the two dependent groups. The results are as follows:

> library(MASS)
> sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x), sd=sd(x))))
       U1    U2
mean 95.5 33.98
sd   18.0  8.45

> with(UScrime, t.test(U1, U2, paired=TRUE))

Paired t-test

data: U1 and U2

t = 32.4066, df = 46, p-value < 2.2e-16

alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 57.67003 65.30870
sample estimates:
mean of the differences
               61.48936

The mean difference (61.5) is large enough to warrant rejection of the hypothesis that the mean unemployment rate for older and younger males is the same. Younger males have a higher rate. In fact, the probability of obtaining a sample difference this large if the population means are equal is less than 0.00000000000000022 (that is, 2.2e–16).

7.4.3 When there are more than two groups

What do you do if you want to compare more than two groups? If you can assume that the data are independently sampled from normal populations, you can use analysis of variance (ANOVA). ANOVA is a comprehensive methodology that covers many experimental and quasi-experimental designs. As such, it has earned its own chapter. Feel free to abandon this section and jump to chapter 9 at any time.

7.5 Nonparametric tests of group differences

If you’re unable to meet the parametric assumptions of a t-test or ANOVA, you can turn to nonparametric approaches. For example, if the outcome variables are severely skewed or ordinal in nature, you may wish to use the techniques in this section.

7.5.1 Comparing two groups

If the two groups are independent, you can use the Wilcoxon rank sum test (more popularly known as the Mann–Whitney U test) to assess whether the observations are sampled from the same probability distribution (that is, whether the probability of obtaining higher scores is greater in one population than the other). The format is either

wilcox.test(y ~ x, data)


where y is numeric and x is a dichotomous variable, or

wilcox.test(y1, y2)

where y1 and y2 are the outcome variables for each group. The optional data argument refers to a matrix or data frame containing the variables. The default is a two-tailed test. You can add the option exact to produce an exact test, and alternative="less" or alternative="greater" to specify a directional test.
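
For instance, the following sketch (illustrative only; it borrows the UScrime data that's analyzed next, and the direction of the alternative is chosen arbitrarily for the example) requests an exact, one-sided test:

# Sketch: an exact, directional Mann-Whitney U test. Pick the direction
# of the alternative from your hypothesis, not from the data.
library(MASS)                               # provides the UScrime data frame
wilcox.test(Prob ~ So, data=UScrime,
            exact=TRUE,                     # exact p-value rather than the normal approximation
            alternative="less")             # H1: Prob tends to be lower in the first group (So=0)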

If you apply the Mann–Whitney U test to the question of incarceration rates from the previous section, you’ll get these results:

> with(UScrime, by(Prob, So, median))
So: 0
[1] 0.0382
--------------------
So: 1
[1] 0.0556

> wilcox.test(Prob ~ So, data=UScrime)

Wilcoxon rank sum test

data: Prob by So

W = 81, p-value = 8.488e-05

alternative hypothesis: true location shift is not equal to 0

Again, you can reject the hypothesis that incarceration rates are the same in Southern and non-Southern states (p < .001).

The Wilcoxon signed rank test provides a nonparametric alternative to the dependent sample t-test. It’s appropriate in situations where the groups are paired and the assumption of normality is unwarranted. The format is identical to the Mann–Whitney U test, but you add the paired=TRUE option. Let’s apply it to the unemployment question from the previous section:

> sapply(UScrime[c("U1","U2")], median)
U1 U2
92 34

> with(UScrime, wilcox.test(U1, U2, paired=TRUE))

Wilcoxon signed rank test with continuity correction

data: U1 and U2

V = 1128, p-value = 2.464e-09

alternative hypothesis: true location shift is not equal to 0

Again, you reach the same conclusion as with the paired t-test.

In this case, the parametric t-tests and their nonparametric equivalents reach the same conclusions. When the assumptions for the t-tests are reasonable, the


parametric tests will be more powerful (more likely to find a difference if it exists). The nonparametric tests are more appropriate when the assumptions are grossly unreasonable (for example, rank ordered data).

7.5.2 Comparing more than two groups

When there are more than two groups to be compared, you must turn to other methods. Consider the state.x77 dataset from section 7.4. It contains population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate data for US states. What if you want to compare the illiteracy rates in four regions of the country (Northeast, South, North Central, and West)? This is called a one-way design, and there are both parametric and nonparametric approaches available to address the question.

If you can’t meet the assumptions of ANOVA designs, you can use nonparametric methods to evaluate group differences. If the groups are independent, a Kruskal–Wallis test will provide you with a useful approach. If the groups are dependent (for example, repeated measures or randomized block design), the Friedman test is more appropriate.

The format for the Kruskal–Wallis test is

kruskal.test(y ~ A, data)

where y is a numeric outcome variable and A is a grouping variable with two or more levels (if there are two levels, it’s equivalent to the Mann–Whitney U test). For the Friedman test, the format is

friedman.test(y ~ A | B, data)

where y is the numeric outcome variable, A is a grouping variable, and B is a blocking variable that identifies matched observations. In both cases, data is an optional argument specifying a matrix or data frame containing the variables.
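
The Friedman test isn't applied to an example in this section, so here's a minimal sketch with simulated data (the data and variable names are invented for illustration): ten subjects each provide a score under three conditions, and the test asks whether the conditions differ.

# Sketch with simulated data: 10 subjects (blocks), each measured once
# under 3 conditions (groups).
set.seed(1234)
ratings <- data.frame(
  subject   = factor(rep(1:10, each=3)),         # blocking variable B
  condition = factor(rep(c("A", "B", "C"), 10)), # grouping variable A
  score     = rnorm(30)                          # numeric outcome y
)
friedman.test(score ~ condition | subject, data=ratings)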

Let’s apply the Kruskal–Wallis test to the illiteracy question. First, you’ll have to add the region designations to the dataset. These are contained in the dataset state.region distributed with the base installation of R.

states <- as.data.frame(cbind(state.region, state.x77))

Now you can apply the test:

> kruskal.test(Illiteracy ~ state.region, data=states)

Kruskal-Wallis rank sum test

data: states$Illiteracy by states$state.region
Kruskal-Wallis chi-squared = 22.7, df = 3, p-value = 4.726e-05

The significance test suggests that the illiteracy rate isn’t the same in each of the four regions of the country (p < .001).

Although you can reject the null hypothesis of no difference, the test doesn’t tell you which regions differ significantly from each other. To answer this question, you


could compare groups two at a time using the Mann–Whitney U test. A more elegant approach is to apply a simultaneous multiple comparisons procedure that makes all pairwise comparisons, while controlling the type I error rate (the probability of finding a difference that isn’t there). The npmc package provides the nonparametric multiple comparisons you need.

To be honest, I’m stretching the definition of basic in the chapter title quite a bit, but because it fits well here, I hope you’ll bear with me. First, be sure to install the npmc package. The npmc() function in this package expects its input to be a two-column data frame with a column named var (the dependent variable) and a column named class (the grouping variable). The following listing contains the code you can use to accomplish this.

Listing 7.20 Nonparametric multiple comparisons

> class <- state.region
> var <- state.x77[,c("Illiteracy")]
> mydata <- as.data.frame(cbind(class, var))
> rm(class, var)
> library(npmc)
> summary(npmc(mydata), type="BF")

$'Data-structure'
              group.index   class.level nobs
Northeast               1     Northeast    9
South                   2         South   16
North Central           3 North Central   12
West                    4          West   13

$'Results of the multiple Behrens-Fisher-Test'
  cmp effect lower.cl upper.cl p.value.1s p.value.2s
1 1-2 0.8750  0.66149   1.0885   0.000665    0.00135
2 1-3 0.1898 -0.13797   0.5176   0.999999    0.06547
3 1-4 0.3974 -0.00554   0.8004   0.998030    0.92004
4 2-3 0.0104 -0.02060   0.0414   1.000000    0.00000
5 2-4 0.1875 -0.07923   0.4542   1.000000    0.02113
6 3-4 0.5641  0.18740   0.9408   0.797198    0.98430

> aggregate(mydata, by=list(mydata$class), median)
  Group.1 class  var
1       1     1 1.10
2       2     2 1.75
3       3     3 0.70
4       4     4 0.60


The npmc call generates six statistical comparisons (Northeast versus South, Northeast versus North Central, Northeast versus West, South versus North Central, South versus West, and North Central versus West). You can see from the two-sided p-values (p.value.2s) that the South differs significantly from the other three regions, and that the other three regions don’t differ from each other. The aggregate() output gives the median illiteracy rate for each region (coded 1 = Northeast, 2 = South, 3 = North Central, 4 = West); there you can see that the South has the highest median illiteracy rate. Note that npmc uses randomized values for integral calculations, so results differ slightly from call to call.


7.6 Visualizing group differences

In sections 7.4 and 7.5, we looked at statistical methods for comparing groups. Examining group differences visually is also a crucial part of a comprehensive data analysis strategy. It allows you to assess the magnitude of the differences, identify any distributional characteristics that influence the results (such as skew, bimodality, or outliers), and evaluate the appropriateness of the test assumptions. R provides a wide range of graphical methods for comparing groups, including box plots (simple, notched, and violin), covered in section 6.5; overlapping kernel density plots, covered in section 6.4.1; and graphical methods of assessing test assumptions, discussed in chapter 9.
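
For instance, here’s a minimal base-graphics sketch (illustrative only) that revisits the Southern versus non-Southern incarceration-probability comparison from section 7.5.1 with side-by-side box plots and overlapping kernel density curves:

# Sketch: visualizing the group difference in imprisonment probability
# between non-Southern (So=0) and Southern (So=1) states.
library(MASS)                                    # UScrime data
boxplot(Prob ~ So, data=UScrime,
        names=c("Non-South", "South"),
        ylab="Probability of imprisonment")      # side-by-side box plots

d0 <- density(UScrime$Prob[UScrime$So == 0])     # kernel density, non-Southern states
d1 <- density(UScrime$Prob[UScrime$So == 1])     # kernel density, Southern states
plot(d0, xlim=range(d0$x, d1$x), ylim=range(0, d0$y, d1$y),
     main="Prob by region", xlab="Probability of imprisonment")
lines(d1, lty=2)                                 # overlay the second density
legend("topright", legend=c("Non-South", "South"), lty=1:2)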

7.7 Summary

In this chapter, we reviewed the functions in R that provide basic statistical summaries and tests. We looked at sample statistics and frequency tables, tests of independence and measures of association for categorical variables, correlations between quantitative variables (and their associated significance tests), and comparisons of two or more groups on a quantitative outcome variable.

In the next chapter, we’ll explore simple and multiple regression, where the focus is on understanding relationships between one (simple) or more than one (multiple) predictor variables and a predicted or criterion variable. Graphical methods will help you diagnose potential problems, evaluate and improve the fit of your models, and uncover unexpected gems of information in your data.

Part 3

Intermediate methods

While part 2 covered basic graphical and statistical methods, part 3 offers coverage of intermediate methods. We move from describing the relationship between two variables to modeling the relationship between a numerical outcome variable and a set of numeric and/or categorical predictor variables.

Chapter 8 introduces regression methods for modeling the relationship between a numeric outcome variable and a set of one or more predictor variables. Modeling data is typically a complex, multistep, interactive process. Chapter 8 provides step-by-step coverage of the methods available for fitting linear models, evaluating their appropriateness, and interpreting their meaning.

Chapter 9 considers the analysis of basic experimental and quasi-experimental designs through the analysis of variance and its variants. Here we’re interested in how treatment combinations or conditions affect a numerical outcome variable. The chapter introduces the functions in R that are used to perform an analysis of variance, analysis of covariance, repeated measures analysis of variance, multifactor analysis of variance, and multivariate analysis of variance. Methods for assessing the appropriateness of these analyses and for visualizing the results are also discussed.

In designing experimental and quasi-experimental studies, it’s important to determine if the sample size is adequate for detecting the effects of interest (power analysis). Otherwise, why conduct the study? A detailed treatment of power analysis is provided in chapter 10. Starting with a discussion of hypothesis testing, the presentation focuses on how to use R functions to determine the sample size necessary to detect a treatment effect of a given size with a given degree of confidence. This can help you to plan studies that are likely to yield useful results.

Chapter 11 expands on the material in chapter 5 by covering the creation of graphs that help you to visualize relationships among two or more variables. This includes the various types of two- and three-dimensional scatter plots, scatter plot matrices, line plots, and bubble plots. It also introduces the useful, but less well-known, correlograms and mosaic plots.

The linear models described in chapters 8 and 9 assume that the outcome or response variable is not only numeric, but also randomly sampled from a normal distribution. There are situations where this distributional assumption is untenable. Chapter 12 presents analytic methods that work well in cases where data are sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is mathematically intractable. They include both resampling and bootstrapping approaches—computer intensive methods that are powerfully implemented in R. The methods described in this chapter will allow you to devise hypothesis tests for data that do not fit traditional parametric assumptions.

After completing part 3, you’ll have the tools to analyze most common data analytic problems encountered in practice. And you will be able to create some gorgeous graphs!

8 Regression

This chapter covers

Fitting and interpreting linear models
Evaluating model assumptions
Selecting among competing models

In many ways, regression analysis lives at the heart of statistics. It’s a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables). In general, regression analysis can be used to identify the explanatory variables that are related to a response variable, to describe the form of the relationships involved, and to provide an equation for predicting the response variable from the explanatory variables.

For example, an exercise physiologist might use regression analysis to develop an equation for predicting the expected number of calories a person will burn while exercising on a treadmill. The response variable is the number of calories burned (calculated from the amount of oxygen consumed), and the predictor variables might include duration of exercise (minutes), percentage of time spent at their target heart rate, average speed (mph), age (years), gender, and body mass index (BMI).


From a theoretical point of view, the analysis will help answer such questions as these:

What’s the relationship between exercise duration and calories burned? Is it linear or curvilinear? For example, does exercise have less impact on the number of calories burned after a certain point?

How does effort (the percentage of time at the target heart rate, the average walking speed) factor in?

Are these relationships the same for young and old, male and female, heavy and slim?

From a practical point of view, the analysis will help answer such questions as the following:

How many calories can a 30-year-old man with a BMI of 28.7 expect to burn if he walks for 45 minutes at an average speed of 4 miles per hour and stays within his target heart rate 80 percent of the time?

What’s the minimum number of variables you need to collect in order to accurately predict the number of calories a person will burn when walking?

How accurate will your prediction tend to be?

Because regression analysis plays such a central role in modern statistics, we'll cover it in some depth in this chapter. First, we’ll look at how to fit and interpret regression models. Next, we’ll review a set of techniques for identifying potential problems with these models and how to deal with them. Third, we’ll explore the issue of variable selection. Of all the potential predictor variables available, how do you decide which ones to include in your final model? Fourth, we’ll address the question of generalizability. How well will your model work when you apply it in the real world? Finally, we’ll look at the issue of relative importance. Of all the predictors in your model, which one is the most important, the second most important, and the least important?

As you can see, we’re covering a lot of ground. Effective regression analysis is an interactive, holistic process with many steps, and it involves more than a little skill. Rather than break it up into multiple chapters, I’ve opted to present this topic in a single chapter in order to capture this flavor. As a result, this will be the longest and most involved chapter in the book. Stick with it to the end and you’ll have all the tools you need to tackle a wide variety of research questions. Promise!

8.1 The many faces of regression

The term regression can be confusing because there are so many specialized varieties (see table 8.1). In addition, R has powerful and comprehensive features for fitting regression models, and the abundance of options can be confusing as well. For example, in 2005, Vito Ricci created a list of over 205 functions in R that are used to generate regression analyses (http://cran.r-project.org/doc/contrib/Ricci-refcard-regression.pdf).

 


Table 8.1 Varieties of regression analysis

Type of regression          Typical use

Simple linear               Predicting a quantitative response variable from a quantitative explanatory variable

Polynomial                  Predicting a quantitative response variable from a quantitative explanatory variable, where the relationship is modeled as an nth order polynomial

Multiple linear             Predicting a quantitative response variable from two or more explanatory variables

Multivariate                Predicting more than one response variable from one or more explanatory variables

Logistic                    Predicting a categorical response variable from one or more explanatory variables

Poisson                     Predicting a response variable representing counts from one or more explanatory variables

Cox proportional hazards    Predicting time to an event (death, failure, relapse) from one or more explanatory variables

Time-series                 Modeling time-series data with correlated errors

Nonlinear                   Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is nonlinear

Nonparametric               Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is derived from the data and not specified a priori

Robust                      Predicting a quantitative response variable from one or more explanatory variables using an approach that’s resistant to the effect of influential observations

In this chapter, we’ll focus on regression methods that fall under the rubric of ordinary least squares (OLS) regression, including simple linear regression, polynomial regression, and multiple linear regression. OLS regression is the most common variety of statistical analysis today. Other types of regression models (including logistic regression and Poisson regression) will be covered in chapter 13.
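
As a quick preview of the formula syntax involved (the data frame and variable names here are invented for illustration; fitting and interpreting these models is the subject of the rest of the chapter):

# Preview sketch with simulated data: the three OLS varieties named above,
# expressed as lm() formulas.
set.seed(42)
df <- data.frame(x1=rnorm(50), x2=rnorm(50))
df$y <- 1 + 2*df$x1 + 0.5*df$x2 + rnorm(50)

fit1 <- lm(y ~ x1, data=df)              # simple linear regression
fit2 <- lm(y ~ x1 + I(x1^2), data=df)    # polynomial (quadratic) regression
fit3 <- lm(y ~ x1 + x2, data=df)         # multiple linear regression
summary(fit3)                            # estimated weights (coefficients) and fit statistics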

8.1.1 Scenarios for using OLS regression

In OLS regression, a quantitative dependent variable is predicted from a weighted sum of predictor variables, where the weights are parameters estimated from the data. Let’s take a look at a concrete example (no pun intended), loosely adapted from Fwa (2006).

An engineer wants to identify the most important factors related to bridge deterioration (such as age, traffic volume, bridge design, construction materials and methods, construction quality, and weather conditions) and determine the
