

Figure 11.21 Correlogram of the correlations among the variables in the mtcars data frame. The lower triangle contains smoothed best fit lines and confidence ellipses, and the upper triangle contains scatter plots. The diagonal panel contains minimum and maximum values. Rows and columns have been reordered using principal components analysis.

Why do the scatter plots look odd?

Several of the variables plotted in figure 11.21 have a limited set of allowable values. For example, the number of gears is 3, 4, or 5. The number of cylinders is 4, 6, or 8. Both am (transmission type) and vs (V/S) are dichotomous. This explains the odd-looking scatter plots in the upper triangle.

Always be careful that the statistical methods you choose are appropriate to the form of the data. Specifying these variables as ordered or unordered factors can serve as a useful check. When R knows that a variable is categorical or ordinal, it attempts to apply statistical methods that are appropriate to that level of measurement.
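For example, here's a minimal sketch of such a check on the mtcars variables (the factor labels are illustrative choices, not part of the dataset):

df <- mtcars
df$cyl  <- factor(df$cyl)                    # unordered: 4, 6, or 8 cylinders
df$gear <- factor(df$gear, ordered=TRUE)     # ordered: 3 < 4 < 5 gears
df$am   <- factor(df$am, labels=c("automatic", "manual"))
df$vs   <- factor(df$vs, labels=c("V-engine", "straight"))
str(df[, c("cyl", "gear", "am", "vs")])      # confirm the new classes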


We’ll finish with one more example. The code

library(corrgram)
corrgram(mtcars, lower.panel=panel.shade, upper.panel=NULL,
         text.panel=panel.txt, main="Car Mileage Data (unsorted)")

produces the graph in figure 11.22. Here we’re using shading in the lower triangle, keeping the original variable order, and leaving the upper triangle blank.

Before moving on, I should point out that you can control the colors used by the corrgram() function. To do so, specify four colors in the colorRampPalette() function within the col.corrgram() function. Here’s an example:

library(corrgram)
col.corrgram <- function(ncol){
    colorRampPalette(c("darkgoldenrod4", "burlywood1",
                       "darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="A Corrgram (or Horse) of a Different Color")

Try it and see what you get.

Correlograms can be a useful way to examine large numbers of bivariate relationships among quantitative variables. Because they’re relatively new, the greatest challenge is to educate the recipient on how to interpret them.

To learn more, see Michael Friendly’s article “Corrgrams: Exploratory Displays for Correlation Matrices,” available at http://www.math.yorku.ca/SCS/Papers/corrgram.pdf.

Figure 11.22 Correlogram of the correlations among the variables in the mtcars data frame. The lower triangle is shaded to represent the magnitude and direction of the correlations. The variables are plotted in their original order.


11.4 Mosaic plots

Up to this point, we’ve been exploring methods of visualizing relationships among quantitative/continuous variables. But what if your variables are categorical? When you’re looking at a single categorical variable, you can use a bar or pie chart. If there are two categorical variables, you can look at a 3D bar chart (which, by the way, is not so easy to do in R). But what do you do if there are more than two categorical variables?

One approach is to use mosaic plots. In a mosaic plot, the frequencies in a multidimensional contingency table are represented by nested rectangular regions that are proportional to their cell frequency. Color and/or shading can be used to represent residuals from a fitted model. For details, see Meyer, Zeileis, and Hornik (2006), or Michael Friendly's Statistical Graphics page (http://datavis.ca). Steve Simon has created a good conceptual tutorial on how mosaic plots are created, available at http://www.childrensmercy.org/stats/definitions/mosaic.htm.

Mosaic plots can be created with the mosaic() function from the vcd library (there's a mosaicplot() function in the basic installation of R, but I recommend you use the vcd package for its more extensive features). As an example, consider the Titanic dataset available in the base installation. It describes the number of passengers who survived or died, cross-classified by their class (1st, 2nd, 3rd, Crew), sex (Male, Female), and age (Child, Adult). This is a well-studied dataset. You can see the cross-classification using the following code:

> ftable(Titanic)
                   Survived  No Yes
Class Sex    Age
1st   Male   Child            0   5
             Adult          118  57
      Female Child            0   1
             Adult            4 140
2nd   Male   Child            0  11
             Adult          154  14
      Female Child            0  13
             Adult           13  80
3rd   Male   Child           35  13
             Adult          387  75
      Female Child           17  14
             Adult           89  76
Crew  Male   Child            0   0
             Adult          670 192
      Female Child            0   0
             Adult            3  20

The mosaic() function can be invoked as

mosaic(table)

where table is a contingency table in array form, or

mosaic(formula, data=)

where formula is a standard R formula, and data specifies either a data frame or table. Adding the option shade=TRUE will color the figure based on Pearson residuals from a fitted model (independence by default), and the option legend=TRUE will display a legend for these residuals.

For example, both

library(vcd)

mosaic(Titanic, shade=TRUE, legend=TRUE)

and

library(vcd)

mosaic(~Class+Sex+Age+Survived, data=Titanic, shade=TRUE, legend=TRUE)

will produce the graph shown in figure 11.23. The formula version gives you greater control over the selection and placement of variables in the graph.
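For example, here's a quick sketch of that control: omitting variables from the formula collapses the table over them, so the following call focuses on class versus survival alone.

library(vcd)
mosaic(~Class + Survived, data=Titanic, shade=TRUE, legend=TRUE)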

There’s a great deal of information packed into this one picture. For example, as one moves from crew to first class, the survival rate increases precipitously. Most children were in third and second class. Most females in first class survived, whereas only about half the females in third class survived. There were few females in the crew, causing the Survived labels (No, Yes at the bottom of the chart) to overlap for this group. Keep looking and you’ll see many more interesting facts. Remember to look at the relative widths and heights of the rectangles. What else can you learn about that night?

Figure 11.23 Mosaic plot describing Titanic survivors by class, sex, and age. The legend shows Pearson residuals ranging from -10.8 to 25.7, with a p-value < 2.22e-16.


Extended mosaic plots add color and shading to represent the residuals from a fitted model. In this example, the blue shading indicates cross-classifications that occur more often than expected, assuming that survival is unrelated to class, gender, and age. Red shading indicates cross-classifications that occur less often than expected under the independence model. Be sure to run the example so that you can see the results in color. The graph indicates that more first-class women survived and more male crew members died than would be expected under an independence model. Fewer third-class men survived than would be expected if survival was independent of class, gender, and age. If you would like to explore mosaic plots in greater detail, try running example(mosaic).
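If you'd like to see the numbers behind the shading, here's a minimal sketch that fits the mutual-independence model with base R's loglin() function and computes the Pearson residuals by hand:

# Fit independence (all one-way margins) and compare observed counts
# with the fitted expected counts; the shading reflects these residuals
fit <- loglin(Titanic, margin=list(1, 2, 3, 4), fit=TRUE, print=FALSE)
round((Titanic - fit$fit) / sqrt(fit$fit), 1)    # Pearson residuals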

11.5 Summary

In this chapter, we considered a wide range of techniques for displaying relationships among two or more variables. This included the use of 2D and 3D scatter plots, scatter plot matrices, bubble plots, line plots, correlograms, and mosaic plots. Some of these methods are standard techniques, while some are relatively new.

Taken together with methods that allow you to customize graphs (chapter 3), display univariate distributions (chapter 6), explore regression models (chapter 8), and visualize group differences (chapter 9), you now have a comprehensive toolbox for visualizing and extracting meaning from your data.

In later chapters, you’ll expand your skills with additional specialized techniques, including graphics for latent variable models (chapter 14), methods for visualizing missing data patterns (chapter 15), and techniques for creating graphs that are conditioned on one or more variables (chapter 16).

In the next chapter, we'll explore resampling statistics and bootstrapping. These are computer-intensive methods that allow you to analyze data in new and unique ways.

12 Resampling statistics and bootstrapping

This chapter covers

Understanding the logic of permutation tests

Applying permutation tests to linear models

Using bootstrapping to obtain confidence intervals

In chapters 7, 8, and 9, we reviewed statistical methods that test hypotheses and estimate confidence intervals for population parameters by assuming that the observed data is sampled from a normal distribution or some other well-known theoretical distribution. But there will be many cases in which this assumption is unwarranted. Statistical approaches based on randomization and resampling can be used in cases where the data is sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is too complex and mathematically intractable.

In this chapter, we’ll explore two broad statistical approaches that use randomization: permutation tests and bootstrapping. Historically, these methods were only available to experienced programmers and expert statisticians. Contributed packages in R now make them readily available to a wider audience of data analysts.


Table 12.1 Hypothetical two-group problem

Treatment A   Treatment B
         40            57
         57            64
         45            55
         55            62
         58            65


We’ll also revisit problems that were initially analyzed using traditional methods (for example, t-tests, chi-square tests, ANOVA, regression) and see how they can be approached using these robust, computer-intensive methods. To get the most out of section 12.2, be sure to read chapter 7 first. Chapters 8 and 9 serve as prerequisites for section 12.3. Other sections can be read on their own.

12.1 Permutation tests

Permutation tests, also called randomization or re-randomization tests, have been around for decades, but it took the advent of high-speed computers to make them practically available.

To understand the logic of a permutation test, consider the following hypothetical problem. Ten subjects have been randomly assigned to one of two treatment conditions (A or B) and an outcome variable (score) has been recorded. The results of the experiment are presented in table 12.1.

The data are also displayed in the strip chart in figure 12.1. Is there enough evidence to conclude that the treatments differ in their impact?

In a parametric approach, you might assume that the data are sampled from normal populations with equal variances and apply a two-tailed independent groups t-test. The null hypothesis is that the population mean for treatment A is equal to the population mean for treatment B. You’d calculate a t-statistic from the data and compare it to the theoretical distribution. If the observed t-statistic is sufficiently extreme, say outside the middle 95 percent of values in the theoretical distribution, you’d reject the null hypothesis and declare that the population means for the two groups are unequal at the 0.05 level of significance.

Figure 12.1 Strip chart of the hypothetical treatment data in table 12.1


A permutation test takes a different approach. If the two treatments are truly equivalent, the label (Treatment A or Treatment B) assigned to an observed score is arbitrary. To test for differences between the two treatments, we could follow these steps:

1. Calculate the observed t-statistic, as in the parametric approach; call this t0.

2. Place all 10 scores in a single group.

3. Randomly assign five scores to Treatment A and five scores to Treatment B.

4. Calculate and record the new observed t-statistic.

5. Repeat steps 3–4 for every possible way of assigning five scores to Treatment A and five scores to Treatment B. There are 252 such possible arrangements.

6. Arrange the 252 t-statistics in ascending order. This is the empirical distribution, based on (or conditioned on) the sample data.

7. If t0 falls outside the middle 95 percent of the empirical distribution, reject the null hypothesis that the population means for the two treatment groups are equal at the 0.05 level of significance.

Notice that the same t-statistic is calculated in both the permutation and parametric approaches. But instead of comparing the statistic to a theoretical distribution in order to determine if it was extreme enough to reject the null hypothesis, it’s compared to an empirical distribution created from permutations of the observed data. This logic can be extended to most classical statistical tests and linear models.
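To make the logic concrete, here's a minimal sketch of steps 1 through 7 carried out by hand in base R, using the data from table 12.1:

score     <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
treatment <- factor(c(rep("A", 5), rep("B", 5)))

# Step 1: the observed t-statistic, t0
t0 <- t.test(score ~ treatment, var.equal=TRUE)$statistic

# Steps 2-6: every way of assigning 5 of the 10 scores to Treatment A
# (choose(10, 5) = 252 arrangements), recording a t-statistic for each
assignments <- combn(10, 5)
perm.t <- apply(assignments, 2, function(i) {
    g <- factor(ifelse(seq_along(score) %in% i, "A", "B"))
    t.test(score ~ g, var.equal=TRUE)$statistic
})

# Step 7: two-tailed p-value -- the proportion of arrangements with a
# t-statistic at least as extreme as the observed one
mean(abs(perm.t) >= abs(t0))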

In the previous example, the empirical distribution was based on all possible permutations of the data. In such cases, the permutation test is called an “exact” test. As the sample sizes increase, the time required to form all possible permutations can become prohibitive. In such cases, you can use Monte Carlo simulation to sample from all possible permutations. Doing so provides an approximate test.
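Here's a sketch of the Monte Carlo variant, which shuffles the labels a fixed number of times (9,999 here, an arbitrary choice) instead of enumerating all 252 arrangements:

score     <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
treatment <- factor(c(rep("A", 5), rep("B", 5)))
t0 <- t.test(score ~ treatment, var.equal=TRUE)$statistic

set.seed(1234)                       # fix the seed for reproducibility
perm.t <- replicate(9999, {
    g <- sample(treatment)           # randomly reshuffle the group labels
    t.test(score ~ g, var.equal=TRUE)$statistic
})
mean(abs(perm.t) >= abs(t0))         # approximate two-tailed p-value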

If you’re uncomfortable assuming that the data is normally distributed, concerned about the impact of outliers, or feel that the dataset is too small for standard parametric approaches, a permutation test provides an excellent alternative.

R has some of the most comprehensive and sophisticated packages for performing permutation tests currently available. The remainder of this section focuses on two contributed packages: the coin package and the lmPerm package. Be sure to install them before first use:

install.packages(c("coin", "lmPerm"))

The coin package provides a comprehensive framework for permutation tests applied to independence problems, whereas the lmPerm package provides permutation tests for ANOVA and regression designs. We’ll consider each in turn, and end the section with a quick review of other permutation packages available in R.

Before moving on, it’s important to remember that permutation tests use pseudorandom numbers to sample from all possible permutations (when performing an approximate test). Therefore, the results will change each time the test is performed. Setting the random number seed in R allows you to fix the random numbers generated.


This is particularly useful when you want to share your examples with others, because results will always be the same if the calls are made with the same seed. Setting the random number seed to 1234 (that is, set.seed(1234)) will allow you to replicate the results presented in this chapter.

12.2 Permutation test with the coin package

The coin package provides a general framework for applying permutation tests to independence problems. With this package, we can answer such questions as

Are responses independent of group assignment?

Are two numeric variables independent?

Are two categorical variables independent?

Using convenience functions provided in the package (see table 12.2), we can perform permutation test equivalents for most of the traditional statistical tests covered in chapter 7.

Table 12.2 coin functions providing permutation test alternatives to traditional tests

Test                                             coin function
Two- and K-sample permutation test               oneway_test(y ~ A)
Two- and K-sample permutation test with a
  stratification (blocking) factor               oneway_test(y ~ A | C)
Wilcoxon–Mann–Whitney rank-sum test              wilcox_test(y ~ A)
Kruskal–Wallis test                              kruskal_test(y ~ A)
Pearson's chi-square test                        chisq_test(A ~ B)
Cochran–Mantel–Haenszel test                     cmh_test(A ~ B | C)
Linear-by-linear association test                lbl_test(D ~ E)
Spearman's test                                  spearman_test(y ~ x)
Friedman test                                    friedman_test(y ~ A | C)
Wilcoxon signed-rank test                        wilcoxsign_test(y1 ~ y2)

In the coin function column, y and x are numeric variables, A and B are categorical factors, C is a categorical blocking variable, D and E are ordered factors, and y1 and y2 are matched numeric variables.

Each of the functions listed in table 12.2 takes the form

function_name( formula, data, distribution= )

where

formula describes the relationship among variables to be tested. Examples are given in the table.

data identifies a data frame.

distribution specifies how the empirical distribution under the null hypothesis should be derived. Possible values are exact, asymptotic, and approximate.


If distribution="exact", the distribution under the null hypothesis is computed exactly (that is, from all possible permutations). The distribution can also be approximated by its asymptotic distribution (distribution="asymptotic") or via Monte Carlo resampling (distribution=approximate(B=#), where # indicates the number of replications used to approximate the exact distribution). At present, distribution="exact" is only available for two-sample problems.
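For example, here's a brief sketch of the same test under each option, using the data from table 12.1 (B=9999 is an arbitrary number of replications; recent versions of coin spell this argument nresample):

library(coin)
score     <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
treatment <- factor(c(rep("A", 5), rep("B", 5)))
mydata    <- data.frame(treatment, score)

oneway_test(score ~ treatment, data=mydata, distribution="exact")
oneway_test(score ~ treatment, data=mydata, distribution="asymptotic")
set.seed(1234)
oneway_test(score ~ treatment, data=mydata,
            distribution=approximate(B=9999))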

NOTE In the coin package, categorical variables and ordinal variables must be coded as factors and ordered factors, respectively. Additionally, the data must be stored in a data frame.

In the remainder of this section, we’ll apply several of the permutation tests described in table 12.2 to problems from previous chapters. This will allow you to compare the results with more traditional parametric and nonparametric approaches. We’ll end this discussion of the coin package by considering advanced extensions.

12.2.1 Independent two-sample and k-sample tests

To begin, compare an independent samples t-test with a one-way exact test applied to the hypothetical data in table 12.1. The results are given in the following listing.

Listing 12.1 t-test versus one-way permutation test for the hypothetical data

> library(coin)
> score <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
> treatment <- factor(c(rep("A",5), rep("B",5)))
> mydata <- data.frame(treatment, score)
> t.test(score~treatment, data=mydata, var.equal=TRUE)

        Two Sample t-test

data:  score by treatment
t = -2.3, df = 8, p-value = 0.04705
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -19.04  -0.16
sample estimates:
mean in group A mean in group B
             51              61

> oneway_test(score~treatment, data=mydata, distribution="exact")

        Exact 2-Sample Permutation Test

data:  score by treatment (A, B)
Z = -1.9, p-value = 0.07143
alternative hypothesis: true mu is not equal to 0

The traditional t-test indicates a significant group difference (p < .05), whereas the exact test doesn't (p = 0.071). With only 10 observations, I'd be more inclined to trust the results of the permutation test and attempt to collect more data before reaching a final conclusion.
