Robert I. Kabacoff - R in Action
CHAPTER 11 Intermediate graphs
Figure 11.21 Correlogram of the correlations among the variables in the mtcars data frame. The lower triangle contains smoothed best fit lines and confidence ellipses, and the upper triangle contains scatter plots. The diagonal panel contains minimum and maximum values. Rows and columns have been reordered using principal components analysis.
Why do the scatter plots look odd?
Several of the variables that are plotted in figure 11.21 have limited allowable values. For example, the number of gears is 3, 4, or 5. The number of cylinders is 4, 6, or 8. Both am (transmission type) and vs (V/S) are dichotomous. This explains the odd-looking scatter plots in the upper triangle.
Always be careful that the statistical methods you choose are appropriate to the form of the data. Specifying these variables as ordered or unordered factors can serve as a useful check. When R knows that a variable is categorical or ordinal, it attempts to apply statistical methods that are appropriate to that level of measurement.
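For instance, a quick base R sketch of this kind of recoding (the choice of variables here is illustrative, not exhaustive):

```r
# Recode categorical/ordinal mtcars columns so R treats them appropriately
df <- mtcars
df$cyl  <- ordered(df$cyl)      # ordinal: 4 < 6 < 8 cylinders
df$gear <- ordered(df$gear)     # ordinal: 3 < 4 < 5 gears
df$am   <- factor(df$am, levels=c(0, 1),
                  labels=c("automatic", "manual"))   # dichotomous
str(df[, c("cyl", "gear", "am")])
```

With the variables coded this way, many R functions will refuse to (or warn before they) apply methods meant for continuous data.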
We’ll finish with one more example. The code
library(corrgram)
corrgram(mtcars, lower.panel=panel.shade,
         upper.panel=NULL, text.panel=panel.txt,
         main="Car Mileage Data (unsorted)")
produces the graph in figure 11.22. Here we’re using shading in the lower triangle, keeping the original variable order, and leaving the upper triangle blank.
Before moving on, I should point out that you can control the colors used by the corrgram() function. To do so, specify four colors in the colorRampPalette() function within the col.corrgram() function. Here’s an example:
library(corrgram)
col.corrgram <- function(ncol){
    colorRampPalette(c("darkgoldenrod4", "burlywood1",
                       "darkkhaki", "darkgreen"))(ncol)}
corrgram(mtcars, order=TRUE, lower.panel=panel.shade,
         upper.panel=panel.pie, text.panel=panel.txt,
         main="A Corrgram (or Horse) of a Different Color")
Try it and see what you get.
Correlograms can be a useful way to examine large numbers of bivariate relationships among quantitative variables. Because they’re relatively new, the greatest challenge is to educate the recipient on how to interpret them.
To learn more, see Michael Friendly’s article “Corrgrams: Exploratory Displays for Correlation Matrices,” available at http://www.math.yorku.ca/SCS/Papers/corrgram.pdf.
Figure 11.22 Correlogram of the correlations among the variables in the mtcars data frame. The lower triangle is shaded to represent the magnitude and direction of the correlations. The variables are plotted in their original order.
Extended mosaic plots add color and shading to represent the residuals from a fitted model. In this example, the blue shading indicates cross-classifications that occur more often than expected, assuming that survival is unrelated to class, gender, and age. Red shading indicates cross-classifications that occur less often than expected under the independence model. Be sure to run the example so that you can see the results in color. The graph indicates that more first-class women survived and more male crew members died than would be expected under an independence model. Fewer third-class men survived than would be expected if survival was independent of class, gender, and age. If you would like to explore mosaic plots in greater detail, try running example(mosaic).
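The extended mosaic plot described here can be produced with the vcd package (assumed to be installed), where shade=TRUE colors each tile by its Pearson residual from the independence model:

```r
library(vcd)       # provides mosaic(); install.packages("vcd") if needed

ftable(Titanic)    # the 4-way table: Class x Sex x Age x Survived

# Shade tiles by Pearson residuals from the model of mutual independence;
# blue = more frequent than expected, red = less frequent
mosaic(Titanic, shade=TRUE, legend=TRUE)
```

Base R's mosaicplot(Titanic, shade=TRUE) produces a similar residual-shaded display if you prefer to avoid the extra package.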
11.5 Summary
In this chapter, we considered a wide range of techniques for displaying relationships among two or more variables. This included the use of 2D and 3D scatter plots, scatter plot matrices, bubble plots, line plots, correlograms, and mosaic plots. Some of these methods are standard techniques, while some are relatively new.
Taken together with methods that allow you to customize graphs (chapter 3), display univariate distributions (chapter 6), explore regression models (chapter 8), and visualize group differences (chapter 9), you now have a comprehensive toolbox for visualizing and extracting meaning from your data.
In later chapters, you’ll expand your skills with additional specialized techniques, including graphics for latent variable models (chapter 14), methods for visualizing missing data patterns (chapter 15), and techniques for creating graphs that are conditioned on one or more variables (chapter 16).
In the next chapter, we’ll explore resampling statistics and bootstrapping. These are computer intensive methods that allow you to analyze data in new and unique ways.
Permutation tests
A permutation test takes a different approach. If the two treatments are truly equivalent, the label (Treatment A or Treatment B) assigned to an observed score is arbitrary. To test for differences between the two treatments, we could follow these steps:
1. Calculate the observed t-statistic, as in the parametric approach; call this t0.
2. Place all 10 scores in a single group.
3. Randomly assign five scores to Treatment A and five scores to Treatment B.
4. Calculate and record the new observed t-statistic.
5. Repeat steps 3–4 for every possible way of assigning five scores to Treatment A and five scores to Treatment B. There are 252 such possible arrangements.
6. Arrange the 252 t-statistics in ascending order. This is the empirical distribution, based on (or conditioned on) the sample data.
7. If t0 falls outside the middle 95 percent of the empirical distribution, reject the null hypothesis that the population means for the two treatment groups are equal at the 0.05 level of significance.
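The steps above can be sketched in base R, using the scores from listing 12.1. As a convenience, the difference in group means replaces the t-statistic here; with fixed group sizes the two are monotonically related, so the permutation p-value is the same.

```r
score <- c(40, 57, 45, 55, 58,    # Treatment A
           57, 64, 55, 62, 65)    # Treatment B

obs <- mean(score[1:5]) - mean(score[6:10])  # observed statistic

# Every way of choosing 5 of the 10 scores for Treatment A:
# choose(10, 5) = 252 arrangements
idx  <- combn(10, 5)
perm <- apply(idx, 2, function(i) mean(score[i]) - mean(score[-i]))

# Two-sided p-value: proportion of arrangements at least as extreme
# as the observed one (the observed arrangement is included)
p <- mean(abs(perm) >= abs(obs))
c(statistic = obs, n.perms = ncol(idx), p.value = p)
```

This enumeration reproduces the "exact" test; the coin package, introduced below, handles all of this (and much more) for you.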
Notice that the same t-statistic is calculated in both the permutation and parametric approaches. But instead of comparing the statistic to a theoretical distribution in order to determine if it was extreme enough to reject the null hypothesis, it’s compared to an empirical distribution created from permutations of the observed data. This logic can be extended to most classical statistical tests and linear models.
In the previous example, the empirical distribution was based on all possible permutations of the data. In such cases, the permutation test is called an “exact” test. As the sample sizes increase, the time required to form all possible permutations can become prohibitive. In such cases, you can use Monte Carlo simulation to sample from all possible permutations. Doing so provides an approximate test.
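A Monte Carlo version of the same idea (a sketch of the concept, not the coin implementation) samples random relabelings instead of enumerating all 252:

```r
set.seed(1234)                  # fix the seed for reproducibility
score <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
obs   <- mean(score[1:5]) - mean(score[6:10])

B <- 9999                       # number of random relabelings
perm <- replicate(B, {
  i <- sample(10, 5)            # random assignment to Treatment A
  mean(score[i]) - mean(score[-i])
})

# Add-one correction counts the observed arrangement itself
p <- (sum(abs(perm) >= abs(obs)) + 1) / (B + 1)
p                               # approximates the exact p-value
```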
If you’re uncomfortable assuming that the data is normally distributed, concerned about the impact of outliers, or feel that the dataset is too small for standard parametric approaches, a permutation test provides an excellent alternative.
R has some of the most comprehensive and sophisticated packages for performing permutation tests currently available. The remainder of this section focuses on two contributed packages: the coin package and the lmPerm package. Be sure to install them before first use:
install.packages(c("coin", "lmPerm"))
The coin package provides a comprehensive framework for permutation tests applied to independence problems, whereas the lmPerm package provides permutation tests for ANOVA and regression designs. We’ll consider each in turn, and end the section with a quick review of other permutation packages available in R.
Before moving on, it’s important to remember that permutation tests use pseudorandom numbers to sample from all possible permutations (when performing an approximate test). Therefore, the results will change each time the test is performed. Setting the random number seed in R allows you to fix the random numbers generated.
CHAPTER 12 Resampling statistics and bootstrapping
This is particularly useful when you want to share your examples with others, because results will always be the same if the calls are made with the same seed. Setting the random number seed to 1234 (that is, set.seed(1234)) will allow you to replicate the results presented in this chapter.
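For example, two draws made with the same seed are identical:

```r
set.seed(1234)
first  <- sample(10, 5)
set.seed(1234)                  # same seed ...
second <- sample(10, 5)         # ... same "random" draw
identical(first, second)        # TRUE
```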
12.2 Permutation test with the coin package
The coin package provides a general framework for applying permutation tests to independence problems. With this package, we can answer such questions as
■ Are responses independent of group assignment?
■ Are two numeric variables independent?
■ Are two categorical variables independent?
Using convenience functions provided in the package (see table 12.2), we can perform permutation test equivalents for most of the traditional statistical tests covered in chapter 7.
Table 12.2 coin functions providing permutation test alternatives to traditional tests

Test                                                coin function
Two- and K-sample permutation test                  oneway_test(y ~ A)
Two- and K-sample permutation test with a           oneway_test(y ~ A | C)
  stratification (blocking) factor
Wilcoxon–Mann–Whitney rank-sum test                 wilcox_test(y ~ A)
Kruskal–Wallis test                                 kruskal_test(y ~ A)
Pearson's chi-square test                           chisq_test(A ~ B)
Cochran–Mantel–Haenszel test                        cmh_test(A ~ B | C)
Linear-by-linear association test                   lbl_test(D ~ E)
Spearman's test                                     spearman_test(y ~ x)
Friedman test                                       friedman_test(y ~ A | C)
Wilcoxon signed-rank test                           wilcoxsign_test(y1 ~ y2)
In the coin function column, y and x are numeric variables, A and B are categorical factors, C is a categorical blocking variable, D and E are ordered factors, and y1 and y2 are matched numeric variables.
Each of the functions listed in table 12.2 takes the form
function_name( formula, data, distribution= )
where
■ formula describes the relationship among variables to be tested. Examples are given in the table.
■ data identifies a data frame.
■ distribution specifies how the empirical distribution under the null hypothesis should be derived. Possible values are exact, asymptotic, and approximate.
If distribution="exact", the distribution under the null hypothesis is computed exactly (that is, from all possible permutations). The distribution can also be approximated by its asymptotic distribution (distribution="asymptotic") or via Monte Carlo resampling (distribution=approximate(B=#), where # indicates the number of replications used to approximate the exact distribution). At present, distribution="exact" is only available for two-sample problems.
NOTE In the coin package, categorical variables and ordinal variables must be coded as factors and ordered factors, respectively. Additionally, the data must be stored in a data frame.
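For example, you might prepare data for the coin functions along these lines (the variable names here are illustrative):

```r
# Illustrative recoding before calling coin functions: categorical
# variables as factors, ordinal variables as ordered factors, and
# everything gathered into a data frame
dat <- data.frame(
  rating  = ordered(c(1, 3, 2, 3, 1), levels = 1:3,
                    labels = c("low", "medium", "high")),  # ordinal
  group   = factor(c("ctrl", "trt", "ctrl", "trt", "trt")), # categorical
  outcome = c(2.1, 3.5, 2.7, 4.2, 3.9)                      # numeric
)
str(dat)
```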
In the remainder of this section, we’ll apply several of the permutation tests described in table 12.2 to problems from previous chapters. This will allow you to compare the results with more traditional parametric and nonparametric approaches. We’ll end this discussion of the coin package by considering advanced extensions.
12.2.1 Independent two-sample and k-sample tests
To begin, compare an independent samples t-test with a one-way exact test applied to the hypothetical data in table 12.1. The results are given in the following listing.
Listing 12.1 t-test versus one-way permutation test for the hypothetical data
> library(coin)
> score <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65)
> treatment <- factor(c(rep("A",5), rep("B",5)))
> mydata <- data.frame(treatment, score)
> t.test(score~treatment, data=mydata, var.equal=TRUE)

        Two Sample t-test

data:  score by treatment
t = -2.3, df = 8, p-value = 0.04705
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -19.04  -0.16
sample estimates:
mean in group A mean in group B
           51.0            60.6

> oneway_test(score~treatment, data=mydata, distribution="exact")

        Exact 2-Sample Permutation Test

data:  score by treatment (A, B)
Z = -1.9, p-value = 0.07143
alternative hypothesis: true mu is not equal to 0
The traditional t-test indicates a significant group difference (p < .05), whereas the exact test doesn't (p = 0.071). With only 10 observations, I'd be more inclined to trust the results of the permutation test and attempt to collect more data before reaching a final conclusion.