
c Descriptive statistics

        West  North Central  Northeast  South
n         13             12          9     16
median  0.60           0.70        1.1   1.75
mad     0.15           0.15        0.3   0.59

Multiple Comparisons (Wilcoxon Rank Sum Tests)
Probability Adjustment = holm

d Pairwise comparisons

  Group.1        Group.2         W        p
1 West           North Central  88  8.7e-01
2 West           Northeast      46  8.7e-01
3 West           South          39  1.8e-02  *
4 North Central  Northeast      20  5.4e-02  .
5 North Central  South           2  8.1e-05  ***
6 Northeast      South          18  1.2e-02  *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The source() function downloads and executes the R script defining the wmc() function b. The function’s format is wmc(y ~ A, data, method), where y is a numeric outcome variable, A is a grouping variable, data is the data frame containing these variables, and method is the approach used to limit Type I errors. Listing 7.17 uses an adjustment method developed by Holm (1979). It provides strong control of the family-wise error rate (the probability of making one or more Type I errors in a set of comparisons). See help(p.adjust) for a description of the other methods available.
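For orientation, here is a minimal sketch of the kind of call that produces output like the one above, following the pattern of the chapter’s states example; the exact script URL is an assumption, so substitute the location given in the book’s listing.

source("http://www.statmethods.net/RiA/wmc.txt")            # URL assumed; downloads and defines wmc()
states <- data.frame(state.region, state.x77)               # region factor plus the state.x77 measures
wmc(Illiteracy ~ state.region, data=states, method="holm")  # pairwise Wilcoxon tests, Holm adjustment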

The wmc() function first provides the sample sizes, medians, and median absolute deviations for each group c. The West has the lowest illiteracy rate, and the South has the highest. The function then generates six statistical comparisons (West versus North Central, West versus Northeast, West versus South, North Central versus Northeast, North Central versus South, and Northeast versus South) d. You can see from the two-sided p-values (p) that the South differs significantly from the other three regions and that the other three regions don’t differ from each other at a p < .05 level.
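As a rough base-R cross-check (not the book’s wmc() function), the built-in pairwise.wilcox.test() runs the same pairwise Wilcoxon rank-sum tests with a Holm adjustment, though its output layout differs and it omits the descriptive statistics. Reusing the states data frame from the sketch above:

# Base-R analogue: pairwise Wilcoxon tests with Holm-adjusted p-values
with(states, pairwise.wilcox.test(Illiteracy, state.region, p.adjust.method="holm"))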

Nonparametric multiple comparisons are a useful set of techniques that aren’t easily accessible in R. In chapter 21, you’ll have an opportunity to expand the wmc() function into a fully developed package that includes error checking and informative graphics.

7.6 Visualizing group differences

In sections 7.4 and 7.5, we looked at statistical methods for comparing groups. Examining group differences visually is also a crucial part of a comprehensive data-analysis strategy. It allows you to assess the magnitude of the differences, identify any distributional characteristics that influence the results (such as skew, bimodality, or outliers), and evaluate the appropriateness of the test assumptions. R provides a wide range of graphical methods for comparing groups, including box plots (simple, notched, and violin), covered in section 6.5; overlapping kernel density plots, covered in section 6.4.1; and graphical methods for visualizing outcomes in an ANOVA framework, discussed in chapter 9. Advanced methods for visualizing group differences, including grouping and faceting, are discussed in chapter 19.
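As a quick illustration (a sketch using base R graphics and the sm package covered in section 6.4.1, not one of the book’s listings), the regional illiteracy differences examined above can be inspected visually:

# Side-by-side box plots of illiteracy by region
states <- data.frame(state.region, state.x77)
boxplot(Illiteracy ~ state.region, data=states, ylab="Illiteracy rate")

# Overlapping kernel density estimates, one curve per region
# (requires the sm package to be installed)
library(sm)
sm.density.compare(states$Illiteracy, states$state.region, xlab="Illiteracy rate")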


7.7 Summary

In this chapter, we reviewed the functions in R that provide basic statistical summaries and tests. We looked at sample statistics and frequency tables, tests of independence and measures of association for categorical variables, correlations between quantitative variables (and their associated significance tests), and comparisons of two or more groups on a quantitative outcome variable.

In the next chapter, we’ll explore simple and multiple regression, where the focus is on understanding the relationship between one (simple) or more than one (multiple) predictor variable and a predicted or criterion variable. Graphical methods will help you diagnose potential problems, evaluate and improve the fit of your models, and uncover unexpected gems of information in your data.

Part 3

Intermediate methods

Whereas part 2 of this book covered basic graphical and statistical methods, part 3 discusses intermediate methods. We move from describing the relationship between two variables to, in chapter 8, using regression models to model the relationship between a numerical outcome variable and a set of numeric and/or categorical predictor variables. Modeling data is typically a complex, multistep, interactive process. Chapter 8 provides step-by-step coverage of the methods available for fitting linear models, evaluating their appropriateness, and interpreting their meaning.

Chapter 9 considers the analysis of basic experimental and quasi-experimental designs through the analysis of variance and its variants. Here we’re interested in how treatment combinations or conditions affect a numerical outcome variable. The chapter introduces the functions in R that are used to perform an analysis of variance, analysis of covariance, repeated measures analysis of variance, multifactor analysis of variance, and multivariate analysis of variance. Methods for assessing the appropriateness of these analyses and visualizing the results are also discussed.

In designing experimental and quasi-experimental studies, it’s important to determine whether the sample size is adequate for detecting the effects of interest (power analysis). Otherwise, why conduct the study? A detailed treatment of power analysis is provided in chapter 10. Starting with a discussion of hypothesis testing, the presentation focuses on how to use R functions to determine the sample size necessary to detect a treatment effect of a given size with a given degree of confidence. This can help you to plan studies that are likely to yield useful results.


Chapter 11 expands on the material in chapter 5 by covering the creation of graphs that help you to visualize relationships among two or more variables. This includes the various types of 2D and 3D scatter plots, scatter-plot matrices, line plots, and bubble plots. It also introduces the very useful, but less well-known, corrgrams and mosaic plots.

The linear models described in chapters 8 and 9 assume that the outcome or response variable is not only numeric, but also randomly sampled from a normal distribution. There are situations where this distributional assumption is untenable. Chapter 12 presents analytic methods that work well in cases where data is sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is mathematically intractable. They include both resampling and bootstrapping approaches—computer-intensive methods that are powerfully implemented in R. The methods described in this chapter will allow you to devise hypothesis tests for data that don’t fit traditional parametric assumptions.

After completing part 3, you’ll have the tools to analyze most common data-analytic problems encountered in practice. And you’ll be able to create some gorgeous graphs!
