R in Action, Second Edition.pdf

Figure 19.7 A combined violin and box plot graph of singer heights by voice part (axes: voice.part and height)

In the remainder of this chapter, you’ll use geoms to create a wide range of graph types. Let’s start with grouping—the representation of more than one group of observations in a single graph.

19.4 Grouping

In order to understand data, it’s often helpful to plot two or more groups of observations on the same graph. In R, the groups are usually defined as the levels of a categorical variable (factor). Grouping is accomplished in ggplot2 graphs by associating one or more grouping variables with visual characteristics such as shape, color, fill, size, and line type. The aes() function in the ggplot() statement assigns variables to roles (visual characteristics of the plot), so this is a natural place to assign grouping variables.

Let’s use grouping to explore the Salaries dataset. The dataframe contains information on the salaries of university professors collected during the 2008–2009 academic year. Variables include rank (AsstProf, AssocProf, Prof), sex (Female, Male), yrs.since.phd (years since Ph.D.), yrs.service (years of service), and salary (nine-month salary in dollars).

First, you can ask how salaries vary by academic rank. The code

data(Salaries, package="carData")   # with car versions before 3.0, use package="car"
library(ggplot2)
ggplot(data=Salaries, aes(x=salary, fill=rank)) +
  geom_density(alpha=.3)

CHAPTER 19 Advanced graphics with ggplot2

Figure 19.8 Density plots of university salaries, grouped by academic rank (x-axis: salary; y-axis: density; fill legend: rank)

plots three density curves in the same graph (one for each level of academic rank) and distinguishes them by fill color. The fills are made partially transparent (alpha=.3) so that the overlapping curves don’t obscure each other; where curves overlap, the colors blend, which keeps the shared regions visible. The plot is given in figure 19.8. Note that a legend is produced automatically. In section 19.7.2, you’ll learn how to customize the legend generated for grouped data.

Salary increases by rank, but there is significant overlap, with some associate and full professors earning the same as assistant professors. As rank increases, so does the range of salaries. This is especially true for full professors, who have wide variation in their incomes. Placing all three distributions in the same graph facilitates comparisons among the groups.
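The visual impression can be checked with a few summary statistics. The following is a minimal sketch, assuming the carData package is installed (it supplies Salaries for current releases of car); the interquartile range is used here as a stand-in for the spread of each group, which is a choice made for this illustration rather than something the text prescribes.

```r
# Assumption: Salaries is loaded from carData (older car releases shipped it directly).
data(Salaries, package = "carData")

# Median salary and interquartile range within each academic rank
rank_median <- tapply(Salaries$salary, Salaries$rank, median)
rank_iqr    <- tapply(Salaries$salary, Salaries$rank, IQR)

# One row per rank, in factor-level order (AsstProf, AssocProf, Prof)
print(round(cbind(median = rank_median, IQR = rank_iqr)))
```

If the density plot is read correctly, the medians should rise from AsstProf to Prof, and the spread for full professors should be the largest of the three.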

Next, let’s plot the relationship between years since Ph.D. and salary, grouping by sex and rank:

ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=sex)) + geom_point()

In the resulting graph (figure 19.9), academic rank is represented by point color (assistant professors in red, associate professors in green, and full professors in blue). Additionally, sex is indicated by point shape (circles for women and triangles for men). If you’re looking at a greyscale image, the color differences can be difficult to see; try running the code yourself. Note that reasonable legends are again produced automatically. Here you can see that income increases with years since graduation, but the relationship is by no means linear.

Figure 19.9 Scatterplot of years since graduation and salary. Academic rank is represented by color, and sex is represented by shape.

Finally, you can visualize the number of professors by rank and sex using a grouped bar chart. The following code provides three bar-chart variations, displayed in figure 19.10:

ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="stack") + labs(title='position="stack"')

ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="dodge") + labs(title='position="dodge"')

ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="fill") + labs(title='position="fill"')

Each of the plots in figure 19.10 emphasizes different aspects of the data. It’s clear from the first two graphs that there are many more full professors than members of other ranks. Additionally, there are more female full professors than female assistant or associate professors. The third graph indicates that the proportion of women among full professors is lower than in the other two ranks, even though the number of women is greatest in that rank.
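The counts and proportions behind the three bar charts can also be tabulated directly, which makes it easy to confirm what the charts suggest. This is a small sketch using base R's xtabs() and prop.table(); it assumes Salaries is available from the carData package.

```r
# Assumption: Salaries is loaded from carData.
data(Salaries, package = "carData")

# Counts of professors by rank (rows) and sex (columns):
# the numbers shown by position="stack" and position="dodge"
counts <- xtabs(~ rank + sex, data = Salaries)
print(counts)

# Row proportions: the share of each sex within a rank,
# which is what position="fill" displays
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Each row of props sums to 1, which is why every bar in the position="fill" chart has the same height.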

Figure 19.10 Three versions of a grouped bar chart. Each displays the number of professors by academic rank and sex. (Panels: position="stack", position="dodge", position="fill"; x-axis: rank; y-axis: count; fill legend: sex)

Note that the label on the y-axis for the third graph isn’t correct. It should say Proportion rather than count. You can correct this by adding y="Proportion" to the labs() function.
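As a sketch of that correction, the third plot can be rebuilt with the y label set inside labs(). The object name p_fill is chosen only for this illustration, and the code assumes Salaries is loaded from the carData package.

```r
library(ggplot2)

# Assumption: Salaries is loaded from carData.
data(Salaries, package = "carData")

# position="fill" rescales each bar to height 1, so the y-axis
# shows proportions; labs(y=...) replaces the default "count" label
p_fill <- ggplot(Salaries, aes(x = rank, fill = sex)) +
  geom_bar(position = "fill") +
  labs(title = 'position="fill"', y = "Proportion")

p_fill
```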

Options can be used in different ways, depending on whether they occur inside or outside the aes() function. Look at the following examples and try to guess what they do:

ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar()
ggplot(Salaries, aes(x=rank)) + geom_bar(fill="red")
ggplot(Salaries, aes(x=rank, fill="red")) + geom_bar()

In the first example, sex is a variable represented by fill color in the bar graph. In the second example, each bar is filled with the color red. In the third example, ggplot2 treats "red" as a data value to be mapped to a fill color rather than as the color itself, so every bar gets the first color of the default palette and a legend with a single entry labeled red appears, which is unexpected (and undesirable). In general, variables should go inside aes(), and assigned constants should go outside aes().
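One way to see the difference between the second and third examples is to inspect the fill values that ggplot2 actually computes. The sketch below uses ggplot_build(), which returns the data prepared for each layer; the object names are illustrative only, and Salaries is assumed to come from the carData package.

```r
library(ggplot2)

# Assumption: Salaries is loaded from carData.
data(Salaries, package = "carData")

# Constant set outside aes(): the bars are literally red
p_constant <- ggplot(Salaries, aes(x = rank)) + geom_bar(fill = "red")

# String placed inside aes(): "red" is treated as a one-level data value
p_mapped <- ggplot(Salaries, aes(x = rank, fill = "red")) + geom_bar()

# Fill colors assigned to the bars in the first (and only) layer of each plot
fill_constant <- unique(ggplot_build(p_constant)$data[[1]]$fill)
fill_mapped   <- unique(ggplot_build(p_mapped)$data[[1]]$fill)

print(fill_constant)
print(fill_mapped)
```

The first value is the requested color; the second is whatever color the default fill scale assigns to a single category, not the color named in the string.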

19.5 Faceting

Sometimes relationships are clearer if groups appear in side-by-side graphs rather than overlapping in a single graph. You can create trellis graphs (called faceted graphs in ggplot2) using the facet_wrap() and facet_grid() functions. The syntax is given in table 19.4, where var, rowvar, and colvar are factors.

Table 19.4 ggplot2 facet functions

Syntax                       Results
facet_wrap(~var, ncol=n)     Separate plots for each level of var, arranged into n columns
facet_wrap(~var, nrow=n)     Separate plots for each level of var, arranged into n rows
facet_grid(rowvar~colvar)    Separate plots for each combination of rowvar and colvar, where rowvar represents rows and colvar represents columns
facet_grid(rowvar~.)         Separate plots for each level of rowvar, arranged as a single column
facet_grid(.~colvar)         Separate plots for each level of colvar, arranged as a single row
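The facet forms listed in table 19.4 can be tried side by side on the Salaries data. This is a sketch only: hist_base and panel_dims() are hypothetical names introduced for the illustration, bins=20 is an arbitrary choice, and Salaries is assumed to come from the carData package. The helper reads the panel layout that ggplot_build() computes, so the row and column arrangement can be checked without drawing the plots.

```r
library(ggplot2)

# Assumption: Salaries is loaded from carData.
data(Salaries, package = "carData")

# A simple histogram to which each facet specification is added
hist_base <- ggplot(Salaries, aes(x = salary)) + geom_histogram(bins = 20)

p_wrap <- hist_base + facet_wrap(~rank, ncol = 1)  # one plot per rank, in 1 column
p_grid <- hist_base + facet_grid(sex ~ rank)       # sex in rows, rank in columns
p_col  <- hist_base + facet_grid(sex ~ .)          # one plot per sex, single column
p_row  <- hist_base + facet_grid(. ~ rank)         # one plot per rank, single row

# Hypothetical helper: number of panel rows and columns in a built plot
panel_dims <- function(p) {
  layout <- ggplot_build(p)$layout$layout
  c(rows = max(layout$ROW), cols = max(layout$COL))
}

print(sapply(list(wrap = p_wrap, grid = p_grid, col = p_col, row = p_row),
             panel_dims))
```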

 

 

 

 

Going back to the choral example, you can create a faceted graph using the following code:

data(singer, package="lattice")
library(ggplot2)
ggplot(data=singer, aes(x=height)) +
  geom_histogram() +
  facet_wrap(~voice.part, nrow=4)

The resulting plot (figure 19.11) displays the distribution of singer heights by voice part. Separating the eight distributions into their own small, side-by-side plots makes them easier to compare.

As a second example, let’s create a graph that has faceting and grouping:

library(ggplot2)

ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=rank)) + geom_point() + facet_grid(.~sex)

Figure 19.11 Faceted graph showing the distribution (histogram) of singer heights by voice part (one panel per voice part; x-axis: height; y-axis: count)

Figure 19.12 Scatterplot of years since graduation and salary. Academic rank is represented by color and shape, and sex is faceted. (Panels: Female, Male; x-axis: yrs.since.phd; y-axis: salary; legend: rank)

The resulting graph is presented in figure 19.12. It contains the same information as figure 19.9, but separating the plot into facets makes it somewhat easier to read.

Finally, try displaying the height distribution of choral members in the singer dataset separately for each voice part, using kernel-density plots stacked in a single column so that they share the horizontal (height) axis. Give each a different color. One solution is as follows:

data(singer, package="lattice")
library(ggplot2)
ggplot(data=singer, aes(x=height, fill=voice.part)) +
  geom_density() +
  facet_grid(voice.part~.)

The result is displayed in figure 19.13.

Note that stacking the plots along a shared horizontal axis facilitates comparisons among the groups. The colors aren’t strictly necessary, but they can aid in distinguishing the plots. (If you’re viewing this in greyscale, be sure to try the example yourself.)

Figure 19.13 Faceted density plots for singer heights by voice part (one panel per voice part; x-axis: height; y-axis: density; fill legend: voice.part)
