Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Robert I. Kabacoff - R in action

.pdf
Скачиваний:
97
Добавлен:
02.06.2015
Размер:
12.13 Mб
Скачать

66

CHAPTER 3 Getting started with graphs

boxplot(wt, main="Boxplot of wt") par(opar)

detach(mtcars)

The results are presented in figure 3.14.

As a second example, let’s arrange 3 plots in 3 rows and 1 column. Here’s the code:

attach(mtcars)

opar <- par(no.readonly=TRUE) par(mfrow=c(3,1))

hist(wt)

hist(mpg)

hist(disp)

par(opar)

detach(mtcars)

The graph is displayed infigure 3.15. Note that the high-level function hist()includes a default title (use main="" to suppress it, or ann=FALSE to suppress all titles and labels).

 

Scatterplot of wt vs. mpg

 

Scatterplot of wt vs disp

 

30

 

 

 

 

400

 

 

 

mpg

20 25

 

 

 

disp

200 300

 

 

 

 

15

 

 

 

 

100

 

 

 

 

10

 

 

 

 

 

 

 

 

2

3

4

5

 

2

3

4

5

 

 

 

wt

 

 

 

 

wt

 

 

 

Histogram of wt

 

 

 

Boxplot of wt

 

 

8

 

 

 

 

5

 

 

 

Frequency

4 6

 

 

 

 

3 4

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

0

 

 

 

 

 

 

 

 

 

2

3

4

5

 

 

 

 

 

 

 

 

wt

 

 

 

 

 

 

Figure 3.14 Graph combining four figures through par(mfrow=c(2,2))

Combining graphs

67

Histogram of wt

 

8

Frequency

2 4 6

 

0

2

3

4

5

wt

Histogram of mpg

 

12

Frequency

4 6 8

 

2

 

0

10

15

20

25

30

35

mpg

Histogram of disp

 

6

Frequency

2 4

 

0

100

200

300

400

500

disp

Figure 3.15 Graph combining with three figures through par(mfrow=c(3,1))

The layout() function has the form layout(mat) where mat is a matrix object specifying the location of the multiple plots to combine. In the following code, one figure is placed in row 1 and two figures are placed in row 2:

attach(mtcars)

layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE)) hist(wt)

hist(mpg)

hist(disp)

detach(mtcars)

The resulting graph is presented in figure 3.16.

Optionally, you can include widths= and heights= options in the layout() function to control the size of each figure more precisely. These options have the form

widths = a vector of values for the widths of columns heights = a vector of values for the heights of rows

Relative widths are specified with numeric values. Absolute widths (in centimeters) are specified with the lcm() function.

68

 

 

 

 

 

 

 

 

 

 

 

CHAPTER 3

Getting started with graphs

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Histogram of wt

 

 

 

 

 

 

 

 

 

 

 

 

 

8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

4 6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

3

 

 

 

4

 

 

 

 

 

 

5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

wt

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Histogram of mpg

 

 

 

 

 

 

 

 

 

 

 

Histogram of disp

 

 

 

12

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

7

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

4 6 8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3 4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10

15

20

25

30

35

 

 

100

200

300

400

 

500

 

 

 

 

 

 

 

 

 

 

mpg

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

disp

 

 

 

 

 

 

 

Figure 3.16 Graph combining three figures using the layout() function with default widths

In the following code, one figure is again placed in row 1 and two figures are placed in row 2. But the figure in row 1 is one-third the height of the figures in row 2. Additionally, the figure in the bottom-right cell is one-fourth the width of the figure in the bottom-left cell:

attach(mtcars)

layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), widths=c(3, 1), heights=c(1, 2))

hist(wt)

hist(mpg)

hist(disp)

detach(mtcars)

The graph is presented in figure 3.17.

As you can see, the layout() function gives you easy control over both the number and placement of graphs in a final image and the relative sizes of these graphs. See help(layout) for more details.

 

 

 

 

 

 

 

 

 

 

 

 

 

Combining graphs

 

 

 

 

 

 

 

 

 

 

 

69

 

 

 

 

 

 

 

 

 

 

 

 

 

Histogram of wt

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

4 8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

3

 

 

4

 

 

 

 

5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

wt

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Histogram of mpg

 

 

 

 

 

 

 

 

Histogram of disp

 

12

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

7

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

5

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

8

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Frequency

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

3 4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

4

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

10

15

20

25

30

 

35

 

 

100

400

 

 

 

 

 

 

 

 

 

 

 

 

 

mpg

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

disp

Figure 3.17 Graph combining three figures using the layout() function with specified widths

3.5.1Creating a figure arrangement with fine control

There are times when you want to arrange or superimpose several figures to create a single meaningful plot. Doing so requires fine control over the placement of the figures. You can accomplish this with the fig= graphical parameter. In the following listing, two box plots are added to a scatter plot to create a single enhanced graph. The resulting graph is shown in figure 3.18.

Listing 3.4 Fine placement of figures in a graph

opar <- par(no.readonly=TRUE)

 

 

par(fig=c(0, 0.8, 0, 0.8))

 

Set up scatter plot

 

plot(mtcars$wt, mtcars$mpg,

 

 

xlab="Miles Per Gallon",

 

 

ylab="Car Weight")

 

 

par(fig=c(0, 0.8, 0.55, 1), new=TRUE)

 

Add box plot above

 

boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE)

 

 

70

CHAPTER 3 Getting started with graphs

 

 

 

par(fig=c(0.65, 1, 0, 0.8), new=TRUE)

 

Add box plot to right

 

 

boxplot(mtcars$mpg, axes=FALSE)

mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3) par(opar)

To understand how this graph was created, think of the full graph area as going from (0,0) in the lower-left corner to (1,1) in the upper-right corner. Figure 3.19 will help you visualize this. The format of the fig= parameter is a numerical vector of the form c(x1, x2, y1, y2).

The first fig= sets up the scatter plot going from 0 to 0.8 on the x-axis and 0 to 0.8 on the y-axis. The top box plot goes from 0 to 0.8 on the x-axis and 0.55 to 1 on the y-axis. The right-hand box plot goes from 0.65 to 1 on the x-axis and 0 to 0.8 on the y-axis. fig= starts a new plot, so when adding a figure to an existing graph, include the new=TRUE option.

I chose 0.55 rather than 0.8 so that the top figure would be pulled closer to the scatter plot. Similarly, I chose 0.65 to pull the right-hand box plot closer to the scatter plot. You have to experiment to get the placement right.

Enhanced Scatterplot

 

30

Car Weight

25

20

 

15

 

10

2

3

4

5

Miles Per Gallon

Figure 3.18 A scatter plot with two box plots added to the margins

Summary

71

(1,1)

y2

y1

x1

x2

(0,0)

Figure 3.19 Specifying locations using the fig= graphical parameter

NOTE The amount of space needed for individual subplots can be device dependent. If you get “Error in plot.new(): figure margins too large,” try varying the area given for each portion of the overall graph.

You can use fig= graphical parameter to combine several plots into any arrangement within a single graph. With a little practice, this approach gives you a great deal of flexibility when creating complex visual presentations.

3.6Summary

In this chapter, we reviewed methods for creating graphs and saving them in a variety of formats. The majority of the chapter was concerned with modifying the default graphs produced by R, in order to arrive at more useful or attractive plots. You learned how to modify a graph’s axes, fonts, symbols, lines, and colors, as well as how to add titles, subtitles, labels, plotted text, legends, and reference lines. You saw how to specify the size of the graph and margins, and how to combine multiple graphs into a single useful image.

Our focus in this chapter was on general techniques that you can apply to all graphs (with the exception of lattice graphs in chapter 16). Later chapters look at specific types of graphs. For example, chapter 7 covers methods for graphing a single variable. Graphing relationships between variables will be described in chapter 11. In chapter 16, we discuss advanced graphic methods, including lattice graphs (graphs that display the relationship between variables, for each level of other variables) and interactive graphs. Interactive graphs let you use the mouse to dynamically explore the plotted relationships.

In other chapters, we’ll discuss methods of visualizing data that are particularly useful for the statistical approaches under consideration. Graphs are a central part of

72

CHAPTER 3 Getting started with graphs

modern data analysis, and I’ll endeavor to incorporate them into each of the statistical approaches we discuss.

In the previous chapter we discussed a range of methods for inputting or importing data into R. Unfortunately, in the real world your data is rarely usable in the format in which you first get it. In the next chapter we look at ways to transform and massage our data into a state that’s more useful and conducive to analysis.

Basic data management4

This chapter covers

Manipulating dates and missing values

Understanding data type conversions

Creating and recoding variables

Sor ting, merging, and subsetting datasets

Selecting and dropping variables

In chapter 2, we covered a variety of methods for importing data into R. Unfortunately, getting our data in the rectangular arrangement of a matrix or data frame is the first step in preparing it for analysis. To paraphrase Captain Kirk in the Star Trek episode “A Taste of Armageddon” (and proving my geekiness once and for all): “Data is a messy business—a very, very messy business.” In my own work, as much as 60 percent of the time I spend on data analysis is focused on preparing the data for analysis. I’ll go out a limb and say that the same is probably true in one form or another for most real-world data analysts. Let’s take a look at an example.

4.1A working example

One of the topics that I study in my current job is how men and women differ in the ways they lead their organizations. Typical questions might be

Do men and women in management positions differ in the degree to which they defer to superiors?

73

74

CHAPTER 4 Basic data management

Does this vary from country to country, or are these gender differences universal?

One way to address these questions is to have bosses in multiple countries rate their managers on deferential behavior, using questions like the following:

This manager asks my opinion before making personnel decisions.

 

1

2

3

4

5

strongly

disagree

neither agree

agree

strongly

disagree

 

nor disagree

 

agree

The resulting data might resemble those in table 4.1. Each row represents the ratings given to a manager by his or her boss.

Table 4.1 Gender differences in leadership behavior

Manager

Date

Country

Gender

Age

q1

q2

q3

q4

q5

 

 

 

 

 

 

 

 

 

 

1

10/24/08

US

M

32

5

4

5

5

5

2

10/28/08

US

F

45

3

5

2

5

5

3

10/01/08

UK

F

25

3

5

5

5

2

4

10/12/08

UK

M

39

3

3

4

 

 

5

05/01/09

UK

F

99

2

2

1

2

1

 

 

 

 

 

 

 

 

 

 

Here, each manager is rated by their boss on five statements (q1 to q5) related to deference to authority. For example, manager 1 is a 32-year-old male working in the US and is rated deferential by his boss, while manager 5 is a female of unknown age (99 probably indicates missing) working in the UK and is rated low on deferential behavior. The date column captures when the ratings were made.

Although a dataset might have dozens of variables and thousands of observations, we’ve only included 10 columns and 5 rows to simplify the examples. Additionally, we’ve limited the number of items pertaining to the managers’ deferential behavior to 5. In a real-world study, you’d probably use 10–20 such items to improve the reliability and validity of the results. You can create a data frame containing the data in table 4.1 using the following code.

Listing 4.1 Creating the leadership data frame

manager <- c(1, 2, 3, 4, 5)

date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")

country <-

c("US", "US", "UK", "UK", "UK")

gender <- c("M",

"F", "F", "M", "F")

age <- c(32, 45,

25, 39, 99)

q1

<- c(5,

3, 3,

3, 2)

q2

<- c(4,

5, 5,

3, 2)

q3

<- c(5,

2, 5,

4, 1)

q4

<- c(5,

5, 5,

NA, 2)

q5

<- c(5,

5, 2,

NA, 1)

leadership

<- data.frame(manager, date, country, gender, age,

q1, q2, q3, q4, q5, stringsAsFactors=FALSE)

Creating new variables

75

In order to address the questions of interest, we must first address several data management issues. Here’s a partial list:

The five ratings (q1 to q5) will need to be combined, yielding a single mean deferential score from each manager.

In surveys, respondents often skip questions. For example, the boss rating manager 4 skipped questions 4 and 5. We’ll need a method of handling incomplete data. We’ll also need to recode values like 99 for age to missing.

There may be hundreds of variables in a dataset, but we may only be interested in a few. To simplify matters, we’ll want to create a new dataset with only the variables of interest.

Past research suggests that leadership behavior may change as a function of the manager’s age. To examine this, we may want to recode the current values of age into a new categorical age grouping (for example, young, middle-aged, elder).

Leadership behavior may change over time. We might want to focus on deferential behavior during the recent global financial crisis. To do so, we may want to limit the study to data gathered during a specific period of time (say, January 1, 2009 to December 31, 2009).

We’ll work through each of these issues in the current chapter, as well as other basic data management tasks such as combining and sorting datasets. Then in chapter 5 we’ll look at some advanced topics.

4.2Creating new variables

In a typical research project, you’ll need to create new variables and transform existing ones. This is accomplished with statements of the form

variable <- expression

A wide array of operators and functions can be included in the expression portion of the statement. Table 4.2 lists R’s arithmetic operators. Arithmetic operators are used when developing formulas.

Table 4.2 Arithmetic operators

Operator

Description

 

 

+

Addition

-

Subtraction

*

Multiplication

/

Division

^ or **

Exponentiation

x%%y

Modulus (x mod y) 5%%2 is 1

x%/%y

Integer division 5%/%2 is 2

 

 

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]