
Robert I. Kabacoff - R in action



CHAPTER 7 Basic statistics

You have to give R more information to find it.

Now that you know how to generate descriptive statistics for the data as a whole, let’s review how to obtain statistics for subgroups of the data.

7.1.2 Descriptive statistics by group

When comparing groups of individuals or observations, the focus is usually on the descriptive statistics of each group, rather than the total sample. Again, there are several ways to accomplish this in R. We’ll start by getting descriptive statistics for each level of transmission type.

In chapter 5, we discussed methods of aggregating data. You can use the aggregate() function (section 5.6.2) to obtain descriptive statistics by group, as shown in the following listing.

Listing 7.6 Descriptive statistics by group using aggregate()

> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
  am  mpg  hp   wt
1  0 17.1 160 3.77
2  1 24.4 127 2.41
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
  am  mpg   hp    wt
1  0 3.83 53.9 0.777
2  1 6.17 84.1 0.617

Note the use of list(am=mtcars$am). If you had used list(mtcars$am), the am column would have been labeled Group.1 rather than am. Naming the list element provides a more useful column label. If you have more than one grouping variable, you can use code like by=list(name1=groupvar1, name2=groupvar2, ..., groupvarN).
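As an illustration of the multi-variable form, the following minimal sketch groups by both transmission type and number of cylinders. It assumes vars holds the three column names used throughout this section (mpg, hp, wt); the name grp_means is only illustrative.

```r
# Assumption: vars matches the variables summarized in the listings of this section.
vars <- c("mpg", "hp", "wt")

# Two named grouping variables; the list names become the column labels.
grp_means <- aggregate(mtcars[vars],
                       by = list(am = mtcars$am, cyl = mtcars$cyl),
                       FUN = mean)
print(grp_means)
```

Each am x cyl combination that occurs in the data gets one row, with the grouping columns first.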

Unfortunately, aggregate() only allows you to use single-value functions such as mean(), sd(), and the like in each call. It won’t return several statistics at once. For that task, you can use the by() function. The format is

by(data, INDICES, FUN)

where data is a data frame or matrix, INDICES is a factor or list of factors that define the groups, and FUN is an arbitrary function. This next listing provides an example.

Listing 7.7 Descriptive statistics by group using by()

> dstats <- function(x)(c(mean=mean(x), sd=sd(x)))
> by(mtcars[vars], mtcars$am, dstats)
mtcars$am: 0
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  17.147  160.263    3.769    3.834   53.908    0.777
------------------------------------------------
mtcars$am: 1
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  24.392  126.846    2.411    6.167   84.062    0.617
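In current R releases, mean() and sd() generally do not accept a whole data frame, so the dstats() definition in listing 7.7 may return NA or stop with an error. A hedged alternative, not the author's code, applies the statistics column by column with sapply(). The vector vars is assumed to hold mpg, hp, and wt.

```r
# Assumption: vars matches the variables summarized in listing 7.7.
vars <- c("mpg", "hp", "wt")

# Compute mean and sd separately for each column of the subgroup data frame.
dstats <- function(x) sapply(x, function(col) c(mean = mean(col), sd = sd(col)))

res <- by(mtcars[vars], mtcars$am, dstats)
print(res)
```

Each group's result is a small matrix with one row per statistic and one column per variable, rather than the single named vector shown above.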

Descriptive statistics


EXTENSIONS

The doBy package and the psych package also provide functions for descriptive statistics by group. Again, they aren’t distributed in the base installation and must be installed before first use. The summaryBy() function in the doBy package has the format

summaryBy(formula, data=dataframe, FUN=function)

where the formula takes the form

var1 + var2 + var3 + ... + varN ~ groupvar1 + groupvar2 + … + groupvarN

Variables on the left of the ~ are the numeric variables to be analyzed and variables on the right are categorical grouping variables. The function can be any built-in or user-created R function. An example using the mystats() function you created in section 7.1.1 is shown in the following listing.

Listing 7.8 Summary statistics by group using summaryBy() in the doBy package

> library(doBy)
> summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)
  am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev
1  0    19     17.1      3.83   0.0140       -0.803   19     160     53.9
2  1    13     24.4      6.17   0.0526       -1.455   13     127     84.1
  hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1 -0.0142      -1.210   19    3.77    0.777   0.976       0.142
2  1.3599       0.563   13    2.41    0.617   0.210      -1.174

The describe.by() function contained in the psych package provides the same descriptive statistics as describe(), stratified by one or more grouping variables, as you can see in the following listing.

Listing 7.9 Summary statistics by group using describe.by() in the psych package

> library(psych)
> describe.by(mtcars[vars], mtcars$am)
group: 0
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40
hp    2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00
wt    3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42
     range  skew kurtosis    se
mpg  14.00  0.01    -0.80  0.88
hp  183.00 -0.01    -1.21 12.37
wt    2.96  0.98     0.14  0.18
------------------------------------------------
group: 1
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90
hp    2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00
wt    3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57
     range  skew kurtosis    se
mpg  18.90  0.05    -1.46  1.71
hp  283.00  1.36     0.56 23.31
wt    2.06  0.21    -1.17  0.17

Unlike the previous example, the describe.by() function doesn’t allow you to specify an arbitrary function, so it’s less generally applicable. If there’s more than one grouping variable, you can write them as list(groupvar1, groupvar2, ..., groupvarN). But this will only work if there are no empty cells when the grouping variables are crossed.

Finally, you can use the reshape package described in section 5.6.3 to derive descriptive statistics by group in a flexible way. (If you haven’t read that section, I suggest you review it before continuing.) First, you melt the data frame using

dfm <- melt(dataframe, measure.vars=y, id.vars=g)

where dataframe contains the data, y is a vector indicating the numeric variables to be summarized (the default is to use all), and g is a vector of one or more grouping variables. You then cast the data using

cast(dfm, groupvar1 + groupvar2 + … + variable ~ ., FUN)

where the grouping variables are separated by + signs, the word variable is entered exactly as is, and FUN is an arbitrary function.

In the final example of this section, we’ll apply the reshape approach to obtaining descriptive statistics for each subgroup formed by transmission type and number of cylinders. For descriptive statistics, we’ll get the sample size, mean, and standard deviation. The code and results are shown in the following listing.

Listing 7.10 Summary statistics by group via the reshape package

> library(reshape)
> dstats <- function(x)(c(n=length(x), mean=mean(x), sd=sd(x)))
> dfm <- melt(mtcars, measure.vars=c("mpg", "hp", "wt"),
              id.vars=c("am", "cyl"))
> cast(dfm, am + cyl + variable ~ ., dstats)
   am cyl variable  n   mean     sd
1   0   4      mpg  3  22.90  1.453
2   0   4       hp  3  84.67 19.655
3   0   4       wt  3   2.94  0.408
4   0   6      mpg  4  19.12  1.632
5   0   6       hp  4 115.25  9.179
6   0   6       wt  4   3.39  0.116
7   0   8      mpg 12  15.05  2.774
8   0   8       hp 12 194.17 33.360
9   0   8       wt 12   4.10  0.768
10  1   4      mpg  8  28.07  4.484
11  1   4       hp  8  81.88 22.655
12  1   4       wt  8   2.04  0.409
13  1   6      mpg  3  20.57  0.751
14  1   6       hp  3 131.67 37.528
15  1   6       wt  3   2.75  0.128
16  1   8      mpg  2  15.40  0.566
17  1   8       hp  2 299.50 50.205
18  1   8       wt  2   3.37  0.283
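If the reshape package is not available, a similar summary can be approximated in base R. The sketch below swaps in a different technique (split() plus sapply()) rather than melt() and cast(), so the layout differs: it returns one matrix per variable, with rows labeled am.cyl. The names vars and res are only illustrative.

```r
# Sample size, mean, and standard deviation for one numeric vector.
dstats <- function(x) c(n = length(x), mean = mean(x), sd = sd(x))

vars <- c("mpg", "hp", "wt")  # assumption: the variables summarized in this section

# For each variable, split its values by am x cyl and summarize each piece.
res <- lapply(mtcars[vars], function(v) {
  groups <- split(v, list(am = mtcars$am, cyl = mtcars$cyl))
  t(sapply(groups, dstats))
})

round(res$mpg, 3)  # rows are named "am.cyl", for example "0.4"
```

The row labeled 0.4 of res$mpg corresponds to the first row of the cast() output (automatic transmission, four cylinders).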

 

Personally, I find this approach the most compact and appealing. Data analysts have their own preferences for which descriptive statistics to display and how they like to see them formatted. This is probably why there are many variations available. Choose the one that works best for you, or create your own!

7.1.3 Visualizing results

Numerical summaries of a distribution’s characteristics are important, but they’re no substitute for a visual representation. For quantitative variables you have histograms (section 6.3), density plots (section 6.4), box plots (section 6.5), and dot plots (section 6.6). They can provide insights that are easily missed by reliance on a small set of descriptive statistics.
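As a brief illustration of pairing the group summaries with a plot, this sketch draws box plots of mpg by transmission type. The axis labels are illustrative choices; the coding 0 = automatic and 1 = manual follows the mtcars documentation.

```r
# Box plots of mpg by transmission type (0 = automatic, 1 = manual).
bp <- boxplot(mpg ~ am, data = mtcars,
              names = c("Automatic", "Manual"),
              xlab = "Transmission", ylab = "Miles per gallon")

# boxplot() invisibly returns the plotted statistics, including group sizes.
bp$n
```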

The functions considered so far provide summaries of quantitative variables. The functions in the next section allow you to examine the distributions of categorical variables.

7.2 Frequency and contingency tables

In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.

The data for this section come from the Arthritis dataset included with the vcd package. The data are from Koch & Edwards (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:

> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, we’ll create frequency and contingency tables (cross-classifications) from the data.


7.2.1 Generating frequency tables

R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.

Table 7.1 Functions for creating and manipulating contingency tables

Function                      Description
table(var1, var2, …, varN)    Creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data)          Creates an N-way contingency table based on a formula and a matrix or data frame
prop.table(table, margins)    Expresses table entries as fractions of the marginal table defined by the margins
margin.table(table, margins)  Computes the sum of table entries for a marginal table defined by the margins
addmargins(table, margins)    Puts summary margins (sums by default) on a table
ftable(table)                 Creates a compact "flat" contingency table

 

 

In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or the xtabs() function, then manipulate it using the other functions.

ONE-WAY TABLES

You can generate simple frequency counts using the table() function. Here’s an example:

> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28

You can turn these frequencies into proportions with prop.table():

> prop.table(mytable)
Improved
  None   Some Marked
 0.500  0.167  0.333

or into percentages, using prop.table()*100:

> prop.table(mytable)*100
Improved
  None   Some Marked
  50.0   16.7   33.3


Here you can see that 50 percent of study participants had some or marked improvement (16.7 + 33.3).
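The arithmetic above can be reproduced with a short sketch. To keep it runnable without the vcd package, it rebuilds the Improved factor from the counts printed in this section (an assumption made only for illustration); pct is an illustrative name.

```r
# Assumption: rebuild Improved from the printed counts
# (42 None, 14 Some, 28 Marked) instead of loading Arthritis from vcd.
Improved <- factor(rep(c("None", "Some", "Marked"), times = c(42, 14, 28)),
                   levels = c("None", "Some", "Marked"))

mytable <- table(Improved)
pct <- round(prop.table(mytable) * 100, 1)  # percentages, one decimal place
pct

# Share of participants with some or marked improvement.
sum(pct[c("Some", "Marked")])
```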

TWO-WAY TABLES

For two-way tables, the format for the table() function is

mytable <- table(A, B)

where A is the row variable, and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula style input. The format is

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).

For the Arthritis data, you have

> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
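The frequency-vector form of the formula, with a variable on the left of the ~, can be sketched as follows. The data frame counts and its Freq column are hypothetical names; the cell counts are copied from the table above so the example runs without the vcd package.

```r
# Assumption: already-tabulated data, with counts copied from the table above.
counts <- data.frame(
  Treatment = rep(c("Placebo", "Treated"), each = 3),
  Improved  = factor(rep(c("None", "Some", "Marked"), times = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 7, 7, 13, 7, 21)
)

# Freq on the left of ~ is summed within each Treatment x Improved cell.
mytable <- xtabs(Freq ~ Treatment + Improved, data = counts)
mytable
```

The result has the same shape as the table built from the raw observations, so the functions in table 7.1 apply to it in the same way.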

You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have

> margin.table(mytable, 1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512

The index (1) refers to the first variable in the xtabs() formula (Treatment). Looking at the table, you can see that 51 percent of treated individuals had marked improvement, compared to 16 percent of those receiving a placebo.

For column sums and column proportions, you have

> margin.table(mytable, 2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable, 2)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750

Here, the index (2) refers to the second variable in the xtabs() formula (Improved). Cell proportions are obtained with this statement:

> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500

You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a sum row and column:

> addmargins(mytable)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000

When using addmargins(), the default is to create sum margins for all variables in a table. In contrast:

> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000

adds a sum column alone. Similarly,

> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
  Sum     1.000 1.000  1.000

adds a sum row. In the table, you see that 25 percent of those patients with marked improvement received a placebo.

NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".
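A small sketch of the difference, using a hypothetical vector named response that contains two missing values:

```r
# Hypothetical responses; two of the six are missing.
response <- c("None", "Some", NA, "Marked", "None", NA)

table(response)                              # NAs are dropped: counts sum to 4

with_na <- table(response, useNA = "ifany")  # adds an <NA> category
with_na
```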

A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. See listing 7.11 for an example.


Listing 7.11 Two-way table using CrossTable

> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|

The CrossTable() function has options to report percentages (row, column, cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.
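To show where the "Chi-square contribution" line of each cell comes from, the following sketch recomputes it in base R with chisq.test(). This is a swapped-in technique, not the CrossTable() internals, and the counts are copied from the Treatment x Improved table so the code runs without vcd or gmodels.

```r
# Assumption: counts copied from the Treatment x Improved table in this section.
mytable <- as.table(matrix(c(29, 7,  7,
                             13, 7, 21),
                           nrow = 2, byrow = TRUE,
                           dimnames = list(Treatment = c("Placebo", "Treated"),
                                           Improved  = c("None", "Some", "Marked"))))

fit <- chisq.test(mytable)

# Squared Pearson residuals: (observed - expected)^2 / expected
contrib <- fit$residuals^2
round(contrib, 3)

# The contributions add up to the overall chi-square statistic.
sum(contrib)
```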

If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.

MULTIDIMENSIONAL TABLES

Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in listing 7.12.


Listing 7.12 Three-way contingency table

> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> mytable                                   #1 Cell frequencies
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5

> margin.table(mytable, 1)                  #2 Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28

> margin.table(mytable, c(1, 3))            #3 Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

> ftable(prop.table(mytable, c(1, 2)))      #4 Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357

> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000

The first part of the listing (callout 1: printing mytable and then ftable(mytable)) produces cell frequencies for the three-way classification. It also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.

The second part (callout 2: the three margin.table() calls with a single index) produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.

The third part (callout 3: margin.table(mytable, c(1, 3))) produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment x Sex combination is provided by the fourth part (callout 4: ftable(prop.table(mytable, c(1, 2)))). Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index.
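The claim that proportions add to one over the omitted index can be checked directly. The sketch below rebuilds the three-way table from the cell counts in listing 7.12 (an assumption made so the code runs without the vcd package) and sums the proportions over Improved; p is an illustrative name.

```r
# Assumption: cell counts copied from the three-way listing in this section.
# Values fill Treatment fastest, then Sex, then Improved.
mytable <- as.table(array(
  c(19,  6, 10, 7,    # Improved = None:   Placebo/Treated for Female, then Male
     7,  5,  0, 2,    # Improved = Some
     6, 16,  1, 5),   # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))))

p <- prop.table(mytable, c(1, 2))   # proportions within each Treatment x Sex cell
round(p["Treated", , "Marked"], 3)  # treated females versus treated males

# Summing over the omitted index (Improved) gives 1 for every combination.
apply(p, c(1, 2), sum)
```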

If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:

ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

would produce this table:

 

 

                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9   18.8 100.0
          Male             90.9   0.0    9.1 100.0
Treated   Female           22.2  18.5   59.3 100.0
          Male             50.0  14.3   35.7 100.0

While contingency tables tell you the frequency or proportions of cases for each combination of the variables that comprise the table, you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.
