Добавил:
Upload Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:
R in Action, Second Edition.pdf
Скачиваний:
540
Добавлен:
26.03.2016
Размер:
20.33 Mб
Скачать

144

CHAPTER 7 Basic statistics

Listing 7.9 Summary statistics by group using describe.by() in the psych package

>library(psych)

>myvars <- c("mpg", "hp", "wt")

>describeBy(mtcars[myvars], list(am=mtcars$am))

am: 0

 

 

 

 

 

 

 

 

 

 

var

n

mean

sd

median

trimmed

mad

min

max

mpg

1

19

17.15

3.83

17.30

17.12

3.11

10.40

24.40

hp

2

19

160.26

53.91

175.00

161.06

77.10

62.00

245.00

wt

3

19

3.77

0.78

3.52

3.75

0.45

2.46

5.42

 

range

 

skew

kurtosis

se

 

 

 

 

mpg

14.00

 

0.01

-0.80

0.88

 

 

 

 

hp

183.00

 

-0.01

-1.21

12.37

 

 

 

 

wt

2.96

 

0.98

0.14

0.18

 

 

 

 

----------------------------------------------------------------------------

am: 1

 

 

 

 

 

 

 

 

 

 

 

var

n

mean

sd

median

trimmed

mad

min

max

mpg

1

13

24.39

6.17

22.80

24.38

6.67

15.00

33.90

hp

2

13

126.85

84.06

109.00

114.73

63.75

52.00

335.00

wt

3

13

 

2.41

0.62

2.32

2.39

0.68

1.51

3.57

 

range

 

skew

kurtosis

se

 

 

 

 

mpg

18.90

 

0.05

 

-1.46

1.71

 

 

 

 

hp

283.00

 

1.36

 

0.56

23.31

 

 

 

 

wt

2.06

 

0.21

 

-1.17

0.17

 

 

 

 

Unlike the previous example, the describeBy() function doesn’t allow you to specify an arbitrary function, so it’s less generally applicable. If there’s more than one grouping variable, you can write them as list(name1=groupvar1, name2=groupvar2, ... , nameN=groupvarN). But this will work only if there are no empty cells when the grouping variables are crossed.

Data analysts have their own preferences for which descriptive statistics to display and how they like to see them formatted. This is probably why there are many variations available. Choose the one that works best for you, or create your own!

7.1.5Visualizing results

Numerical summaries of a distribution’s characteristics are important, but they’re no substitute for a visual representation. For quantitative variables, you have histograms (section 6.3), density plots (section 6.4), box plots (section 6.5), and dot plots (section 6.6). They can provide insights that are easily missed by reliance on a small set of descriptive statistics.

The functions considered so far provide summaries of quantitative variables. The functions in the next section allow you to examine the distributions of categorical variables.

7.2Frequency and contingency tables

In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for

Frequency and contingency tables

145

graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.

The data for this section come from the Arthritis dataset included with the vcd package. The data are from Kock & Edward (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:

>library(vcd)

>head(Arthritis)

 

ID

Treatment

Sex

Age

Improved

1

57

Treated

Male

27

Some

2

46

Treated

Male

29

None

3

77

Treated

Male

30

None

4

17

Treated

Male

32

Marked

5

36

Treated

Male

46

Marked

6

23

Treated

Male

58

Marked

Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, you’ll create frequency and contingency tables (cross-classifications) from the data.

7.2.1Generating frequency tables

R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.

Table 7.1 Functions for creating and manipulating contingency tables

Function

Description

 

 

table(var1, var2, ..., varN)

Creates an N-way contingency table from N categorical vari-

 

ables (factors)

xtabs(formula, data)

Creates an N-way contingency table based on a formula and a

 

matrix or data frame

prop.table(table, margins)

Expresses table entries as fractions of the marginal table

 

defined by the margins

margin.table(table, margins)

Computes the sum of table entries for a marginal table

 

defined by the margins

addmargins(table, margins)

Puts summary margins (sums by default) on a table

ftable(table)

Creates a compact, “flat” contingency table

 

 

In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or xtabs() function and then manipulate it using the other functions.

146

CHAPTER 7 Basic statistics

ONE-WAY TABLES

You can generate simple frequency counts using the table() function. Here’s an example:

>mytable <- with(Arthritis, table(Improved))

>mytable

Improved

None Some Marked

42 14 28

You can turn these frequencies into proportions with prop.table()

> prop.table(mytable) Improved

None Some Marked 0.500 0.167 0.333

or into percentages using prop.table()*100:

> prop.table(mytable)*100 Improved

None Some Marked

50.016.7 33.3

Here you can see that 50% of study participants had some or marked improvement (16.7 + 33.3).

TWO-WAY TABLES

For two-way tables, the format for the table() function is

mytable <- table(A, B)

where A is the row variable and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula-style input. The format is

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).

For the Arthritis data, you have

>mytable <- xtabs(~ Treatment+Improved, data=Arthritis)

>mytable

 

Improved

 

Treatment

None

Some

Marked

Placebo

29

7

7

Treated

13

7

21

You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have

> margin.table(mytable, 1) Treatment

Placebo Treated 3 41

Frequency and contingency tables

147

> prop.table(mytable, 1)

Improved

Treatment None Some Marked

Placebo 0.674 0.163 0.163

Treated 0.317 0.171 0.512

The index (1) refers to the first variable in the table() statement. Looking at the table, you can see that 51% of treated individuals had marked improvement, compared to 16% of those receiving a placebo.

For column sums and column proportions, you have

> margin.table(mytable, 2)

Improved

 

 

 

None

Some

Marked

 

42

14

28

 

> prop.table(mytable, 2)

 

 

Improved

 

Treatment

None

Some

Marked

Placebo

0.690

0.500

0.250

Treated

0.310

0.500

0.750

Here, the index (2) refers to the second variable in the table() statement. Cell proportions are obtained with this statement:

> prop.table(mytable)

Improved

Treatment None Some Marked

Placebo 0.3452 0.0833 0.0833

Treated 0.1548 0.0833 0.2500

You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a Sum row and column:

> addmargins(mytable)

 

 

 

Improved

 

 

Treatment

None

Some

Marked

Sum

Placebo

29

7

7

43

Treated

13

7

21

41

Sum

42

14

28

84

> addmargins(prop.table(mytable))

 

 

Improved

 

 

Treatment

None

Some

Marked

Sum

Placebo 0.3452 0.0833 0.0833 0.5119

Treated

0.1548

0.0833

0.2500

0.4881

Sum

0.5000

0.1667

0.3333

1.0000

When using addmargins(), the default is to create sum margins for all variables in a table. In contrast, the following code adds a Sum column alone:

> addmargins(prop.table(mytable, 1), 2)

Improved

Treatment None Some Marked Sum

Placebo 0.674 0.163 0.163 1.000

Treated 0.317 0.171 0.512 1.000

148

CHAPTER 7 Basic statistics

Similarly, this code adds a Sum row:

> addmargins(prop.table(mytable, 2), 1)

 

Improved

 

Treatment

None

Some

Marked

Placebo

0.690

0.500

0.250

Treated

0.310

0.500

0.750

Sum

1.000

1.000

1.000

In the table, you see that 25% of those patients with marked improvement received a placebo.

NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".

A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. The following listing shows an example.

Listing 7.10 Two-way table using CrossTable

>library(gmodels)

>CrossTable(Arthritis$Treatment, Arthritis$Improved)

 

Cell Contents

 

 

 

 

 

 

 

 

 

|

-------------------------

 

 

|

 

 

 

 

 

 

|

 

 

N

|

 

 

 

 

 

 

| Chi-square contribution

|

 

 

 

 

 

 

|

N / Row

Total

|

 

 

 

 

 

 

|

N / Col

Total

|

 

 

 

 

 

 

|

N / Table

Total

|

 

 

 

 

 

 

|-------------------------

 

 

 

|

 

 

 

 

 

 

Total Observations in

Table: 84

 

 

 

 

 

 

 

 

|

Arthritis$Improved

 

 

 

 

Arthritis$Treatment

|

 

None |

Some |

Marked | Row Total |

-------------------- -----------

 

|

 

 

|-----------

 

|-----------

 

|-----------

|

 

Placebo

|

 

29

|

7

|

7

|

43 |

 

 

|

 

2.616

|

0.004

|

3.752

|

|

 

 

|

 

0.674

|

0.163

|

0.163

|

0.512 |

 

 

|

 

0.690

|

0.500

|

0.250

|

|

 

 

|

 

0.345

|

0.083

|

0.083

|

|

-------------------- -----------

 

|

 

 

|-----------

 

|-----------

 

|-----------

|

 

Treated

|

 

13

|

7

|

21

|

41 |

 

 

|

 

2.744

|

0.004

|

3.935

|

|

 

 

|

 

0.317

|

0.171

|

0.512

|

0.488 |

 

 

|

 

0.310

|

0.500

|

0.750

|

|

 

 

|

 

0.155

|

0.083

|

0.250

|

|

-------------------- -----------

 

|

 

 

|-----------

 

|-----------

 

|-----------

|

 

Column Total

|

 

42

|

14

|

28

|

84 |

 

 

|

 

0.500

|

0.167

|

0.333

|

|

-------------------- -----------

 

|

 

 

|-----------

 

|-----------

 

|-----------

|

Frequency and contingency tables

149

The CrossTable() function has options to report percentages (row, column, and cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, and adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.

If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.

MULTIDIMENSIONAL TABLES

Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in the next listing.

Listing 7.11 Three-way contingency table

> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)

>

mytable

b Cell frequencies

,

, Improved = None

 

Sex

Treatment Female Male

Placebo 19 10

Treated 6 7

, , Improved = Some

 

Sex

 

 

Treatment

Female

Male

 

Placebo

7

0

 

Treated

5

2

 

, , Improved = Marked

 

 

Sex

 

 

Treatment

Female

Male

 

Placebo

6

1

 

Treated

16

5

 

> ftable(mytable)

 

 

 

 

Sex Female Male

Treatment

Improved

 

 

Placebo

None

19

10

 

Some

7

0

 

Marked

6

1

Treated

None

6

7

 

Some

5

2

 

Marked

16

5

c Marginal frequencies

> margin.table(mytable, 1)

150

CHAPTER 7 Basic statistics

Treatment

Placebo Treated

4341

>margin.table(mytable, 2) Sex

Female Male

5925

>margin.table(mytable, 3) Improved

None

Some Marked

 

 

 

 

d Treatment × Improved

42

14

28

 

 

 

 

> margin.table(mytable, c(1, 3))

 

 

 

 

marginal frequencies

 

 

 

 

 

 

 

 

Improved

 

 

 

 

 

 

 

 

 

 

Treatment

None Some Marked

 

 

 

 

 

 

 

 

Placebo

29

7

7

 

 

 

 

 

 

e Improved proportions

Treated

13

7

21

 

 

 

 

 

 

> ftable(prop.table(mytable, c(1, 2)))

 

 

 

 

 

for Treatment × Sex

 

 

 

 

 

 

 

 

Improved

None

Some Marked

 

 

 

Treatment

Sex

 

 

 

 

 

 

 

 

 

 

Placebo

Female

 

 

0.594

0.219

0.188

 

 

 

 

Male

 

 

0.909

0.000

0.091

 

 

 

Treated

Female

 

 

0.222

0.185

0.593

 

 

 

 

Male

 

 

0.500

0.143

0.357

 

 

 

> ftable(addmargins(prop.table(mytable, c(1,

2)), 3))

 

 

 

Improved

None

Some Marked

Sum

Treatment

Sex

 

 

 

 

 

 

 

 

 

 

Placebo

Female

 

 

0.594

0.219

0.188

1.000

 

 

 

Male

 

 

0.909

0.000

0.091

1.000

 

 

Treated

Female

 

 

0.222

0.185

0.593

1.000

 

 

 

Male

 

 

0.500

0.143

0.357

1.000

 

 

The code at b produces cell frequencies for the three-way classification. The code also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.

The code at c produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+ Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.

The code at d produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment × Sex combination is provided in e. Here you see that 36% of treated males had marked improvement, compared to 59% of treated females. In general, the proportions will add to 1 over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index.

If you want percentages instead of proportions, you can multiply the resulting table by 100. For example, this statement

ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

Соседние файлы в предмете [НЕСОРТИРОВАННОЕ]