
Robert I. Kabacoff - R in action
CHAPTER 7 Basic statistics |
You have to give R more information to find it.
Now that you know how to generate descriptive statistics for the data as a whole, let’s review how to obtain statistics for subgroups of the data.
7.1.2 Descriptive statistics by group
When comparing groups of individuals or observations, the focus is usually on the descriptive statistics of each group, rather than the total sample. Again, there are several ways to accomplish this in R. We’ll start by getting descriptive statistics for each level of transmission type.
In chapter 5, we discussed methods of aggregating data. You can use the aggregate() function (section 5.6.2) to obtain descriptive statistics by group, as shown in the following listing.
Listing 7.6 Descriptive statistics by group using aggregate()
> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
  am  mpg  hp   wt
1  0 17.1 160 3.77
2  1 24.4 127 2.41
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
  am  mpg   hp    wt
1  0 3.83 53.9 0.777
2  1 6.17 84.1 0.617
Note the use of list(am=mtcars$am). If you had used list(mtcars$am), the am column would have been labeled Group.1 rather than am. You use the assignment to provide a more useful column label. If you have more than one grouping variable, you can use code like by=list(name1=groupvar1, name2=groupvar2, … , groupvarN).
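As a rough sketch of the multiple-grouping form, the call below groups by both transmission type and number of cylinders. It assumes that vars holds the three variable names used in this section (it is redefined here so the snippet runs on its own), and byAmCyl is only an illustrative name.

```r
# Assumption: vars matches the vector defined earlier in the chapter
vars <- c("mpg", "hp", "wt")

# Naming each grouping variable labels the result columns am and cyl
byAmCyl <- aggregate(mtcars[vars],
                     by = list(am = mtcars$am, cyl = mtcars$cyl),
                     FUN = mean)
byAmCyl
```

Each row of the result corresponds to one observed combination of am and cyl, followed by the group means of mpg, hp, and wt.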
Unfortunately, aggregate() only allows you to use single value functions such as mean, standard deviation, and the like in each call. It won’t return several statistics at once. For that task, you can use the by() function. The format is
by(data, INDICES, FUN)
where data is a data frame or matrix, INDICES is a factor or list of factors that define the groups, and FUN is an arbitrary function. This next listing provides an example.
Listing 7.7 Descriptive statistics by group using by()
> dstats <- function(x)(c(mean=mean(x), sd=sd(x)))
> by(mtcars[vars], mtcars$am, dstats)
mtcars$am: 0
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  17.147  160.263    3.769    3.834   53.908    0.777
------------------------------------------------
mtcars$am: 1
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  24.392  126.846    2.411    6.167   84.062    0.617
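In recent versions of R, mean() and sd() no longer accept a whole data frame, so dstats() as written in listing 7.7 may return NA or stop with an error when by() passes it each subgroup. One possible workaround, sketched here under the assumption that vars is the same vector as before, is to compute the statistics column by column with sapply(). The name dstats2 is hypothetical.

```r
# Assumption: vars matches the vector defined earlier in the chapter
vars <- c("mpg", "hp", "wt")

# Compute mean and sd for each column of the subgroup data frame
dstats2 <- function(x) sapply(x, function(col) c(mean = mean(col), sd = sd(col)))

# One 2 x 3 matrix (rows mean/sd, columns mpg/hp/wt) per transmission type
res <- by(mtcars[vars], mtcars$am, dstats2)
res
```

The values match those in listing 7.7, but they are arranged as a small matrix for each group rather than a single named vector.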

EXTENSIONS
The doBy package and the psych package also provide functions for descriptive statistics by group. Again, they aren’t distributed in the base installation and must be installed before first use. The summaryBy() function in the doBy package has the format
summaryBy(formula, data=dataframe, FUN=function)
where the formula takes the form
var1 + var2 + var3 + ... + varN ~ groupvar1 + groupvar2 + … + groupvarN
Variables on the left of the ~ are the numeric variables to be analyzed, and variables on the right are categorical grouping variables. The function can be any built-in or user-created R function. An example using the mystats() function you created in section 7.1.1 is shown in the following listing.
Listing 7.8 Summary statistics by group using summaryBy() in the doBy package
> library(doBy)
> summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)
  am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev
1  0    19     17.1      3.83   0.0140       -0.803   19     160     53.9
2  1    13     24.4      6.17   0.0526       -1.455   13     127     84.1
  hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1 -0.0142      -1.210   19    3.77    0.777   0.976       0.142
2  1.3599       0.563   13    2.41    0.617   0.210      -1.174
The describe.by() function contained in the psych package (named describeBy() in later releases of psych) provides the same descriptive statistics as describe(), stratified by one or more grouping variables, as you can see in the following listing.
Listing 7.9 Summary statistics by group using describe.by() in the psych package
> library(psych)
> describe.by(mtcars[vars], mtcars$am)
group: 0
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40
hp    2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00
wt    3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42
     range  skew kurtosis    se
mpg  14.00  0.01    -0.80  0.88
hp  183.00 -0.01    -1.21 12.37
wt    2.96  0.98     0.14  0.18
------------------------------------------------
group: 1
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90
hp    2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00
wt    3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57
     range  skew kurtosis    se
mpg  18.90  0.05    -1.46  1.71
hp  283.00  1.36     0.56 23.31
wt    2.06  0.21    -1.17  0.17
Unlike the previous example, the describe.by() function doesn’t allow you to specify an arbitrary function, so it’s less generally applicable. If there’s more than one grouping variable, you can write them as list(groupvar1, groupvar2, … , groupvarN). But this will only work if there are no empty cells when the grouping variables are crossed.
Finally, you can use the reshape package described in section 5.6.3 to derive descriptive statistics by group in a flexible way. (If you haven’t read that section, I suggest you review it before continuing.) First, you melt the data frame using
dfm <- melt(dataframe, measure.vars=y, id.vars=g)
where dataframe contains the data, y is a vector indicating the numeric variables to be summarized (the default is to use all), and g is a vector of one or more grouping variables. You then cast the data using
cast(dfm, groupvar1 + groupvar2 + … + variable ~ ., FUN)
where the grouping variables are separated by + signs, the word variable is entered exactly as is, and FUN is an arbitrary function.
In the final example of this section, we’ll apply the reshape approach to obtaining descriptive statistics for each subgroup formed by transmission type and number of cylinders. For descriptive statistics, we’ll get the sample size, mean, and standard deviation. The code and results are shown in the following listing.
Listing 7.10 Summary statistics by group via the reshape package
> library(reshape)
> dstats <- function(x)(c(n=length(x), mean=mean(x), sd=sd(x)))
> dfm <- melt(mtcars, measure.vars=c("mpg", "hp", "wt"),
              id.vars=c("am", "cyl"))
> cast(dfm, am + cyl + variable ~ ., dstats)
   am cyl variable  n   mean     sd
1   0   4      mpg  3  22.90  1.453
2   0   4       hp  3  84.67 19.655
3   0   4       wt  3   2.94  0.408
4   0   6      mpg  4  19.12  1.632
5   0   6       hp  4 115.25  9.179
6   0   6       wt  4   3.39  0.116
7   0   8      mpg 12  15.05  2.774
8   0   8       hp 12 194.17 33.360
9   0   8       wt 12   4.10  0.768
10  1   4      mpg  8  28.07  4.484
11  1   4       hp  8  81.88 22.655
12  1   4       wt  8   2.04  0.409
13  1   6      mpg  3  20.57  0.751
14  1   6       hp  3 131.67 37.528
15  1   6       wt  3   2.75  0.128
16  1   8      mpg  2  15.40  0.566
17  1   8       hp  2 299.50 50.205
18  1   8       wt  2   3.37  0.283
Personally, I find this approach the most compact and appealing. Data analysts have their own preferences for which descriptive statistics to display and how they like to see them formatted. This is probably why there are many variations available. Choose the one that works best for you, or create your own!
7.1.3 Visualizing results
Numerical summaries of a distribution’s characteristics are important, but they’re no substitute for a visual representation. For quantitative variables you have histograms (section 6.3), density plots (section 6.4), box plots (section 6.5), and dot plots (section 6.6). They can provide insights that are easily missed by reliance on a small set of descriptive statistics.
The functions considered so far provide summaries of quantitative variables. The functions in the next section allow you to examine the distributions of categorical variables.
7.2 Frequency and contingency tables
In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.
The data for this section come from the Arthritis dataset included with the vcd package. The data are from Koch & Edwards (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:
> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked
Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, we’ll create frequency and contingency tables (cross-classifications) from the data.

7.2.1 Generating frequency tables
R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.
Table 7.1 Functions for creating and manipulating contingency tables
Function                     | Description
table(var1, var2, …, varN)   | Creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data)         | Creates an N-way contingency table based on a formula and a matrix or data frame
prop.table(table, margins)   | Expresses table entries as fractions of the marginal table defined by the margins
margin.table(table, margins) | Computes the sum of table entries for a marginal table defined by the margins
addmargins(table, margins)   | Puts summary margins (sums by default) on a table
ftable(table)                | Creates a compact "flat" contingency table
In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or the xtabs() function, then manipulate it using the other functions.
ONE-WAY TABLES
You can generate simple frequency counts using the table() function. Here’s an example:
> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28
You can turn these frequencies into proportions with prop.table():
> prop.table(mytable)
Improved
  None   Some Marked
 0.500  0.167  0.333
or into percentages, using prop.table()*100:
> prop.table(mytable)*100
Improved
  None   Some Marked
  50.0   16.7   33.3
Here you can see that 50 percent of study participants had some or marked improvement (16.7 + 33.3).
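The one-way workflow can be sketched without loading vcd by rebuilding the Improved variable from the counts printed above. This is an assumption made purely for illustration (the real data come from Arthritis), and the name improved is hypothetical. Setting the factor levels explicitly keeps the categories in the order None, Some, Marked rather than alphabetical order.

```r
# Assumption: rebuild the variable from the counts shown above (42, 14, 28)
improved <- factor(rep(c("None", "Some", "Marked"), times = c(42, 14, 28)),
                   levels = c("None", "Some", "Marked"))

mytable <- table(improved)                   # frequency counts
pct <- round(prop.table(mytable) * 100, 1)   # percentages to one decimal place
pct
```

Rounding is applied only for display. The unrounded proportions are still available from prop.table(mytable).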
TWO-WAY TABLES
For two-way tables, the format for the table() function is
mytable <- table(A, B)
where A is the row variable, and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula style input. The format is
mytable <- xtabs(~ A + B, data=mydata)
where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).
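To illustrate the case where the data have already been tabulated, the sketch below places a frequency column on the left side of the formula. The data frame tabulated and its Freq column are hypothetical names. The counts are the Treatment by Improved cell counts for the Arthritis data used in this section.

```r
# Hypothetical pre-tabulated data: one row per cell, counts stored in Freq
tabulated <- data.frame(
  Treatment = rep(c("Placebo", "Treated"), each = 3),
  Improved  = factor(rep(c("None", "Some", "Marked"), times = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 7, 7, 13, 7, 21)
)

# Freq on the left of ~ is treated as a vector of cell frequencies
mytable <- xtabs(Freq ~ Treatment + Improved, data = tabulated)
mytable
```

The result is an ordinary contingency table, so it can be passed to prop.table(), margin.table(), and addmargins() like any other.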
For the Arthritis data, you have
> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have
> margin.table(mytable, 1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512
The index (1) refers to the first variable in the xtabs() formula (Treatment). Looking at the table, you can see that 51 percent of treated individuals had marked improvement, compared to 16 percent of those receiving a placebo.
For column sums and column proportions, you have
> margin.table(mytable, 2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable, 2)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
Here, the index (2) refers to the second variable in the xtabs() formula (Improved). Cell proportions are obtained with this statement:
> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500
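The effect of the margin index can be checked directly. The sketch below rebuilds the two-way table from the printed counts, an assumption that lets it run without the vcd package, and confirms that row proportions sum to 1 across each row and column proportions sum to 1 down each column. The names counts, rowProps, and colProps are illustrative.

```r
# Assumption: cell counts copied from the Treatment x Improved table above
counts <- matrix(c(29, 7,  7,
                   13, 7, 21),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Treatment = c("Placebo", "Treated"),
                                 Improved  = c("None", "Some", "Marked")))
mytable <- as.table(counts)

rowProps <- prop.table(mytable, 1)   # proportions within each Treatment
colProps <- prop.table(mytable, 2)   # proportions within each Improved level

margin.table(rowProps, 1)            # every row total is 1
margin.table(colProps, 2)            # every column total is 1
```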
You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a sum row and column:
> addmargins(mytable)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000
When using addmargins(), the default is to create sum margins for all variables in a table. In contrast:
> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000
adds a sum column alone. Similarly,
> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
  Sum     1.000 1.000  1.000
adds a sum row. In the table, you see that 25 percent of those patients with marked improvement received a placebo.
NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".
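A small sketch of the difference follows, using a hypothetical vector named response that contains one missing value.

```r
# Hypothetical factor with one missing value
response <- factor(c("None", "Some", NA, "Marked", "None"))

table(response)                    # NA is dropped: counts total 4
table(response, useNA = "ifany")   # NA is counted as its own category: total 5
```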
A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. See listing 7.11 for an example.

Listing 7.11 Two-way table using CrossTable
> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|
The CrossTable() function has options to report percentages (row, column, cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.
If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.
MULTIDIMENSIONAL TABLES
Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in listing 7.12.

Listing 7.12 Three-way contingency table
> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> mytable                                          # (1) Cell frequencies
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
> margin.table(mytable, 1)                         # (2) Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable, c(1, 3))                   # (3) Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable, c(1, 2)))             # (4) Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000
The code in (1) produces cell frequencies for the three-way classification. The code also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.
The code in (2) produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.
The code in (3) produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment x Sex combination is provided in (4). Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index.
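The claim that proportions add to one over the omitted index can be verified with a short sketch. It rebuilds the three-way table from the cell counts printed in listing 7.12, an assumption that lets it run without the vcd package. The names counts, props, and totals are illustrative.

```r
# Assumption: cell counts copied from listing 7.12
# (within each slice, Treatment varies fastest, then Sex)
counts <- array(c(19,  6, 10, 7,    # Improved = None
                   7,  5,  0, 2,    # Improved = Some
                   6, 16,  1, 5),   # Improved = Marked
                dim = c(2, 2, 3),
                dimnames = list(Treatment = c("Placebo", "Treated"),
                                Sex       = c("Female", "Male"),
                                Improved  = c("None", "Some", "Marked")))
mytable <- as.table(counts)

# Proportions within each Treatment x Sex combination
props <- prop.table(mytable, c(1, 2))

# Summing over the omitted index (3, Improved) gives 1 for every combination
totals <- apply(props, c(1, 2), sum)
totals
```

Choosing a different margin in prop.table(), such as c(1, 3), changes which index the proportions sum over.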
If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:
ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100
would produce this table:
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9   18.8 100.0
          Male             90.9   0.0    9.1 100.0
Treated   Female           22.2  18.5   59.3 100.0
          Male             50.0  14.3   35.7 100.0
While contingency tables tell you the frequency or proportions of cases for each combination of the variables that comprise the table, you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.