Robert I. Kabacoff - R in action


146	CHAPTER 7 Basic statistics

You have to give R more information to find it.

Now that you know how to generate descriptive statistics for the data as a whole, let’s review how to obtain statistics for subgroups of the data.

7.1.2 Descriptive statistics by group

When comparing groups of individuals or observations, the focus is usually on the descriptive statistics of each group, rather than the total sample. Again, there are several ways to accomplish this in R. We’ll start by getting descriptive statistics for each level of transmission type.

In chapter 5, we discussed methods of aggregating data. You can use the aggregate() function (section 5.6.2) to obtain descriptive statistics by group, as shown in the following listing.

Listing 7.6 Descriptive statistics by group using aggregate()

> aggregate(mtcars[vars], by=list(am=mtcars$am), mean)
  am  mpg  hp   wt
1  0 17.1 160 3.77
2  1 24.4 127 2.41
> aggregate(mtcars[vars], by=list(am=mtcars$am), sd)
  am  mpg   hp    wt
1  0 3.83 53.9 0.777
2  1 6.17 84.1 0.617

Note the use of list(am=mtcars$am). If you had used list(mtcars$am), the am column would have been labeled Group.1 rather than am. You use the assignment to provide a more useful column label. If you have more than one grouping variable, you can use code like by=list(name1=groupvar1, name2=groupvar2, … , nameN=groupvarN).
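As a minimal sketch of the multiple-grouping form, the call below groups by both transmission type and number of cylinders. It assumes vars holds the three variables used in the listings of this section (mpg, hp, wt); the labels am and cyl are simply the column names chosen for the output.

```r
# vars is assumed to match the variables used in the surrounding listings
vars <- c("mpg", "hp", "wt")

# Name each grouping variable so the output columns are labeled am and cyl
means <- aggregate(mtcars[vars],
                   by = list(am = mtcars$am, cyl = mtcars$cyl),
                   FUN = mean)
means
```

The result has one row per am x cyl combination that occurs in the data.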

Unfortunately, aggregate() only allows you to use single value functions such as mean, standard deviation, and the like in each call. It won’t return several statistics at once. For that task, you can use the by() function. The format is

by(data, INDICES, FUN)

where data is a data frame or matrix, INDICES is a factor or list of factors that define the groups, and FUN is an arbitrary function. This next listing provides an example.

Listing 7.7 Descriptive statistics by group using by()

> # mean() and sd() no longer accept a whole data frame in current versions of R,
> # so dstats() applies them column by column
> dstats <- function(x)(c(mean=sapply(x, mean), sd=sapply(x, sd)))
> by(mtcars[vars], mtcars$am, dstats)
mtcars$am: 0
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  17.147  160.263    3.769    3.834   53.908    0.777
------------------------------------------------
mtcars$am: 1
mean.mpg  mean.hp  mean.wt   sd.mpg    sd.hp    sd.wt
  24.392  126.846    2.411    6.167   84.062    0.617
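Because FUN receives the whole data-frame subset for each group, it can return any structure, not just a flat vector. A sketch of one variation follows; colstats() is a hypothetical helper (not from the text) that returns a small matrix with one column per variable, and vars is assumed to be the same three variables as above.

```r
vars <- c("mpg", "hp", "wt")   # assumed, as in the surrounding listings

# colstats() is a hypothetical helper: mean and sd for every column of x
colstats <- function(x) {
  sapply(x, function(col) c(mean = mean(col), sd = sd(col)))
}

res <- by(mtcars[vars], mtcars$am, colstats)
res[["0"]]   # 2 x 3 matrix of statistics for the am = 0 group
```

Each group's result can then be pulled out by the group label, as res[["0"]] does here.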

Descriptive statistics

147

EXTENSIONS

The doBy package and the psych package also provide functions for descriptive statistics by group. Again, they aren’t distributed in the base installation and must be installed before first use. The summaryBy() function in the doBy package has the format

summaryBy(formula, data=dataframe, FUN=function)

where the formula takes the form

var1 + var2 + var3 + ... + varN ~ groupvar1 + groupvar2 + … + groupvarN

Variables on the left of the ~ are the numeric variables to be analyzed, and variables on the right are categorical grouping variables. The function can be any built-in or user-created R function. An example using the mystats() function you created in section 7.1.1 is shown in the following listing.

Listing 7.8 Summary statistics by group using summaryBy() in the doBy package

> library(doBy)
> summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats)
  am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev
1  0    19     17.1      3.83   0.0140       -0.803   19     160     53.9
2  1    13     24.4      6.17   0.0526       -1.455   13     127     84.1
  hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis
1 -0.0142      -1.210   19    3.77    0.777   0.976       0.142
2  1.3599       0.563   13    2.41    0.617   0.210      -1.174
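The listing relies on mystats(), which is defined earlier in the chapter and not reproduced here. The sketch below is an assumed definition: the moment-based skew and kurtosis formulas are a guess chosen to match the column suffixes in the output (n, mean, stdev, skew, kurtosis), not a copy of the author's code. It is checked with base R only, so doBy is not needed to try it.

```r
# Assumed stand-in for mystats(); the real definition lives in an earlier
# section, so the skew/kurtosis formulas here are an illustration only.
mystats <- function(x, na.omit = FALSE) {
  if (na.omit) x <- x[!is.na(x)]
  m <- mean(x)
  n <- length(x)
  s <- sd(x)
  skew <- sum((x - m)^3 / s^3) / n
  kurt <- sum((x - m)^4 / s^4) / n - 3
  c(n = n, mean = m, stdev = s, skew = skew, kurtosis = kurt)
}

# Base-R check for mpg: one column of statistics per level of am
stats_mpg <- sapply(split(mtcars$mpg, mtcars$am), mystats)
round(stats_mpg, 3)
```

With any function of this shape in scope, the summaryBy() call above appends each returned name to the variable name, which is where labels such as mpg.n and mpg.stdev come from.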

The describe.by() function contained in the psych package (renamed describeBy() in later releases of psych) provides the same descriptive statistics as describe(), stratified by one or more grouping variables, as you can see in the following listing.

Listing 7.9 Summary statistics by group using describe.by() in the psych package

> library(psych)
> describe.by(mtcars[vars], mtcars$am)
group: 0
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 19  17.15  3.83  17.30   17.12  3.11 10.40  24.40
hp    2 19 160.26 53.91 175.00  161.06 77.10 62.00 245.00
wt    3 19   3.77  0.78   3.52    3.75  0.45  2.46   5.42
     range  skew kurtosis    se
mpg  14.00  0.01    -0.80  0.88
hp  183.00 -0.01    -1.21 12.37
wt    2.96  0.98     0.14  0.18
------------------------------------------------
group: 1
    var  n   mean    sd median trimmed   mad   min    max
mpg   1 13  24.39  6.17  22.80   24.38  6.67 15.00  33.90
hp    2 13 126.85 84.06 109.00  114.73 63.75 52.00 335.00
wt    3 13   2.41  0.62   2.32    2.39  0.68  1.51   3.57
     range  skew kurtosis    se
mpg  18.90  0.05    -1.46  1.71
hp  283.00  1.36     0.56 23.31
wt    2.06  0.21    -1.17  0.17

Unlike the previous example, the describe.by() function doesn’t allow you to specify an arbitrary function, so it’s less generally applicable. If there’s more than one grouping variable, you can write them as list(groupvar1, groupvar2, … , groupvarN). But this will only work if there are no empty cells when the grouping variables are crossed.

Finally, you can use the reshape package described in section 5.6.3 to derive descriptive statistics by group in a flexible way. (If you haven’t read that section, I suggest you review it before continuing.) First, you melt the data frame using

dfm <- melt(dataframe, measure.vars=y, id.vars=g)

where dataframe contains the data, y is a vector indicating the numeric variables to be summarized (the default is to use all), and g is a vector of one or more grouping variables. You then cast the data using

cast(dfm, groupvar1 + groupvar2 + … + variable ~ ., FUN)

where the grouping variables are separated by + signs, the word variable is entered exactly as is, and FUN is an arbitrary function.

In the final example of this section, we’ll apply the reshape approach to obtaining descriptive statistics for each subgroup formed by transmission type and number of cylinders. For descriptive statistics, we’ll get the sample size, mean, and standard deviation. The code and results are shown in the following listing.

Listing 7.10 Summary statistics by group via the reshape package

> library(reshape)
> dstats <- function(x)(c(n=length(x), mean=mean(x), sd=sd(x)))
> dfm <- melt(mtcars, measure.vars=c("mpg", "hp", "wt"),
              id.vars=c("am", "cyl"))
> cast(dfm, am + cyl + variable ~ ., dstats)
   am cyl variable  n   mean     sd
1   0   4      mpg  3  22.90  1.453
2   0   4       hp  3  84.67 19.655
3   0   4       wt  3   2.94  0.408
4   0   6      mpg  4  19.12  1.632
5   0   6       hp  4 115.25  9.179
6   0   6       wt  4   3.39  0.116
7   0   8      mpg 12  15.05  2.774
8   0   8       hp 12 194.17 33.360
9   0   8       wt 12   4.10  0.768
10  1   4      mpg  8  28.07  4.484
11  1   4       hp  8  81.88 22.655
12  1   4       wt  8   2.04  0.409
13  1   6      mpg  3  20.57  0.751
14  1   6       hp  3 131.67 37.528
15  1   6       wt  3   2.75  0.128
16  1   8      mpg  2  15.40  0.566
17  1   8       hp  2 299.50 50.205
18  1   8       wt  2   3.37  0.283

Personally, I find this approach the most compact and appealing. Data analysts have their own preferences for which descriptive statistics to display and how they like to see them formatted. This is probably why there are many variations available. Choose the one that works best for you, or create your own!

7.1.3 Visualizing results

Numerical summaries of a distribution’s characteristics are important, but they’re no substitute for a visual representation. For quantitative variables you have histograms (section 6.3), density plots (section 6.4), box plots (section 6.5), and dot plots (section 6.6). They can provide insights that are easily missed by reliance on a small set of descriptive statistics.

The functions considered so far provide summaries of quantitative variables. The functions in the next section allow you to examine the distributions of categorical variables.

7.2 Frequency and contingency tables

In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables.

The data for this section come from the Arthritis dataset included with the vcd package. The data are from Koch & Edwards (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations:

> library(vcd)
> head(Arthritis)
  ID Treatment  Sex Age Improved
1 57   Treated Male  27     Some
2 46   Treated Male  29     None
3 77   Treated Male  30     None
4 17   Treated Male  32   Marked
5 36   Treated Male  46   Marked
6 23   Treated Male  58   Marked

Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, we’ll create frequency and contingency tables (cross-classifications) from the data.


7.2.1 Generating frequency tables

R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1.

Table 7.1 Functions for creating and manipulating contingency tables

Function                        Description

table(var1, var2, …, varN)      Creates an N-way contingency table from N categorical variables (factors)
xtabs(formula, data)            Creates an N-way contingency table based on a formula and a matrix or data frame
prop.table(table, margins)      Expresses table entries as fractions of the marginal table defined by the margins
margin.table(table, margins)    Computes the sum of table entries for a marginal table defined by the margins
addmargins(table, margins)      Puts summary margins (sums by default) on a table
ftable(table)                   Creates a compact "flat" contingency table

In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or the xtabs() function, then manipulate it using the other functions.

ONE-WAY TABLES

You can generate simple frequency counts using the table() function. Here’s an example:

> mytable <- with(Arthritis, table(Improved))
> mytable
Improved
  None   Some Marked
    42     14     28

You can turn these frequencies into proportions with prop.table():

> prop.table(mytable)
Improved
  None   Some Marked
 0.500  0.167  0.333

or into percentages, using prop.table()*100:

> prop.table(mytable)*100
Improved
  None   Some Marked
  50.0   16.7   33.3


Here you can see that 50 percent of study participants had some or marked improvement (16.7 + 33.3).
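The one-way results can be reproduced without the vcd package by rebuilding the Improved variable from the counts reported above (42, 14, 28). This is a minimal sketch; the object names improved, counts, and pct are illustrative.

```r
# Rebuild the Improved variable from the reported counts (42, 14, 28)
improved <- factor(rep(c("None", "Some", "Marked"), times = c(42, 14, 28)),
                   levels = c("None", "Some", "Marked"))

counts <- table(improved)
pct <- round(prop.table(counts) * 100, 1)   # percentages, one decimal place
pct

# Share of participants with some or marked improvement
sum(pct[c("Some", "Marked")])
```

Setting the factor levels explicitly keeps the categories in the None, Some, Marked order instead of alphabetical order.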

TWO-WAY TABLES

For two-way tables, the format for the table() function is

mytable <- table(A, B)

where A is the row variable, and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula style input. The format is

mytable <- xtabs(~ A + B, data=mydata)

where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated).
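The left-hand-side form is easiest to see with data that are already tabulated. The sketch below uses the Treatment by Improved counts reported for the Arthritis data in this section; the data frame name tabulated and the column name Freq are assumptions made for the example.

```r
# Pre-tabulated counts: one row per Treatment x Improved combination
tabulated <- data.frame(
  Treatment = factor(rep(c("Placebo", "Treated"), each = 3)),
  Improved  = factor(rep(c("None", "Some", "Marked"), times = 2),
                     levels = c("None", "Some", "Marked")),
  Freq      = c(29, 7, 7, 13, 7, 21)
)

# Freq on the left of ~ is treated as a vector of counts, not as a factor
mytable <- xtabs(Freq ~ Treatment + Improved, data = tabulated)
mytable
```

The resulting table has the same shape as one built from the raw observations with xtabs(~ Treatment + Improved, ...).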

For the Arthritis data, you have

> mytable <- xtabs(~ Treatment+Improved, data=Arthritis)
> mytable
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21

You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have

> margin.table(mytable, 1)
Treatment
Placebo Treated
     43      41
> prop.table(mytable, 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.674 0.163  0.163
  Treated 0.317 0.171  0.512

The index (1) refers to the first variable in the table (here, Treatment). Looking at the table, you can see that 51 percent of treated individuals had marked improvement, compared to 16 percent of those receiving a placebo.

For column sums and column proportions, you have

> margin.table(mytable, 2)
Improved
  None   Some Marked
    42     14     28
> prop.table(mytable, 2)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750

Here, the index (2) refers to the second variable in the table (here, Improved). Cell proportions are obtained with this statement:

> prop.table(mytable)
         Improved
Treatment   None   Some Marked
  Placebo 0.3452 0.0833 0.0833
  Treated 0.1548 0.0833 0.2500
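To see what the margin index does, the three kinds of proportions can be computed by hand and compared with prop.table(). This sketch rebuilds the two-way table from the counts shown above so that it runs without vcd; tab and the *_prop names are illustrative.

```r
# Treatment x Improved counts from the text, filled column by column
tab <- as.table(matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

row_prop  <- sweep(tab, 1, rowSums(tab), "/")   # same as prop.table(tab, 1)
col_prop  <- sweep(tab, 2, colSums(tab), "/")   # same as prop.table(tab, 2)
cell_prop <- tab / sum(tab)                     # same as prop.table(tab)

round(row_prop, 3)
```

Index 1 divides each cell by its row total, index 2 by its column total, and omitting the index divides by the grand total.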

You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a sum row and column:

> addmargins(mytable)
         Improved
Treatment None Some Marked Sum
  Placebo   29    7      7  43
  Treated   13    7     21  41
  Sum       42   14     28  84
> addmargins(prop.table(mytable))
         Improved
Treatment   None   Some Marked    Sum
  Placebo 0.3452 0.0833 0.0833 0.5119
  Treated 0.1548 0.0833 0.2500 0.4881
  Sum     0.5000 0.1667 0.3333 1.0000

When using addmargins(), the default is to create sum margins for all variables in a table. In contrast:

> addmargins(prop.table(mytable, 1), 2)
         Improved
Treatment  None  Some Marked   Sum
  Placebo 0.674 0.163  0.163 1.000
  Treated 0.317 0.171  0.512 1.000

adds a sum column alone. Similarly,

> addmargins(prop.table(mytable, 2), 1)
         Improved
Treatment  None  Some Marked
  Placebo 0.690 0.500  0.250
  Treated 0.310 0.500  0.750
  Sum     1.000 1.000  1.000

adds a sum row. In the table, you see that 25 percent of those patients with marked improvement received a placebo.
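The effect of the second argument to addmargins() is easiest to confirm from the dimensions of the result. A short sketch on the raw counts, with the table rebuilt by hand so it runs without vcd (object names are illustrative):

```r
tab <- as.table(matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

both     <- addmargins(tab)      # sum row and sum column
col_only <- addmargins(tab, 2)   # extends dimension 2: adds a Sum column
row_only <- addmargins(tab, 1)   # extends dimension 1: adds a Sum row

dim(both)       # 3 rows, 4 columns
dim(col_only)   # 2 rows, 4 columns
dim(row_only)   # 3 rows, 3 columns
```

The index names the dimension that gains the extra Sum entry, which is why index 2 yields a sum column and index 1 a sum row.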

NOTE The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany".
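A small illustration of the useNA option, using a made-up vector (the values are invented for this example and are not from the Arthritis data):

```r
# Illustrative data only: five responses, one of them missing
response <- c("None", "Some", NA, "None", "Marked")

default_counts <- table(response)                    # NA is silently dropped
with_na        <- table(response, useNA = "ifany")   # NA gets its own category

default_counts
with_na
```

The default table accounts for four observations, while the second accounts for all five.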

A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. See listing 7.11 for an example.


Listing 7.11 Two-way table using CrossTable

> library(gmodels)
> CrossTable(Arthritis$Treatment, Arthritis$Improved)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table:  84

                    | Arthritis$Improved
Arthritis$Treatment |      None |      Some |    Marked | Row Total |
--------------------|-----------|-----------|-----------|-----------|
            Placebo |        29 |         7 |         7 |        43 |
                    |     2.616 |     0.004 |     3.752 |           |
                    |     0.674 |     0.163 |     0.163 |     0.512 |
                    |     0.690 |     0.500 |     0.250 |           |
                    |     0.345 |     0.083 |     0.083 |           |
--------------------|-----------|-----------|-----------|-----------|
            Treated |        13 |         7 |        21 |        41 |
                    |     2.744 |     0.004 |     3.935 |           |
                    |     0.317 |     0.171 |     0.512 |     0.488 |
                    |     0.310 |     0.500 |     0.750 |           |
                    |     0.155 |     0.083 |     0.250 |           |
--------------------|-----------|-----------|-----------|-----------|
       Column Total |        42 |        14 |        28 |        84 |
                    |     0.500 |     0.167 |     0.333 |           |
--------------------|-----------|-----------|-----------|-----------|

The CrossTable() function has options to report percentages (row, column, cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details.
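The "Chi-square contribution" line in each cell of the CrossTable() output can be reproduced in base R. This sketch computes the expected counts under independence and the per-cell contributions for the Treatment by Improved counts; the table is rebuilt by hand so that neither vcd nor gmodels is required, and the object names are illustrative.

```r
tab <- as.table(matrix(c(29, 13, 7, 7, 7, 21), nrow = 2,
                       dimnames = list(Treatment = c("Placebo", "Treated"),
                                       Improved  = c("None", "Some", "Marked"))))

# Expected count in each cell if Treatment and Improved were independent
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Per-cell contribution to the Pearson chi-square statistic
contrib <- (tab - expected)^2 / expected
round(contrib, 3)

sum(contrib)   # the Pearson chi-square statistic for the whole table
```

Large contributions (here the None and Marked columns) show which cells depart most from what independence would predict.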

If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next.

MULTIDIMENSIONAL TABLES

Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in listing 7.12.


Listing 7.12 Three-way contingency table

> mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis)
> mytable                                 # (1) Cell frequencies
, , Improved = None

         Sex
Treatment Female Male
  Placebo     19   10
  Treated      6    7

, , Improved = Some

         Sex
Treatment Female Male
  Placebo      7    0
  Treated      5    2

, , Improved = Marked

         Sex
Treatment Female Male
  Placebo      6    1
  Treated     16    5

> ftable(mytable)
                 Improved None Some Marked
Treatment Sex
Placebo   Female            19    7      6
          Male              10    0      1
Treated   Female             6    5     16
          Male               7    2      5
> margin.table(mytable, 1)                # (2) Marginal frequencies
Treatment
Placebo Treated
     43      41
> margin.table(mytable, 2)
Sex
Female   Male
    59     25
> margin.table(mytable, 3)
Improved
  None   Some Marked
    42     14     28
> margin.table(mytable, c(1, 3))          # (3) Treatment x Improved marginal frequencies
         Improved
Treatment None Some Marked
  Placebo   29    7      7
  Treated   13    7     21
> ftable(prop.table(mytable, c(1, 2)))    # (4) Improved proportions for Treatment x Sex
                 Improved  None  Some Marked
Treatment Sex
Placebo   Female          0.594 0.219  0.188
          Male            0.909 0.000  0.091
Treated   Female          0.222 0.185  0.593
          Male            0.500 0.143  0.357
> ftable(addmargins(prop.table(mytable, c(1, 2)), 3))
                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female          0.594 0.219  0.188 1.000
          Male            0.909 0.000  0.091 1.000
Treated   Female          0.222 0.185  0.593 1.000
          Male            0.500 0.143  0.357 1.000

The code in (1), which prints mytable and then ftable(mytable), produces cell frequencies for the three-way classification. The code also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table.

The code in (2), the three single-index margin.table() calls, produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3.

The code in (3), margin.table(mytable, c(1, 3)), produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment x Sex combination is provided in (4), the ftable(prop.table(mytable, c(1, 2))) call. Here you see that 36 percent of treated males had marked improvement, compared to 59 percent of treated females. In general, the proportions will add to one over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index.
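The claim that the proportions add to one over the omitted index can be checked directly. The sketch below rebuilds the three-way table from the cell frequencies in the listing (so it runs without vcd) and sums the proportions over Improved; the array layout and object names are assumptions of this example.

```r
# Cell frequencies from the listing; Treatment varies fastest, then Sex
mytable <- as.table(array(
  c(19,  6, 10, 7,    # Improved = None:   (Placebo, Treated) x (Female, Male)
     7,  5,  0, 2,    # Improved = Some
     6, 16,  1, 5),   # Improved = Marked
  dim = c(2, 2, 3),
  dimnames = list(Treatment = c("Placebo", "Treated"),
                  Sex       = c("Female", "Male"),
                  Improved  = c("None", "Some", "Marked"))))

p <- prop.table(mytable, c(1, 2))   # proportions within each Treatment x Sex cell
apply(p, c(1, 2), sum)              # every entry is 1: the sums run over Improved
margin.table(mytable, c(1, 3))      # Treatment x Improved counts, summed over Sex
```

Listing indices 1 and 2 in prop.table() fixes Treatment and Sex, so the only dimension left to sum over is the third one.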

If you want percentages instead of proportions, you could multiply the resulting table by 100. For example:

ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100

would produce this table:

                 Improved  None  Some Marked   Sum
Treatment Sex
Placebo   Female           59.4  21.9  18.8 100.0
          Male             90.9   0.0   9.1 100.0
Treated   Female           22.2  18.5  59.3 100.0
          Male             50.0  14.3  35.7 100.0

While contingency tables tell you the frequency or proportions of cases for each combination of the variables that comprise the table, you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section.
