Robert I. Kabacoff - R in action


216	CHAPTER 8 Regression

A new method called relative weights shows significant promise. The method closely approximates the average increase in R-square obtained by adding a predictor variable across all possible submodels (Johnson, 2004; Johnson and LeBreton, 2004; LeBreton and Tonidandel, 2008). A function for generating relative weights is provided in the next listing.

Listing 8.16 relweights() function for calculating relative importance of predictors

relweights <- function(fit,...){
  R <- cor(fit$model)
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]
  rxy <- R[2:nvar, 1]
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  import <- (rawwgt / rsquare) * 100
  lbls <- names(fit$model[2:nvar])
  rownames(import) <- lbls
  colnames(import) <- "Weights"
  barplot(t(import), names.arg=lbls,
          ylab="% of R-Square",
          xlab="Predictor Variables",
          main="Relative Importance of Predictor Variables",
          sub=paste("R-Square=", round(rsquare, digits=3)),
          ...)
  return(import)
}

NOTE The code in listing 8.16 is adapted from an SPSS program generously provided by Dr. Johnson. See Johnson (2000, Multivariate Behavioral Research, 35, 1–19) for an explanation of how the relative weights are derived.

In listing 8.17 the relweights() function is applied to the states data with murder rate predicted by the population, illiteracy, income, and temperature.

You can see from figure 8.19 that the total amount of variance accounted for by the model (R-square=0.567) has been divided among the predictor variables. Illiteracy accounts for 59 percent of the R-square, Frost accounts for 20.79 percent, and so forth. Based on the method of relative weights, Illiteracy has the greatest relative importance, followed by Frost, Population, and Income, in that order.

Taking the analysis further


Listing 8.17 Applying the relweights() function

> fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
> relweights(fit, col="lightgrey")
           Weights
Population   14.72
Illiteracy   59.00
Income        5.49
Frost        20.79

Relative importance measures (and in particular, the method of relative weights) have wide applicability. They come much closer to our intuitive conception of relative importance than standardized regression coefficients do, and I expect to see their use increase dramatically in coming years.
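Two properties of the relative weights are easy to check numerically: the percentages sum to 100, and the R-square computed inside the function matches the one reported by lm(). The following is a minimal sketch that repeats the arithmetic of relweights() without the bar plot. It assumes that states is built from the built-in state.x77 matrix; the variable names rsq and weights are introduced here for illustration only.

```r
# Assumption: 'states' is rebuilt here from the built-in state.x77 matrix
states <- as.data.frame(state.x77[, c("Murder", "Population",
                                      "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = states)

# Same arithmetic as relweights(), without the bar plot
R      <- cor(fit$model)
nvar   <- ncol(R)
rxx    <- R[2:nvar, 2:nvar]          # predictor intercorrelations
rxy    <- R[2:nvar, 1]               # predictor-response correlations
e      <- eigen(rxx)
lambda <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
beta   <- solve(lambda) %*% rxy
rsq    <- sum(beta^2)
weights <- (lambda^2 %*% beta^2) / rsq * 100
rownames(weights) <- names(fit$model)[2:nvar]

sum(weights)                             # percentages add up to 100
all.equal(rsq, summary(fit)$r.squared)   # rsq reproduces the model R-square
```

Because the weights are expressed as percentages of R-square, they describe how the explained variance is shared among the predictors, not how much variance the model explains overall.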

Figure 8.19 Bar plot of relative weights for the states multiple regression problem (x-axis: Predictor Variables; y-axis: % of R-Square; R-Square = 0.567)


8.8 Summary

Regression analysis is a term that covers a broad range of methodologies in statistics. You’ve seen that it’s a highly interactive approach that involves fitting models, assessing their fit to statistical assumptions, modifying both the data and the models, and refitting to arrive at a final result. In many ways, this final result is based on art and skill as much as science.

This has been a long chapter, because regression analysis is a process with many parts. We’ve discussed fitting OLS regression models, using regression diagnostics to assess the data’s fit to statistical assumptions, and methods for modifying the data to meet these assumptions more closely. We looked at ways of selecting a final regression model from many possible models, and you learned how to evaluate its likely performance on new samples of data. Finally, we tackled the thorny problem of variable importance: identifying which variables are the most important for predicting an outcome.

In each of the examples in this chapter, the predictor variables have been quantitative. However, there are no restrictions against using categorical variables as predictors as well. Using categorical predictors such as gender, treatment type, or manufacturing process allows you to examine group differences on a response or outcome variable. This is the focus of our next chapter.

9 Analysis of variance

This chapter covers

Using R to model basic experimental designs
Fitting and interpreting ANOVA type models
Evaluating model assumptions

In chapter 8, we looked at regression models for predicting a quantitative response variable from quantitative predictor variables. But there’s no reason that we couldn’t have included nominal or ordinal factors as predictors as well. When factors are included as explanatory variables, our focus usually shifts from prediction to understanding group differences, and the methodology is referred to as analysis of variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental and quasi-experimental designs. This chapter provides an overview of R functions for analyzing common research designs.

First we’ll look at design terminology, followed by a general discussion of R’s approach to fitting ANOVA models. Then we’ll explore several examples that illustrate the analysis of common designs. Along the way, we’ll treat anxiety disorders, lower blood cholesterol levels, help pregnant mice have fat babies, assure that pigs grow long in the tooth, facilitate breathing in plants, and learn which grocery shelves to avoid.

In addition to the base installation, we’ll be using the car, gplots, HH, rrcov, and mvoutlier packages in our examples. Be sure to install them before trying out the sample code.


220	CHAPTER 9 Analysis of variance

9.1 A crash course on terminology

Experimental design in general, and analysis of variance in particular, has its own language. Before discussing the analysis of these designs, we’ll quickly review some important terms. We’ll use a series of increasingly complex study designs to introduce the most significant concepts.

Say you’re interested in studying the treatment of anxiety. Two popular therapies for anxiety are cognitive behavior therapy (CBT) and eye movement desensitization and reprocessing (EMDR). You recruit 10 anxious individuals and randomly assign half of them to receive five weeks of CBT and half to receive five weeks of EMDR. At the conclusion of therapy, each patient is asked to complete the State-Trait Anxiety Inventory (STAI), a self-report measure of anxiety. The design is outlined in table 9.1.

Table 9.1 One-way between-groups ANOVA

Treatment
CBT	EMDR
s1	s6
s2	s7
s3	s8
s4	s9
s5	s10

In this design, Treatment is a between-groups factor with two levels (CBT, EMDR). It’s called a between-groups factor because patients are assigned to one and only one group. No patient receives both CBT and EMDR. The s characters represent the subjects (patients). STAI is the dependent variable, and Treatment is the independent variable. Because there are an equal number of observations in each treatment condition, you have a balanced design. When the sample sizes are unequal across the cells of a design, you have an unbalanced design.

The statistical design in table 9.1 is called a one-way ANOVA because there’s a single classification variable. Specifically, it’s a one-way between-groups ANOVA. Effects in ANOVA designs are primarily evaluated through F tests. If the F test for Treatment is significant, you can conclude that the mean STAI scores for the two therapies differed after five weeks of treatment.

If you were interested in the effect of CBT on anxiety over time, you could place all 10 patients in the CBT group and assess them at the conclusion of therapy and again six months later. This design is displayed in table 9.2.

Table 9.2 One-way within-groups ANOVA

	Time
Patient	5 weeks	6 months
s1
s2
s3
s4
s5
s6
s7
s8
s9
s10

Time is a within-groups factor with two levels (five weeks, six months). It’s called a within-groups factor because each patient is measured under both levels. The statistical design is a one-way within-groups ANOVA. Because each subject is measured more than once, the design is also called repeated measures ANOVA. If the F test for Time is significant, you can conclude that patients’ mean STAI scores changed between five weeks and six months.
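To make the between-groups layout concrete, here is a small sketch of how the design in table 9.1 might be stored in R and checked for balance. The STAI scores are invented placeholder values, and the names anxiety, Subject, Treatment, STAI, cell_n, and balanced are assumptions chosen for this illustration.

```r
# Hypothetical data for the design in table 9.1 (scores are invented)
anxiety <- data.frame(
  Subject   = factor(paste0("s", 1:10), levels = paste0("s", 1:10)),
  Treatment = factor(rep(c("CBT", "EMDR"), each = 5)),
  STAI      = c(42, 38, 45, 40, 37, 44, 47, 41, 46, 43)
)

# Cell sizes: equal counts in every level mean the design is balanced
cell_n <- table(anxiety$Treatment)
cell_n
balanced <- length(unique(as.vector(cell_n))) == 1
balanced
```

Each patient occupies exactly one row and one level of Treatment, which is what makes Treatment a between-groups factor.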

If you were interested in both treatment differences and change over time, you could combine the first two study designs, and randomly assign five patients to CBT and five patients to EMDR, and assess their STAI results at the end of therapy (five weeks) and at six months (see table 9.3).

By including both Therapy and Time as factors, you’re able to examine the impact of Therapy (averaged across time), Time (averaged across therapy type), and the interaction of Therapy and Time. The first two are called the main effects, whereas the interaction is (not surprisingly) called an interaction effect.

When you cross two or more factors, as you’ve done here, you have a factorial ANOVA design. Crossing two factors produces a two-way ANOVA, crossing three factors produces a three-way ANOVA, and so forth. When a factorial design includes both between-groups and within-groups factors, it’s also called a mixed-model ANOVA. The current design is a two-way mixed-model factorial ANOVA (phew!).

In this case you’ll have three F tests: one for Therapy, one for Time, and one for the Therapy x Time interaction. A significant result for Therapy indicates that CBT and EMDR differ in their impact on anxiety. A significant result for Time indicates that anxiety changed from week five to the six month follow-up. A significant Therapy x Time interaction indicates that the two treatments for anxiety had a differential impact over time (that is, the change in anxiety from five weeks to six months was different for the two treatments).

Table 9.3 Two-way factorial ANOVA with one between-groups and one within-groups factor

		Time
Therapy	Patient	5 weeks	6 months
CBT	s1
CBT	s2
CBT	s3
CBT	s4
CBT	s5
EMDR	s6
EMDR	s7
EMDR	s8
EMDR	s9
EMDR	s10
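The mixed design can be sketched as a long-format data frame with one row per patient per time point. This is only an illustration of the layout in table 9.3: no scores are included, and the object names (mixed, time_per_subject, therapy_per_subject) are assumptions made for this example.

```r
# Hypothetical long-format layout for table 9.3: one row per patient per time
mixed <- expand.grid(
  Subject = paste0("s", 1:10),
  Time    = c("5 weeks", "6 months")
)
mixed$Subject <- factor(mixed$Subject, levels = paste0("s", 1:10))
mixed$Time    <- factor(mixed$Time, levels = c("5 weeks", "6 months"))

# s1-s5 receive CBT, s6-s10 receive EMDR (the between-groups factor)
mixed$Therapy <- factor(ifelse(as.integer(mixed$Subject) <= 5, "CBT", "EMDR"))

# Each Therapy x Time cell holds five observations
xtabs(~ Therapy + Time, data = mixed)

# Every patient appears under both Time levels but only one Therapy
time_per_subject    <- rowSums(table(mixed$Subject, mixed$Time) > 0)
therapy_per_subject <- rowSums(table(mixed$Subject, mixed$Therapy) > 0)
time_per_subject
therapy_per_subject
```

The two counts summarize the distinction made in the text: Time is within-groups (each patient has two levels), whereas Therapy is between-groups (each patient has one).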

Now let’s extend the design a bit. It’s known that depression can have an impact on therapy, and that depression and anxiety often co-occur. Even though subjects were randomly assigned to treatment conditions, it’s possible that the two therapy groups differed in patient depression levels at the initiation of the study. Any posttherapy differences might then be due to the preexisting depression differences and not to your experimental manipulation. Because depression could also explain the group differences on the dependent variable, it’s a confounding factor. And because you’re not interested in depression, it’s called a nuisance variable.

If you recorded depression levels using a self-report depression measure such as the Beck Depression Inventory (BDI) when patients were recruited, you could statistically adjust for any treatment group differences in depression before assessing the impact of therapy type. In this case, BDI would be called a covariate, and the design would be called an analysis of covariance (ANCOVA).

Finally, you’ve recorded a single dependent variable in this study (the STAI). You could increase the validity of this study by including additional measures of anxiety (such as family ratings, therapist ratings, and a measure assessing the impact of anxiety on their daily functioning). When there’s more than one dependent variable, the design is called a multivariate analysis of variance (MANOVA). If there are covariates present, it’s called a multivariate analysis of covariance (MANCOVA).

Now that you have the basic terminology under your belt, you’re ready to amaze your friends, dazzle new acquaintances, and discuss how to fit ANOVA/ANCOVA/MANOVA models with R.

9.2 Fitting ANOVA models

Although ANOVA and regression methodologies developed separately, functionally they’re both special cases of the general linear model. We could analyze ANOVA models using the same lm() function used for regression in chapter 8. However, we’ll primarily use the aov() function in this chapter. The results of lm() and aov() are equivalent, but the aov() function presents these results in a format that’s more familiar to ANOVA methodologists. For completeness, I’ll provide an example using lm() at the end of this chapter.

9.2.1 The aov() function

The syntax of the aov() function is aov(formula, data=dataframe). Table 9.4 describes special symbols that can be used in the formulas. In this table, y is the dependent variable and the letters A, B, and C represent factors.

Table 9.4 Special symbols used in R formulas

Symbol	Usage

~	Separates response variables on the left from the explanatory variables on the right. For example, a prediction of y from A, B, and C would be coded y ~ A + B + C.
+	Separates explanatory variables.
:	Denotes an interaction between variables. A prediction of y from A, B, and the interaction between A and B would be coded y ~ A + B + A:B.
*	Denotes the complete crossing of variables. The code y ~ A*B*C expands to y ~ A + B + C + A:B + A:C + B:C + A:B:C.
^	Denotes crossing to a specified degree. The code y ~ (A+B+C)^2 expands to y ~ A + B + C + A:B + A:C + B:C.
.	A placeholder for all other variables in the data frame except the dependent variable. For example, if a data frame contained the variables y, A, B, and C, then the code y ~ . would expand to y ~ A + B + C.
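You can ask R directly how it expands a formula by inspecting the term labels returned by terms(). The sketch below wraps that call in a small helper; expand_terms() and the data frame df are hypothetical names introduced for this illustration and aren’t part of base R.

```r
# Hypothetical helper: list the terms that a model formula expands to
expand_terms <- function(f, data = NULL) {
  attr(terms(f, data = data), "term.labels")
}

expand_terms(y ~ A * B * C)        # main effects plus all interactions
expand_terms(y ~ (A + B + C)^2)    # main effects plus two-way interactions
expand_terms(y ~ A + B + A:B)      # explicit interaction term

# The dot needs a data frame so R knows which variables exist
df <- data.frame(y = 1:4, A = gl(2, 2), B = gl(2, 1, 4), C = 4:1)
expand_terms(y ~ ., data = df)
```

Checking the expansion this way can be useful before fitting a model with many factors, because it shows exactly which effects will appear in the ANOVA table.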

Table 9.5 provides formulas for several common research designs. In this table, lowercase letters are quantitative variables, uppercase letters are grouping factors, and Subject is a unique identifier variable for subjects.

Table 9.5 Formulas for common research designs

Design	Formula

One-way ANOVA	y ~ A
One-way ANCOVA with one covariate	y ~ x + A
Two-way Factorial ANOVA	y ~ A * B
Two-way Factorial ANCOVA with two covariates	y ~ x1 + x2 + A * B
Randomized Block	y ~ B + A (where B is a blocking factor)
One-way within-groups ANOVA	y ~ A + Error(Subject/A)
Repeated measures ANOVA with one within-groups factor (W) and one between-groups factor (B)	y ~ B * W + Error(Subject/W)

We’ll explore in-depth examples of several of these designs later in this chapter.
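As a quick preview, here is a minimal sketch of the one-way within-groups formula from table 9.5 applied to simulated data. The data frame d, its values, and the level names t1 and t2 are assumptions made purely for illustration; they don’t come from any dataset used in this chapter.

```r
set.seed(1)  # simulated data, for illustration only
n <- 10
d <- data.frame(
  Subject = factor(rep(paste0("s", 1:n), times = 2)),
  A       = factor(rep(c("t1", "t2"), each = n)),
  y       = c(rnorm(n, mean = 45, sd = 5), rnorm(n, mean = 40, sd = 5))
)

# One-way within-groups ANOVA: each Subject is measured at both levels of A
fit <- aov(y ~ A + Error(Subject/A), data = d)
summary(fit)
```

The Error() term tells aov() that observations are nested within subjects, so the effect of A is tested against the within-subject variability rather than the overall residual.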

9.2.2 The order of formula terms

The order in which the effects appear in a formula matters when (a) there’s more than one factor and the design is unbalanced, or (b) covariates are present. When either of these two conditions is present, the variables on the right side of the equation will be correlated with each other. In this case, there’s no unambiguous way to divide up their impact on the dependent variable.


For example, in a two-way ANOVA with unequal numbers of observations in the treatment combinations, the model y ~ A*B will not produce the same results as the model y ~ B*A.

By default, R employs the Type I (sequential) approach to calculating ANOVA effects (see the sidebar “Order counts!”). The first model can be written out as y ~ A + B + A:B. The resulting R ANOVA table will assess

The impact of A on y

The impact of B on y, controlling for A

The interaction of A and B, controlling for the A and B main effects

Order counts!

When independent variables are correlated with each other or with covariates, there’s no unambiguous method for assessing the independent contributions of these variables to the dependent variable. Consider an unbalanced two-way factorial design with factors A and B and dependent variable y. There are three effects in this design: the A and B main effects and the A x B interaction. Assuming that you’re modeling the data using the formula

y ~ A + B + A:B

there are three typical approaches for partitioning the variance in y among the effects on the right side of this equation.

Type I (sequential)

Effects are adjusted for those that appear earlier in the formula. A is unadjusted. B is adjusted for A. The A:B interaction is adjusted for A and B.

Type II (hierarchical)

Effects are adjusted for other effects at the same or lower level. A is adjusted for B. B is adjusted for A. The A:B interaction is adjusted for both A and B.

Type III (marginal)

Each effect is adjusted for every other effect in the model. A is adjusted for B and A:B. B is adjusted for A and A:B. The A:B interaction is adjusted for A and B.

R employs the Type I approach by default. Other programs such as SAS and SPSS employ the Type III approach by default.

The greater the imbalance in sample sizes, the greater the impact that the order of the terms will have on the results. In general, more fundamental effects should be listed earlier in the formula. In particular, covariates should be listed first, followed by main effects, followed by two-way interactions, followed by three-way interactions, and so on. For main effects, more fundamental variables should be listed first. Thus gender would be listed before treatment. Here’s the bottom line: When the research design isn’t orthogonal (that is, when the factors and/or covariates are correlated), be careful when specifying the order of effects.
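The effect of term order is easy to see with a small simulation. The sketch below builds a deliberately unbalanced two-way layout and compares the sequential sums of squares for A under the two orderings. The cell sizes, effect sizes, and object names (cells, d, ss_ab, ss_ba) are assumptions chosen for illustration.

```r
set.seed(42)  # simulated, deliberately unbalanced data
cells <- data.frame(A = c("a1", "a1", "a2", "a2"),
                    B = c("b1", "b2", "b1", "b2"),
                    n = c(8, 2, 3, 7))
d <- cells[rep(seq_len(nrow(cells)), cells$n), c("A", "B")]
d$A <- factor(d$A)
d$B <- factor(d$B)
d$y <- 10 + 2 * (d$A == "a2") + 3 * (d$B == "b2") + rnorm(nrow(d))

# Type I (sequential) sums of squares depend on the order of the terms
ss_ab <- anova(aov(y ~ A * B, data = d))  # A first, then B adjusted for A
ss_ba <- anova(aov(y ~ B * A, data = d))  # B first, then A adjusted for B

ss_ab["A", "Sum Sq"]   # A unadjusted
ss_ba["A", "Sum Sq"]   # A adjusted for B

# The residual sum of squares is the same under either ordering
ss_ab["Residuals", "Sum Sq"]
ss_ba["Residuals", "Sum Sq"]
```

Only the allocation of variance among the main effects changes with the ordering; the overall fit of the model doesn’t.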


Before moving on to specific examples, note that the Anova() function in the car package (not to be confused with the standard anova() function) provides the option of using the Type II or Type III approach, rather than the Type I approach used by the aov() function. You may want to use the Anova() function if you’re concerned about matching your results to those provided by other packages such as SAS and SPSS. See help(Anova, package="car") for details.
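A call to Anova() might look like the following sketch, which fits a two-way model to simulated unbalanced data and requests Type II tests. The data are invented for illustration, and the call is wrapped in a check so that the code still runs if the car package isn’t installed.

```r
set.seed(7)  # simulated unbalanced two-way layout, for illustration only
d <- data.frame(A = factor(rep(c("a1", "a2"), times = c(10, 10))),
                B = factor(c(rep(c("b1", "b2"), times = c(8, 2)),
                             rep(c("b1", "b2"), times = c(3, 7)))))
d$y <- 10 + 2 * (d$A == "a2") + 3 * (d$B == "b2") + rnorm(nrow(d))

fit <- aov(y ~ A * B, data = d)
summary(fit)                      # Type I (sequential), the aov() default

# Type II tests from the car package, if it is available
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(fit, type = "II"))
}
```

Comparing the two tables on unbalanced data shows where the sequential and hierarchical approaches disagree about the main effects.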

9.3 One-way ANOVA

In a one-way ANOVA, you’re interested in comparing the dependent variable means of two or more groups defined by a categorical grouping factor. Our example comes from the cholesterol dataset in the multcomp package, and is taken from Westfall, Tobias, Rom, & Hochberg (1999). Fifty patients received one of five cholesterol-reducing drug regimens (trt). Three of the treatment conditions involved the same drug administered as 20 mg once per day (1time), 10 mg twice per day (2times), or 5 mg four times per day (4times). The two remaining conditions (drugD and drugE) represented competing drugs. Which drug regimen produced the greatest cholesterol reduction (response)? The analysis is provided in the following listing.

Listing 9.1 One-way ANOVA

> library(multcomp)
> attach(cholesterol)
> table(trt)                                    # Group sample sizes
trt
 1time 2times 4times  drugD  drugE
    10     10     10     10     10
> aggregate(response, by=list(trt), FUN=mean)   # Group means
  Group.1     x
1   1time  5.78
2  2times  9.22
3  4times 12.37
4   drugD 15.36
5   drugE 20.95
> aggregate(response, by=list(trt), FUN=sd)     # Group standard deviations
  Group.1    x
1   1time 2.88
2  2times 3.48
3  4times 2.92
4   drugD 3.45
5   drugE 3.35
> fit <- aov(response ~ trt)                    # Test for group differences (ANOVA)
> summary(fit)
            Df Sum Sq Mean Sq F value  Pr(>F)
trt          4   1351     338    32.4 9.8e-13 ***
Residuals   45    469      10
---
