R in Action, Second Edition.pdf


312	CHAPTER 13 Generalized linear models

The ability to model a response variable with multiple categories (both ordered and unordered) is an important extension, but it comes at the expense of greater interpretive complexity. Assessing model fit and regression diagnostics in these cases will also be more complex.

In the Affairs example, the number of extramarital contacts was dichotomized into a yes/no response variable because our interest centered on whether respondents had an affair in the past year. If our interest had been centered on magnitude (the number of encounters in the past year), we would have analyzed the count data directly. One popular approach to analyzing count data is Poisson regression, the next topic we’ll address.

13.3 Poisson regression

Poisson regression is useful when you’re predicting an outcome variable representing counts from a set of continuous and/or categorical predictor variables. A comprehensive yet accessible introduction to Poisson regression is provided by Coxe, West, and Aiken (2009).

To illustrate the fitting of a Poisson regression model, along with some issues that can come up in the analysis, we’ll use the Breslow seizure data (Breslow, 1993) provided in the robust package. Specifically, we’ll consider the impact of an antiepileptic drug treatment on the number of seizures occurring over an eight-week period following the initiation of therapy. Be sure to install the robust package before continuing.

Data were collected on the age and number of seizures reported by patients suffering from simple or complex partial seizures during an eight-week period before, and an eight-week period after, randomization into a drug or placebo condition. sumY (the number of seizures in the eight-week period post-randomization) is the response variable. Treatment condition (Trt), age in years (Age), and number of seizures reported in the baseline eight-week period (Base) are the predictor variables. The baseline number of seizures and age are included because of their potential effect on the response variable. We're interested in whether evidence exists that the drug treatment decreases the number of seizures after accounting for these covariates.

First, let’s look at summary statistics for the dataset:

> data(breslow.dat, package="robust")
> names(breslow.dat)
 [1] "ID"    "Y1"    "Y2"    "Y3"    "Y4"    "Base"  "Age"   "Trt"   "Ysum"
[10] "sumY"  "Age10" "Base4"
> summary(breslow.dat[c(6,7,8,10)])
      Base            Age              Trt          sumY
 Min.   :  6.0   Min.   :18.0   placebo  :28   Min.   :  0.0
 1st Qu.: 12.0   1st Qu.:23.0   progabide:31   1st Qu.: 11.5
 Median : 22.0   Median :28.0                  Median : 16.0
 Mean   : 31.2   Mean   :28.3                  Mean   : 33.1
 3rd Qu.: 41.0   3rd Qu.:32.0                  3rd Qu.: 36.0
 Max.   :151.0   Max.   :42.0                  Max.   :302.0

[Figure: two panels. Left: histogram titled "Distribution of Seizures" (x-axis: Seizure Count, y-axis: Frequency). Right: box plots titled "Group Comparisons" (x-axis: Treatment, placebo vs. progabide; y-axis: seizure count).]

Figure 13.1 Distribution of post-treatment seizure counts (source: Breslow seizure data)

Note that although there are 12 variables in the dataset, we’re limiting our attention to the 4 described earlier. Both the baseline and post-randomization number of seizures are highly skewed. Let’s look at the response variable in more detail. The following code produces the graphs in figure 13.1:

opar <- par(no.readonly=TRUE)
par(mfrow=c(1,2))
attach(breslow.dat)
hist(sumY, breaks=20, xlab="Seizure Count",
     main="Distribution of Seizures")
boxplot(sumY ~ Trt, xlab="Treatment", main="Group Comparisons")
par(opar)

You can clearly see the skewed nature of the dependent variable and the possible presence of outliers. At first glance, the number of seizures in the drug condition appears to be smaller and has a smaller variance. (You’d expect a smaller variance to accompany a smaller mean with Poisson distributed data.) Unlike standard OLS regression, this heterogeneity of variance isn’t a problem in Poisson regression.

The next step is to fit the Poisson regression:

> fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, family=poisson())
> summary(fit)


Call:
glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.057  -2.043  -0.940   0.793  11.006

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)   1.948826   0.135619   14.37  < 2e-16 ***
Base          0.022652   0.000509   44.48  < 2e-16 ***
Age           0.022740   0.004024    5.65  1.6e-08 ***
Trtprogabide -0.152701   0.047805   -3.19   0.0014 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2122.73  on 58  degrees of freedom
Residual deviance:  559.44  on 55  degrees of freedom
AIC: 850.7

Number of Fisher Scoring iterations: 5

The output provides the deviances, regression parameters, and standard errors, and tests that these parameters are 0. Note that each of the predictor variables is significant at the p < 0.05 level.

13.3.1 Interpreting the model parameters

The model coefficients are obtained using the coef() function or by examining the Coefficients table in the summary() function output:

> coef(fit)
 (Intercept)         Base          Age Trtprogabide
      1.9488       0.0227       0.0227      -0.1527

In a Poisson regression, the dependent variable being modeled is the log of the conditional mean, log_e(λ). The regression parameter 0.0227 for Age indicates that a one-year increase in age is associated with a 0.023 increase in the log mean number of seizures, holding baseline seizures and treatment condition constant. The intercept is the log mean number of seizures when each of the predictors equals 0. Because you can’t have a zero age and none of the participants had a zero number of baseline seizures, the intercept isn’t meaningful in this case.
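Written out, the model being fit here can be sketched as follows. The subscript i (patient) and the symbols β0 through β3 are notation introduced for this illustration; Trt is coded 1 for progabide and 0 for placebo, and the numeric values are the rounded estimates from the coefficient table.

```latex
\mathrm{sumY}_i \sim \mathrm{Poisson}(\lambda_i), \qquad
\log_e(\lambda_i) = \beta_0 + \beta_1\,\mathrm{Base}_i + \beta_2\,\mathrm{Age}_i + \beta_3\,\mathrm{Trt}_i

\log_e(\hat{\lambda}_i) \approx 1.949 + 0.0227\,\mathrm{Base}_i + 0.0227\,\mathrm{Age}_i - 0.153\,\mathrm{Trt}_i
```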

It’s usually much easier to interpret the regression coefficients in the original scale of the dependent variable (number of seizures, rather than log number of seizures). To accomplish this, exponentiate the coefficients:

> exp(coef(fit))
 (Intercept)         Base          Age Trtprogabide
       7.020        1.023        1.023        0.858

Now you see that a one-year increase in age multiplies the expected number of seizures by 1.023, holding the other variables constant. This means that increased age is associated with higher numbers of seizures. More important, a one-unit change in Trt (that is, moving from placebo to progabide) multiplies the expected number of seizures by 0.86. You’d expect a 14% decrease in the number of seizures for the drug group compared with the placebo group, holding baseline number of seizures and age constant.

It’s important to remember that, like the exponentiated parameters in logistic regression, the exponentiated parameters in the Poisson model have a multiplicative rather than an additive effect on the response variable. Also, as with logistic regression, you must evaluate your model for overdispersion.
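The multiplicative interpretation can be checked by hand. The sketch below copies the coefficients from the summary(fit) output and evaluates the fitted equation for two made-up patient profiles; the helper expected_count() and the profile values are assumptions introduced only for illustration.

```r
# Coefficients as printed in the summary(fit) output
b <- c("(Intercept)" = 1.948826, Base = 0.022652,
       Age = 0.022740, Trtprogabide = -0.152701)

# Hypothetical helper: expected seizure count for given covariate values
# (progabide is 1 for the drug group, 0 for placebo)
expected_count <- function(base, age, progabide) {
  unname(exp(b["(Intercept)"] + b["Base"] * base +
             b["Age"] * age + b["Trtprogabide"] * progabide))
}

# Two made-up profiles, each evaluated under placebo and under progabide
low  <- c(placebo = expected_count(10, 25, 0), progabide = expected_count(10, 25, 1))
high <- c(placebo = expected_count(60, 35, 0), progabide = expected_count(60, 35, 1))

# The ratio progabide/placebo is the same for both profiles: exp(-0.152701)
ratio_low  <- unname(low["progabide"]  / low["placebo"])
ratio_high <- unname(high["progabide"] / high["placebo"])

# The absolute difference in expected counts is not constant
diff_low  <- unname(low["progabide"]  - low["placebo"])
diff_high <- unname(high["progabide"] - high["placebo"])

# Percent change in the expected count per one-unit increase in each predictor
pct_change <- (exp(b[-1]) - 1) * 100
print(round(pct_change, 1))
```

The ratio is identical across profiles while the difference grows with the baseline count, which is what a multiplicative effect means in practice.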

13.3.2 Overdispersion

In a Poisson distribution, the variance and mean are equal. Overdispersion occurs in Poisson regression when the observed variance of the response variable is larger than would be predicted by the Poisson distribution. Because overdispersion is often encountered when dealing with count data and can have a negative impact on the interpretation of the results, we’ll spend some time discussing it.
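In symbols, and using Y for the count and λ for its mean (the same λ as in the model for the log mean), the equal mean and variance property and the overdispersion condition can be written as shown here. The conditioning notation on the predictors x is an assumption added for this sketch.

```latex
P(Y = y) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \quad y = 0, 1, 2, \dots
\qquad E(Y) = \mathrm{Var}(Y) = \lambda

\text{Overdispersion:} \quad \mathrm{Var}(Y \mid x) > E(Y \mid x)
```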

There are several reasons why overdispersion may occur (Coxe et al., 2009):

■ The omission of an important predictor variable can lead to overdispersion.

■ Overdispersion can also be caused by a phenomenon known as state dependence. Within observations, each event in a count is assumed to be independent. For the seizure data, this would imply that for any patient, the probability of a seizure is independent of each other seizure. But this assumption is often untenable. For a given individual, the probability of having a first seizure is unlikely to be the same as the probability of having a 40th seizure, given that they’ve already had 39.

■ In longitudinal studies, overdispersion can be caused by the clustering inherent in repeated measures data. We won’t discuss longitudinal Poisson models here.

If overdispersion is present and you don’t account for it in your model, you’ll get standard errors and confidence intervals that are too small, and significance tests that are too liberal (that is, you’ll find effects that aren’t really there).

As with logistic regression, overdispersion is suggested if the ratio of the residual deviance to the residual degrees of freedom is much larger than 1. For the seizure data, the ratio is

> deviance(fit)/df.residual(fit)
[1] 10.17

which is clearly much larger than 1.
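The same ratio can be computed for any fitted Poisson model. The sketch below uses simulated data rather than the seizure data, so every name and number in it is an assumption made for illustration. It also computes the Pearson-based ratio, which is the quantity that summary() reports as the dispersion parameter for a quasipoisson fit.

```r
# Simulated overdispersed counts (negative binomial), not the Breslow data
set.seed(42)
n  <- 200
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, mu = mu, size = 1.5)

fit_sim <- glm(y ~ x, family = poisson())

# Deviance-based ratio, as used in the text
dev_ratio <- deviance(fit_sim) / df.residual(fit_sim)

# Pearson-based ratio: sum of squared Pearson residuals over residual df
pearson_ratio <- sum(residuals(fit_sim, type = "pearson")^2) / df.residual(fit_sim)

# The quasipoisson dispersion estimate is the Pearson-based ratio
fit_sim_q <- glm(y ~ x, family = quasipoisson())
quasi_dispersion <- summary(fit_sim_q)$dispersion

print(c(deviance = dev_ratio, pearson = pearson_ratio, quasi = quasi_dispersion))
```

The two ratios usually tell the same story but are not identical, which is why the deviance ratio of 10.17 for the seizure data differs from the dispersion of 11.8 reported by the quasi-Poisson fit.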

The qcc package provides a test for overdispersion in the Poisson case. (Be sure to download and install this package before first use.) You can test for overdispersion in the seizure data using the following code:

> library(qcc)
> qcc.overdispersion.test(breslow.dat$sumY, type="poisson")

Overdispersion test Obs.Var/Theor.Var Statistic p-value
       poisson data              62.9      3646       0

Not surprisingly, the significance test has a p-value less than 0.05, strongly suggesting the presence of overdispersion.
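The reported numbers are consistent with a plain variance-to-mean test on the raw counts: with 59 patients, 62.9 × 58 is about 3648, close to the printed statistic of 3646 given rounding. The base R sketch below implements that test. The helper name dispersion_test() is hypothetical, and its correspondence to the internals of the qcc function is an assumption rather than something the text states. Note that a test of this kind looks at the response alone and ignores the predictors.

```r
# Hypothetical helper: variance-to-mean dispersion test for a vector of counts
dispersion_test <- function(x) {
  n <- length(x)
  ratio <- var(x) / mean(x)        # observed variance / variance implied by a Poisson
  statistic <- ratio * (n - 1)     # compared against a chi-square with n - 1 df
  p_value <- pchisq(statistic, df = n - 1, lower.tail = FALSE)
  c(ratio = ratio, statistic = statistic, p.value = p_value)
}

# Small made-up vector of counts: mean 4, variance 10
res <- dispersion_test(c(0, 2, 4, 6, 8))
print(res)
```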

You can still fit a model to your data using the glm() function, by replacing family="poisson" with family="quasipoisson". Doing so is analogous to the approach to logistic regression when overdispersion is present:

> fit.od <- glm(sumY ~ Base + Age + Trt, data=breslow.dat,
                family=quasipoisson())
> summary(fit.od)

Call:
glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(), data = breslow.dat)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-6.057  -2.043  -0.940   0.793  11.006

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.94883    0.46509    4.19  0.00010 ***
Base          0.02265    0.00175   12.97  < 2e-16 ***
Age           0.02274    0.01380    1.65  0.10509
Trtprogabide -0.15270    0.16394   -0.93  0.35570
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 11.8)

    Null deviance: 2122.73  on 58  degrees of freedom
Residual deviance:  559.44  on 55  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

Notice that the parameter estimates in the quasi-Poisson approach are identical to those produced by the Poisson approach. The standard errors are much larger, though. In this case, the larger standard errors have led to p-values for Trt (and Age) that are greater than 0.05. When you take overdispersion into account, there’s insufficient evidence to declare that the drug regimen reduces seizure counts more than receiving a placebo, after controlling for baseline seizure rate and age.
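The size of the change in the standard errors follows from the dispersion estimate: the quasi-Poisson standard errors are the Poisson standard errors multiplied by the square root of the dispersion parameter. The sketch below checks this using only the numbers printed in the two summary tables; because the dispersion is printed rounded to 11.8, the match is approximate.

```r
# Standard errors copied from the two summary() tables
se_poisson <- c("(Intercept)" = 0.135619, Base = 0.000509,
                Age = 0.004024, Trtprogabide = 0.047805)
se_quasi   <- c("(Intercept)" = 0.46509, Base = 0.00175,
                Age = 0.01380, Trtprogabide = 0.16394)

# Dispersion parameter as printed (rounded) in the quasipoisson output
dispersion <- 11.8

# Rescale the Poisson standard errors by sqrt(dispersion)
se_rescaled <- se_poisson * sqrt(dispersion)

# Side-by-side comparison; the ratio column should be close to 1
comparison <- cbind(se_quasi, se_rescaled, ratio = se_rescaled / se_quasi)
print(round(comparison, 4))
```

Because the estimates are unchanged and each standard error grows by the same factor, every test statistic shrinks by that factor, which is why Age and Trt lose significance.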

Please remember that this example is used for demonstration purposes only. The results shouldn’t be taken to imply anything about the efficacy of progabide in the real world. I’m not a doctor—at least not a medical doctor—and I don’t even play one on TV.

We’ll finish this exploration of Poisson regression with a discussion of some important variants and extensions.
