17 Classification
This chapter covers
■ Classifying with decision trees
■ Ensemble classification with random forests
■ Creating a support vector machine
■ Evaluating classification accuracy
Data analysts are frequently faced with the need to predict a categorical outcome from a set of predictor variables. Some examples include
■ Predicting whether an individual will repay a loan, given their demographics and financial history
■ Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs
■ Deciding whether an email is spam, given the presence of key words, images, hypertext, header information, and origin
Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.
The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The first four are discussed in this chapter. Neural networks are beyond the scope of this book.
Supervised learning starts with a set of observations containing values for both the predictor variables and the outcome. The dataset is then divided into a training sample and a validation sample. A predictive model is developed using the data in the training sample and tested for accuracy using the data in the validation sample. Both samples are needed because classification techniques maximize prediction for a given set of data. Estimates of their effectiveness will be overly optimistic if they’re evaluated using the same data that generated the model. By applying the classification rules developed on a training sample to a separate validation sample, you can obtain a more realistic accuracy estimate. Once you’ve created an effective predictive model, you can use it to predict outcomes in situations where only the predictor variables are known.
In this chapter, you’ll use the rpart, rpart.plot, and party packages to create and visualize decision trees; the randomForest package to fit random forests; and the e1071 package to build support vector machines. Logistic regression will be fit with the glm() function in the base R installation. Before starting, be sure to install the necessary packages:
pkgs <- c("rpart", "rpart.plot", "party", "randomForest", "e1071")
install.packages(pkgs, depend=TRUE)    # depend= partially matches the dependencies argument
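Once installed, each package must be loaded with library() before its functions can be used. As a quick sanity check (not part of the book's listings), you can load them all up front; library() stops with an error if any package is missing:

library(rpart)          # classical decision trees
library(rpart.plot)     # tree visualization
library(party)          # conditional inference trees
library(randomForest)   # random forests
library(e1071)          # support vector machines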
The primary example used in this chapter comes from the Wisconsin Breast Cancer data originally posted to the UCI Machine Learning Repository. The goal will be to develop a model for predicting whether a patient has breast cancer from the characteristics of a fine-needle tissue aspiration (a tissue sample taken with a thin hollow needle from a lump or mass just under the skin).
17.1 Preparing the data
The Wisconsin Breast Cancer dataset is available as a comma-delimited text file on the UCI Machine Learning Server (http://archive.ics.uci.edu/ml). The dataset contains 699 fine-needle aspirate samples, where 458 (65.5%) are benign and 241 (34.5%) are malignant. The dataset contains a total of 11 variables and doesn’t include the variable names in the file. Sixteen samples have missing data and are coded in the text file with a question mark (?).
The variables are as follows:
■ ID
■ Clump thickness
■ Uniformity of cell size
■ Uniformity of cell shape
■ Marginal adhesion
■ Single epithelial cell size
■ Bare nuclei
■ Bland chromatin
■ Normal nucleoli
■ Mitoses
■ Class
The first variable is an ID variable (which you’ll drop), and the last variable (class) contains the outcome (coded 2=benign, 4=malignant).
For each sample, nine cytological characteristics previously found to correlate with malignancy are also recorded. These variables are each scored from 1 (closest to benign) to 10 (most anaplastic). But no one predictor alone can distinguish between benign and malignant samples. The challenge is to find a set of classification rules that can be used to accurately predict malignancy from some combination of these nine cell characteristics. See Mangasarian and Wolberg (1990) for details.
In the following listing, the comma-delimited text file containing the data is downloaded from the UCI repository and randomly divided into a training sample (70%) and a validation sample (30%).
Listing 17.1 Preparing the breast cancer data
loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")
breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
                   "shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize",
                   "bareNuclei", "blandChromatin", "normalNucleoli", "mitosis", "class")
df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4),
                   labels=c("benign", "malignant"))
set.seed(1234)
train <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[train,]
df.validate <- df[-train,]
table(df.train$class)
table(df.validate$class)
The training sample has 489 cases (329 benign, 160 malignant), and the validation sample has 210 cases (129 benign, 81 malignant).
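The table() calls at the end of listing 17.1 produce those counts; wrapping them in prop.table() converts the counts to proportions, making it easy to confirm that the random split roughly preserved the benign-to-malignant ratio (a quick check, not part of the original listing):

prop.table(table(df.train$class))      # 329/489 = 0.67 benign, 160/489 = 0.33 malignant
prop.table(table(df.validate$class))   # 129/210 = 0.61 benign, 81/210 = 0.39 malignant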
The training sample will be used to create classification schemes using logistic regression, a decision tree, a conditional decision tree, a random forest, and a support vector machine. The validation sample will be used to evaluate the effectiveness of these schemes. By using the same example throughout the chapter, you can compare the results of each approach.
17.2 Logistic regression
Logistic regression is a type of generalized linear model that is often used to predict a binary outcome from a set of numeric variables (see section 13.2 for details). The glm() function in the base R installation is used for fitting the model. Categorical predictors (factors) are automatically replaced with a set of dummy coded variables. All the predictors in the Wisconsin Breast Cancer data are numeric, so dummy coding is unnecessary. The next listing provides a logistic regression analysis of the data.
Listing 17.2 Logistic regression with glm()
> fit.logit <- glm(class~., data=df.train,
                   family=binomial())                        #b Fits the logistic regression
> summary(fit.logit)                                         #c Examines the model

Call:
glm(formula = class ~ ., family = binomial(), data = df.train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7581  -0.1060  -0.0568   0.0124   2.6432

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -10.4276     1.4760   -7.06  1.6e-12 ***
clumpThickness             0.5243     0.1595    3.29   0.0010 **
sizeUniformity            -0.0481     0.2571   -0.19   0.8517
shapeUniformity            0.4231     0.2677    1.58   0.1141
maginalAdhesion            0.2924     0.1469    1.99   0.0465 *
singleEpithelialCellSize   0.1105     0.1798    0.61   0.5387
bareNuclei                 0.3357     0.1072    3.13   0.0017 **
blandChromatin             0.4235     0.2067    2.05   0.0405 *
normalNucleoli             0.2889     0.1399    2.06   0.0390 *
mitosis                    0.6906     0.3983    1.73   0.0829 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> prob <- predict(fit.logit, df.validate,
                  type="response")                           #d Classifies new cases
> logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                       labels=c("benign", "malignant"))
> logit.perf <- table(df.validate$class, logit.pred,
                      dnn=c("Actual", "Predicted"))          #e Evaluates the predictive accuracy
> logit.perf
           Predicted
Actual      benign malignant
  benign       118         2
  malignant      4        76
First, a logistic regression model is fit using class as the dependent variable and the remaining variables as predictors (#b). The model is based on the cases in the df.train data frame. The coefficients for the model are displayed next (#c). Section 13.2 provides guidelines for interpreting logistic model coefficients.
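Classifier performance is compared more systematically later in the chapter, but a confusion matrix like logit.perf can be turned into summary statistics immediately. The following sketch is not part of the book's listing; it simply assumes the logit.perf table printed above:

# Overall accuracy: proportion of validation cases classified correctly
accuracy <- sum(diag(logit.perf)) / sum(logit.perf)          # (118 + 76)/210 = 0.92
# Sensitivity: proportion of malignant cases correctly identified
sensitivity <- logit.perf["malignant", "malignant"] /
  sum(logit.perf["malignant", ])                             # 76/80 = 0.95
# Specificity: proportion of benign cases correctly identified
specificity <- logit.perf["benign", "benign"] /
  sum(logit.perf["benign", ])                                # 118/120 = 0.98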