
Classification

This chapter covers

Classifying with decision trees

Ensemble classification with random forests

Creating a support vector machine

Evaluating classification accuracy

Data analysts are frequently faced with the need to predict a categorical outcome from a set of predictor variables. Some examples include

Predicting whether an individual will repay a loan, given their demographics and financial history

Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs

Deciding whether an email is spam, given the presence of key words, images, hypertext, header information, and origin

Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups.


The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. The first four are discussed in this chapter. Neural networks are beyond the scope of this book.

Supervised learning starts with a set of observations containing values for both the predictor variables and the outcome. The dataset is then divided into a training sample and a validation sample. A predictive model is developed using the data in the training sample and tested for accuracy using the data in the validation sample. Both samples are needed because classification techniques maximize prediction for a given set of data. Estimates of their effectiveness will be overly optimistic if they’re evaluated using the same data that generated the model. By applying the classification rules developed on a training sample to a separate validation sample, you can obtain a more realistic accuracy estimate. Once you’ve created an effective predictive model, you can use it to predict outcomes in situations where only the predictor variables are known.
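The following is a minimal sketch of that workflow; the data frame mydata and the outcome y are hypothetical placeholders, and listing 17.1 carries out the same split on real data:

set.seed(42)                                     # make the random split reproducible
train <- sample(nrow(mydata), 0.7*nrow(mydata))  # indices of a 70% training sample
fit <- glm(y ~ ., data=mydata[train,],           # develop the model on the training sample
           family=binomial())
pred <- predict(fit, mydata[-train,],            # obtain predictions for the held-out
                type="response")                 # validation sample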

In this chapter, you’ll use the rpart, rpart.plot, and party packages to create and visualize decision trees; the randomForest package to fit random forests; and the e1071 package to build support vector machines. Logistic regression will be fit with the glm() function in the base R installation. Before starting, be sure to install the necessary packages:

pkgs <- c("rpart", "rpart.plot", "party", "randomForest", "e1071")
install.packages(pkgs, dependencies=TRUE)
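Installing the packages places them in your library; each must also be loaded with library() before it can be used in a session. For example:

library(rpart)          # classical decision trees
library(rpart.plot)     # tree diagrams
library(party)          # conditional inference trees
library(randomForest)   # random forests
library(e1071)          # support vector machines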

The primary example used in this chapter comes from the Wisconsin Breast Cancer data originally posted to the UCI Machine Learning Repository. The goal will be to develop a model for predicting whether a patient has breast cancer from the characteristics of a fine-needle tissue aspiration (a tissue sample taken with a thin hollow needle from a lump or mass just under the skin).

17.1 Preparing the data

The Wisconsin Breast Cancer dataset is available as a comma-delimited text file on the UCI Machine Learning Server (http://archive.ics.uci.edu/ml). It contains 699 fine-needle aspirate samples, of which 458 (65.5%) are benign and 241 (34.5%) are malignant. There are 11 variables in total, and the file doesn't include the variable names. Sixteen samples have missing data, coded in the text file with a question mark (?).

The variables are as follows:

ID

Clump thickness

Uniformity of cell size

Uniformity of cell shape

Marginal adhesion


Single epithelial cell size

Bare nuclei

Bland chromatin

Normal nucleoli

Mitoses

Class

The first variable is an ID variable (which you’ll drop), and the last variable (class) contains the outcome (coded 2=benign, 4=malignant).

For each sample, nine cytological characteristics previously found to correlate with malignancy are also recorded. These variables are each scored from 1 (closest to benign) to 10 (most anaplastic). But no one predictor alone can distinguish between benign and malignant samples. The challenge is to find a set of classification rules that can be used to accurately predict malignancy from some combination of these nine cell characteristics. See Mangasarian and Wolberg (1990) for details.

In the following listing, the comma-delimited text file containing the data is downloaded from the UCI repository and randomly divided into a training sample (70%) and a validation sample (30%).

Listing 17.1 Preparing the breast cancer data

loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/"
ds  <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data"
url <- paste(loc, ds, sep="")

breast <- read.table(url, sep=",", header=FALSE, na.strings="?")
names(breast) <- c("ID", "clumpThickness", "sizeUniformity",
                   "shapeUniformity", "maginalAdhesion",
                   "singleEpithelialCellSize", "bareNuclei",
                   "blandChromatin", "normalNucleoli",
                   "mitosis", "class")

df <- breast[-1]
df$class <- factor(df$class, levels=c(2,4),
                   labels=c("benign", "malignant"))

set.seed(1234)
train <- sample(nrow(df), 0.7*nrow(df))
df.train <- df[train,]
df.validate <- df[-train,]
table(df.train$class)
table(df.validate$class)
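The two table() calls at the end of the listing print the class distribution in each sample. With the seed above, you should see output along these lines (exact counts can vary with the sampling algorithm of your R version):

> table(df.train$class)

   benign malignant
      329       160

> table(df.validate$class)

   benign malignant
      129        81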

The training sample has 489 cases (329 benign, 160 malignant), and the validation sample has 210 cases (129 benign, 81 malignant).

The training sample will be used to create classification schemes using logistic regression, a decision tree, a conditional decision tree, a random forest, and a support vector machine. The validation sample will be used to evaluate the effectiveness of these schemes. By using the same example throughout the chapter, you can compare the results of each approach.


17.2 Logistic regression

Logistic regression is a type of generalized linear model that is often used to predict a binary outcome from a set of numeric variables (see section 13.2 for details). The glm() function in the base R installation is used for fitting the model. Categorical predictors (factors) are automatically replaced with a set of dummy coded variables. All the predictors in the Wisconsin Breast Cancer data are numeric, so dummy coding is unnecessary. The next listing provides a logistic regression analysis of the data.

Listing 17.2 Logistic regression with glm()

> fit.logit <- glm(class~., data=df.train, family=binomial())   #b Fits the logistic regression
> summary(fit.logit)                                            #c Examines the model

Call:
glm(formula = class ~ ., family = binomial(), data = df.train)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7581 -0.1060 -0.0568  0.0124  2.6432

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)              -10.4276     1.4760   -7.06  1.6e-12 ***
clumpThickness             0.5243     0.1595    3.29   0.0010 **
sizeUniformity            -0.0481     0.2571   -0.19   0.8517
shapeUniformity            0.4231     0.2677    1.58   0.1141
maginalAdhesion            0.2924     0.1469    1.99   0.0465 *
singleEpithelialCellSize   0.1105     0.1798    0.61   0.5387
bareNuclei                 0.3357     0.1072    3.13   0.0017 **
blandChromatin             0.4235     0.2067    2.05   0.0405 *
normalNucleoli             0.2889     0.1399    2.06   0.0390 *
mitosis                    0.6906     0.3983    1.73   0.0829 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> prob <- predict(fit.logit, df.validate, type="response")      #d Classifies new cases
> logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE),
                       labels=c("benign", "malignant"))
> logit.perf <- table(df.validate$class, logit.pred,
                      dnn=c("Actual", "Predicted"))             #e Evaluates the predictive accuracy
> logit.perf
           Predicted
Actual      benign malignant
  benign       118         2
  malignant      4        76

First, a logistic regression model is fit using class as the dependent variable and the remaining variables as predictors (b). The model is based on the cases in the df.train data frame. The coefficients of the fitted model are displayed next (c). Section 13.2 provides guidelines for interpreting logistic regression coefficients.
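The cross-tabulation logit.perf compares actual with predicted class membership in the validation sample. One simple summary is the overall accuracy, the proportion of cases on the table's diagonal; the following minimal sketch computes it (note that the table contains 200 rather than 210 cases, presumably because cases with missing predictor values receive no prediction):

accuracy <- sum(diag(logit.perf)) / sum(logit.perf)   # correct predictions / total predictions
accuracy                                              # (118 + 76) / 200 = 0.97

By this measure, the logistic model classifies 97% of the validation cases correctly.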
