

17.4 Random forests

A random forest is an ensemble learning approach to supervised learning. Multiple predictive models are developed, and the results are aggregated to improve classification rates. You can find a comprehensive introduction to random forests, written by Leo Breiman and Adele Cutler, at http://mng.bz/7Nul.

The algorithm for a random forest involves sampling cases and variables to create a large number of decision trees. Each case is classified by each decision tree. The most common classification for that case is then used as the outcome.

Assume that N is the number of cases in the training sample and M is the number of variables. Then the algorithm is as follows:

1. Grow a large number of decision trees by sampling N cases with replacement from the training set.

2. Sample m < M variables at each node. These variables are considered candidates for splitting in that node. The value m is the same for each node.

3. Grow each tree fully without pruning (the minimum node size is set to 1).

4. Terminal nodes are assigned to a class based on the mode of cases in that node.

5. Classify new cases by sending them down all the trees and taking a vote: majority rules (see the sketch following this list).
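As a concrete illustration, here is a minimal, schematic R sketch of these steps built on rpart trees. It is not the randomForest implementation: the function names (grow_forest, predict_forest) are invented for this example, the class label is assumed to sit in a column named y, and for simplicity the m candidate variables are drawn once per tree rather than afresh at every node, as step 2 requires.

    library(rpart)

    # Schematic only: grows ntree trees on bootstrap samples (step 1), each
    # restricted to m randomly chosen predictors (an approximation of step 2),
    # fully grown without pruning (step 3).
    grow_forest <- function(train, ntree = 100,
                            m = floor(sqrt(ncol(train) - 1))) {
      lapply(seq_len(ntree), function(i) {
        boot <- train[sample(nrow(train), replace = TRUE), ]  # bootstrap the N cases
        vars <- sample(setdiff(names(train), "y"), m)         # candidate predictors
        rpart(reformulate(vars, response = "y"), data = boot,
              method = "class",
              control = rpart.control(minsplit = 2, cp = 0))  # no pruning
      })
    }

    # Steps 4-5: each tree classifies each new case; the majority class wins.
    predict_forest <- function(forest, newdata) {
      votes <- sapply(forest, function(tree)
        as.character(predict(tree, newdata, type = "class")))
      apply(votes, 1, function(v) names(which.max(table(v))))
    }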

An out-of-bag (OOB) error estimate is obtained by classifying the cases that aren’t selected when building a tree, using that tree. This is an advantage when a validation sample is unavailable. Random forests also provide a natural measure of variable importance, as you’ll see.
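For example, once a forest has been fit (as with fit.forest in listing 17.5 below), the randomForest object stores the OOB predictions and the running OOB error rate, which you can inspect directly:

    > head(fit.forest$predicted)                    # OOB-predicted class per training case
    > fit.forest$err.rate[fit.forest$ntree, "OOB"]  # OOB error estimate after the final tree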

Random forests are grown using the randomForest() function in the randomForest package. The default number of trees is 500, the default number of variables sampled at each node is sqrt(M), and the minimum node size is 1.
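These defaults can be overridden through the ntree, mtry, and nodesize arguments; the values below are purely illustrative:

    > fit <- randomForest(class ~ ., data = df.train,
                          na.action = na.roughfix,
                          ntree = 1000,   # number of trees (default 500)
                          mtry = 3,       # variables sampled at each split (default sqrt(M))
                          nodesize = 1)   # minimum size of terminal nodes (default 1)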

The following listing provides the code and results for predicting malignancy status in the breast cancer data.

Listing 17.5 Random forest

> library(randomForest)
> set.seed(1234)
> fit.forest <- randomForest(class~., data=df.train,        b Grows the forest
                             na.action=na.roughfix,
                             importance=TRUE)
> fit.forest

Call:
 randomForest(formula = class ~ ., data = df.train,
     importance = TRUE, na.action = na.roughfix)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 3.68%
Confusion matrix:
          benign malignant class.error
benign       319        10      0.0304
malignant      8       152      0.0500

> importance(fit.forest, type=2)        c Determines variable importance
                         MeanDecreaseGini
clumpThickness                      12.50
sizeUniformity                      54.77
shapeUniformity                     48.66
maginalAdhesion                      5.97
singleEpithelialCellSize            14.30
bareNuclei                          34.02
blandChromatin                      16.24
normalNucleoli                      26.34
mitosis                              1.81

> forest.pred <- predict(fit.forest, df.validate)        d Classifies new cases
> forest.perf <- table(df.validate$class, forest.pred,
                       dnn=c("Actual", "Predicted"))
> forest.perf
           Predicted
Actual      benign malignant
  benign       117         3
  malignant      1        79

First, the randomForest() function is used to grow 500 traditional decision trees by sampling 489 observations with replacement from the training sample and sampling 3 variables at each node of each tree b. The na.action=na.roughfix option replaces missing values on numeric variables with column medians, and missing values on categorical variables with the modal category for that variable (breaking ties at random).
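The effect of na.roughfix() is easy to see on a toy data frame (this example is illustrative, not from the breast cancer data):

    > df <- data.frame(x = c(1, NA, 4), f = factor(c("a", "a", NA)))
    > na.roughfix(df)      # numeric NA -> column median; factor NA -> modal level
        x f
    1 1.0 a
    2 2.5 a
    3 4.0 a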

Random forests can provide a natural measure of variable importance, requested with the importance=TRUE option and printed with the importance() function c. The relative importance measure specified by the type=2 option is the total decrease in node impurities (heterogeneity) from splitting on that variable, averaged over all trees. Node impurity is measured with the Gini coefficient. sizeUniformity is the most important variable and mitosis is the least important.
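Setting type=1 instead gives the permutation-based measure (mean decrease in accuracy), and varImpPlot() displays both measures side by side; both require importance=TRUE at fit time:

    > importance(fit.forest, type=1)   # mean decrease in accuracy (permutation-based)
    > varImpPlot(fit.forest)           # dot plots of both importance measures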

Finally, the validation sample is classified using the random forest and the predictive accuracy is calculated d. Note that cases with missing values in the validation sample aren’t classified. The prediction accuracy (98% overall) is good.
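The overall accuracy quoted can be read straight off the cross-tabulation:

    > sum(diag(forest.perf)) / sum(forest.perf)   # (117 + 79) / 200
    [1] 0.98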

Whereas the randomForest package provides forests based on traditional decision trees, the cforest() function in the party package can be used to generate random forests based on conditional inference trees. If predictor variables are highly correlated, a random forest using conditional inference trees may provide better predictions.
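A minimal sketch of the conditional-inference analogue follows, assuming the party package is installed; the control settings mirror the randomForest run above and are illustrative, and note that cforest()'s missing-value handling differs from randomForest's:

    > library(party)
    > fit.cforest <- cforest(class ~ ., data = df.train,
                             controls = cforest_unbiased(ntree = 500, mtry = 3))
    > cforest.pred <- predict(fit.cforest, newdata = df.validate, type = "response")
    > table(df.validate$class, cforest.pred, dnn = c("Actual", "Predicted"))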

Random forests tend to be very accurate compared with other classification methods. Additionally, they can handle large problems (many observations and variables), can handle large amounts of missing data in the training set, and can handle cases in which the number of variables is much greater than the number of observations.
