17.4 Random forests

A random forest is an ensemble learning approach to supervised learning. Multiple predictive models are developed, and the results are aggregated to improve classification rates. You can find a comprehensive introduction to random forests, written by Leo Breiman and Adele Cutler, at http://mng.bz/7Nul.

The algorithm for a random forest involves sampling cases and variables to create a large number of decision trees. Each case is classified by each decision tree. The most common classification for that case is then used as the outcome.

Assume that N is the number of cases in the training sample and M is the number of variables. Then the algorithm is as follows:

1. Grow a large number of decision trees, each from a sample of N cases drawn with replacement from the training set.

2. Sample m < M variables at each node. These variables are considered candidates for splitting in that node. The value m is the same for each node.

3. Grow each tree fully without pruning (the minimum node size is set to 1).

4. Assign each terminal node to a class based on the mode of the cases in that node.

5. Classify new cases by sending them down all the trees and taking a vote (majority rules).
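The sampling and voting steps above can be sketched in base R. This is only an illustration under stated assumptions: the built-in iris data stands in for a training set, the votes vector is made up for the example, and no trees are actually grown.

```r
set.seed(1234)

# Assumption: iris is only a stand-in training set (4 predictors + 1 class)
N <- nrow(iris)            # number of cases
M <- ncol(iris) - 1        # number of predictor variables
m <- floor(sqrt(M))        # number of candidate variables per node (m < M)

# Step 1: draw N cases with replacement (one bootstrap sample per tree)
in.bag  <- sample(N, N, replace = TRUE)
out.bag <- setdiff(seq_len(N), in.bag)   # cases this tree never sees

# Step 2: draw m candidate variables for a single node
candidates <- sample(names(iris)[seq_len(M)], m)

# Step 5: majority vote over per-tree classifications of one new case
# (hypothetical votes from five trees)
votes <- c("benign", "malignant", "benign", "benign", "malignant")
majority <- names(which.max(table(votes)))
majority   # "benign"
```

In a real forest, step 2 is repeated at every node of every tree, which is why the fitting is left to a dedicated package rather than written by hand.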

An out-of-bag (OOB) error estimate is obtained by classifying the cases that aren’t selected when building a tree, using that tree. This is an advantage when a validation sample is unavailable. Random forests also provide a natural measure of variable importance, as you’ll see.
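To see why the OOB estimate has enough cases to work with, it may help to check how many cases a single bootstrap sample leaves out. The sketch below uses N = 489, the training-sample size in this section; the variable names are illustrative only.

```r
set.seed(1234)
N <- 489                                  # training-sample size used in this section

# Probability that a given case is left out of one bootstrap sample;
# this approaches exp(-1), a little over one third, as N grows
p.oob <- (1 - 1/N)^N

# Empirical check with a single bootstrap sample
in.bag <- sample(N, N, replace = TRUE)
oob.fraction <- 1 - length(unique(in.bag)) / N

c(theoretical = p.oob, observed = oob.fraction)
```

Because each tree leaves out roughly a third of the cases, every case is out-of-bag for many of the trees in a 500-tree forest, and its OOB classification is the vote among just those trees.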

Random forests are grown using the randomForest() function in the randomForest package. The default number of trees is 500, the default number of variables sampled at each node is sqrt(M), and the minimum node size is 1.
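As a hedged illustration, the defaults described above can be written out explicitly. The sketch assumes the randomForest package is installed and uses the built-in iris data (M = 4 predictors) as a stand-in, not the breast cancer data.

```r
library(randomForest)
set.seed(1234)

# Assumption: iris stands in for a training set with M = 4 predictors
M <- ncol(iris) - 1

fit <- randomForest(Species ~ ., data = iris,
                    ntree    = 500,             # number of trees (the default)
                    mtry     = floor(sqrt(M)),  # variables tried at each split
                    nodesize = 1)               # minimum terminal node size

fit$ntree   # 500
fit$mtry    # 2
```

Spelling the arguments out changes nothing here, but it makes it easy to experiment with, say, fewer trees or a larger minimum node size.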

The following listing provides the code and results for predicting malignancy status in the breast cancer data.

Listing 17.5 Random forest

> library(randomForest)
> set.seed(1234)
> fit.forest <- randomForest(class~., data=df.train,        # (b) Grows the forest
                             na.action=na.roughfix,
                             importance=TRUE)
> fit.forest
Call:
randomForest(formula = class ~ ., data = df.train,
importance = TRUE,	na.action = na.roughfix)
Type of random forest: classification
Number of trees:	500
No. of variables tried at each split:	3

OOB estimate of error rate: 3.68%

Confusion matrix:
	benign malignant class.error
benign	319	10	0.0304
malignant	8	152	0.0500

> importance(fit.forest, type=2)                            # (c) Determines variable importance
		MeanDecreaseGini
clumpThickness		12.50
sizeUniformity		54.77
shapeUniformity		48.66
maginalAdhesion		5.97
singleEpithelialCellSize		14.30
bareNuclei		34.02
blandChromatin		16.24
normalNucleoli		26.34
mitosis		1.81
> forest.pred <- predict(fit.forest, df.validate)           # (d) Classifies new cases
> forest.perf <- table(df.validate$class, forest.pred,
	dnn=c("Actual", "Predicted"))
> forest.perf
	Predicted
Actual	benign malignant
benign	117	3
malignant	1	79

First, the randomForest() function is used to grow 500 traditional decision trees by sampling 489 observations with replacement from the training sample and sampling 3 variables at each node of each tree (b). The na.action=na.roughfix option replaces missing values on numeric variables with column medians, and missing values on categorical variables with the modal category for that variable (breaking ties at random).
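A small example may make the effect of na.roughfix concrete. The toy data frame below is hypothetical and exists only for illustration; the sketch assumes the randomForest package is installed.

```r
library(randomForest)

# Hypothetical toy data with one missing numeric and one missing factor value
toy <- data.frame(x = c(1, 2, NA, 10),
                  g = factor(c("a", "a", "b", NA)))

fixed <- na.roughfix(toy)

fixed$x                 # NA replaced by the median of 1, 2, 10
as.character(fixed$g)   # NA replaced by the most frequent level, "a"
```

The imputation is deliberately crude: it ignores relationships among variables, so it is best seen as a convenience for getting a forest fitted rather than as a careful treatment of missing data.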

Random forests can provide a natural measure of variable importance, requested with the importance=TRUE option, and printed with the importance() function (c). The relative importance measure specified by the type=2 option is the total decrease in node impurities (heterogeneity) from splitting on that variable, averaged over all trees. Node impurity is measured with the Gini index. sizeUniformity is the most important variable and mitosis is the least important.
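The ranking is easier to read once the values are sorted. The sketch below simply copies the MeanDecreaseGini values from the listing output into a named vector; the names gini and ranked are illustrative.

```r
# MeanDecreaseGini values copied from the listing output
gini <- c(clumpThickness           = 12.50,
          sizeUniformity           = 54.77,
          shapeUniformity          = 48.66,
          maginalAdhesion          =  5.97,
          singleEpithelialCellSize = 14.30,
          bareNuclei               = 34.02,
          blandChromatin           = 16.24,
          normalNucleoli           = 26.34,
          mitosis                  =  1.81)

# Order from most to least important
ranked <- sort(gini, decreasing = TRUE)

# Each variable's share of the total decrease, in percent
round(100 * ranked / sum(ranked), 1)

names(ranked)[1]                # "sizeUniformity"
names(ranked)[length(ranked)]   # "mitosis"
```

With the fitted object at hand, the varImpPlot() function in the randomForest package should produce a dot chart of the same ordering.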

Finally, the validation sample is classified using the random forest and the predictive accuracy is calculated (d). Note that cases with missing values in the validation sample aren’t classified. The prediction accuracy is good: 196 of the 200 classified validation cases (98%) are predicted correctly.
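The accuracy figure can be reproduced from the confusion matrix. The sketch below rebuilds the printed table by hand; treating malignant as the positive class for sensitivity and specificity is an assumption made for this example.

```r
# Confusion matrix copied from the listing output (rows = actual, columns = predicted)
forest.perf <- matrix(c(117, 1, 3, 79), nrow = 2,
                      dimnames = list(Actual    = c("benign", "malignant"),
                                      Predicted = c("benign", "malignant")))

# Proportion of all cases on the diagonal
accuracy <- sum(diag(forest.perf)) / sum(forest.perf)

# Assumption: malignant is treated as the positive class
sensitivity <- forest.perf["malignant", "malignant"] / sum(forest.perf["malignant", ])
specificity <- forest.perf["benign", "benign"] / sum(forest.perf["benign", ])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
# accuracy 0.98, sensitivity 0.9875, specificity 0.975
```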

Whereas the randomForest package provides forests based on traditional decision trees, the cforest() function in the party package can be used to generate random forests based on conditional inference trees. If predictor variables are highly correlated, a random forest using conditional inference trees may provide better predictions.
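A minimal sketch of the conditional inference alternative follows. It assumes the party package is installed, uses the built-in iris data as a stand-in for the training data, and picks ntree = 100 and mtry = 2 purely for illustration.

```r
library(party)
set.seed(1234)

# Assumption: iris stands in for the training data
fit.cforest <- cforest(Species ~ ., data = iris,
                       controls = cforest_unbiased(ntree = 100, mtry = 2))

# Out-of-bag predictions for the training cases
cforest.pred <- predict(fit.cforest, OOB = TRUE)

table(iris$Species, cforest.pred, dnn = c("Actual", "Predicted"))
```

Whether this beats a traditional random forest depends on the data, so it is worth comparing both on a validation sample.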

Random forests tend to be very accurate compared with other classification methods. Additionally, they can handle large problems (many observations and variables), can handle large amounts of missing data in the training set, and can handle cases in
