

which the number of variables is much greater than the number of observations. The provision of OOB error rates and measures of variable importance are also significant advantages.

A significant disadvantage is that it’s difficult to understand the classification rules (there are 500 trees!) and communicate them to others. Additionally, you need to store the entire forest in order to classify new cases.

The final classification model we’ll consider here is the support vector machine, described next.

17.5 Support vector machines

Support vector machines (SVMs) are a group of supervised machine-learning models that can be used for classification and regression. They’re popular at present, in part because of their success in developing accurate prediction models, and in part because of the elegant mathematics that underlie the approach. We’ll focus on the use of SVMs for binary classification.

SVMs seek an optimal hyperplane for separating two classes in a multidimensional space. The hyperplane is chosen to maximize the margin between the two classes’ closest points. The points on the boundary of the margin are called support vectors (they help define the margin), and the middle of the margin is the separating hyperplane.

For an N-dimensional space (that is, with N predictor variables), the optimal hyperplane (also called a linear decision surface) has N – 1 dimensions. If there are two variables, the surface is a line. For three variables, the surface is a plane. For 10 variables, the surface is a 9-dimensional hyperplane. Trying to picture it will give you a headache.

Consider the two-dimensional example shown in figure 17.4. Circles and triangles represent the two groups. The margin is the gap, represented by the distance between

[Figure 17.4 plot: "Linear Separable Features" — scatterplot with x1 on the horizontal axis and x2 on the vertical axis]

Figure 17.4 Two-group classification problem where the two groups are linearly separable. The separating hyperplane is indicated by the solid black line. The margin is the distance from the line to the dashed line on either side. The filled circles and triangles are the support vectors.


the two dashed lines. The points on the dashed lines (filled circles and triangles) are the support vectors. In the two-dimensional case, the optimal hyperplane is the black line in the middle of the gap. In this idealized example, the two groups are linearly separable—the line can completely separate the two groups without errors.

The optimal hyperplane is identified using quadratic programming to optimize the margin under the constraint that the data points on one side have an outcome value of +1 and the data points on the other side have an outcome value of –1. If the points are only “almost” separable (not all of them can be placed on the correct side), a penalizing term is added to the optimization to account for the errors, and “soft” margins are produced.
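
For readers who want the optimization stated explicitly, a standard soft-margin formulation (the generic textbook form, not something this chapter develops) is

$$
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
y_i\left(w^{\top} x_i + b\right) \ge 1 - \xi_i,\;\; \xi_i \ge 0,
$$

where each $y_i \in \{+1, -1\}$ is a class label, the slack variables $\xi_i$ absorb points that fall on the wrong side of the margin, and $C$ is the penalty attached to those violations. Setting all $\xi_i = 0$ recovers the hard-margin case for perfectly separable data.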

But the data may be fundamentally nonlinear. Consider the example in figure 17.5. There is no line that can correctly separate the circles and triangles. SVMs use kernel functions to transform the data into higher dimensions, in the hope that the classes will become more linearly separable there. Imagine transforming the data in figure 17.5 in such a way that the circles lift off the page. One way to do this is to transform the two-dimensional data into three dimensions using

(X, Y) → (X², √2·XY, Y²) ≡ (Z₁, Z₂, Z₃)

Then you can separate the triangles from the circles using a rigid sheet of paper (that is, a two-dimensional plane in what is now a three-dimensional space).
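
To make this concrete, here is a small R sketch (not from the book; the simulated data and variable names are invented for illustration) that applies the same transformation to a set of points where one class forms an inner cluster and the other a surrounding ring:

# Simulated data: 50 "circle" points near the origin, 50 "triangle" points in an outer ring
set.seed(1234)
theta <- runif(100, 0, 2*pi)
r     <- c(runif(50, 0, 1), runif(50, 2, 3))
X     <- r * cos(theta)
Y     <- r * sin(theta)
group <- factor(rep(c("circle", "triangle"), each=50))

# Map (X, Y) into three dimensions
Z1 <- X^2
Z2 <- sqrt(2) * X * Y
Z3 <- Y^2

# In the new space, Z1 + Z3 is the squared distance from the origin, so the two
# groups occupy non-overlapping ranges and a flat plane can separate them
tapply(Z1 + Z3, group, range)

The ranges reported for the two groups don’t overlap, which is exactly the sense in which the lifted data have become linearly separable.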

The mathematics of SVMs is complex and well beyond the scope of this book. Statnikov, Aliferis, Hardin, & Guyon (2011) offer a lucid and intuitive presentation of SVMs that goes into quite a bit of conceptual detail without getting bogged down in higher math.

[Figure 17.5 plot: "Features are not Linearly Separable" — scatterplot with X on the horizontal axis and Y on the vertical axis]

Figure 17.5 Two-group classification problem where the two groups aren’t linearly separable. The groups can’t be separated with a hyperplane (line).


SVMs are available in R using the ksvm() function in the kernlab package and the svm() function in the e1071 package. The former is more powerful, but the latter is a bit easier to use. The example in the next listing uses the latter (easy is good) to develop an SVM for the Wisconsin breast cancer data.

Listing 17.6 A support vector machine

> library(e1071)
> set.seed(1234)
> fit.svm <- svm(class~., data=df.train)
> fit.svm

Call:
svm(formula = class ~ ., data = df.train)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.1111

Number of Support Vectors:  76

> svm.pred <- predict(fit.svm, na.omit(df.validate))
> svm.perf <- table(na.omit(df.validate)$class,
                    svm.pred, dnn=c("Actual", "Predicted"))
> svm.perf

           Predicted
Actual      benign malignant
  benign       116         4
  malignant      3        77

Because predictor variables with larger variances typically have a greater influence on the development of SVMs, the svm() function scales each variable to a mean of 0 and standard deviation of 1 before fitting the model by default. As you can see, the predictive accuracy is good, but not quite as good as that found for the random forest approach in section 17.4. Unlike the random forest approach, the SVM is also unable to accommodate missing predictor values when classifying new cases.
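
If you’ve already standardized the predictors yourself, or want to keep them on their original scales, this automatic scaling can be turned off with svm()’s scale argument. A minimal sketch, reusing the df.train data frame from the listing above:

# Refit without internal rescaling; only sensible when the predictors are already
# on comparable scales (otherwise high-variance variables will dominate the fit)
fit.svm.raw <- svm(class~., data=df.train, scale=FALSE)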

17.5.1 Tuning an SVM

By default, the svm() function uses a radial basis function (RBF) to map samples into a higher-dimensional space (the kernel trick). The RBF kernel is often a good choice because it provides a nonlinear mapping and can therefore handle nonlinear relationships between the class labels and the predictors.

When fitting an SVM with the RBF kernel, two parameters can affect the results: gamma and cost. Gamma is a kernel parameter that controls the shape of the separating hyperplane. Larger values of gamma typically result in a larger number of support vectors. Gamma can also be thought of as a parameter that controls how widely a training sample “reaches,” with smaller values meaning far and larger values meaning close. Gamma must be greater than zero.

The cost parameter represents the cost of making errors. A large value severely penalizes errors and leads to a more complex classification boundary. There will be fewer misclassifications in the training sample, but over-fitting may result in poor predictive ability in new samples. Smaller values lead to a flatter classification boundary but may result in under-fitting. Like gamma, cost is always positive.
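
To see these trade-offs concretely, you can refit the model with hand-picked values and compare the results. This is only an illustrative sketch; the specific values are arbitrary, and df.train is the training data from listing 17.6:

# Work with complete cases, mirroring the handling of df.validate above
train.cc <- na.omit(df.train)

# Effect of gamma (cost held at 1): larger gamma usually yields more support vectors
fit.g.small <- svm(class~., data=train.cc, gamma=0.01, cost=1)
fit.g.large <- svm(class~., data=train.cc, gamma=10,   cost=1)
c(fit.g.small$tot.nSV, fit.g.large$tot.nSV)

# Effect of cost (gamma held at 0.1): larger cost usually lowers training error,
# at the risk of overfitting new data
fit.c.small <- svm(class~., data=train.cc, gamma=0.1, cost=0.01)
fit.c.large <- svm(class~., data=train.cc, gamma=0.1, cost=100)
mean(predict(fit.c.small, train.cc) != train.cc$class)   # training error rate
mean(predict(fit.c.large, train.cc) != train.cc$class)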

By default, the svm() function sets gamma to 1 / (number of predictors) and cost to 1. But a different combination of gamma and cost may lead to a more effective model. You can try fitting SVMs by varying parameter values one at a time, but a grid search is more efficient. You can specify a range of values for each parameter using the tune.svm() function. tune.svm() fits every combination of values and reports on the performance of each. An example is given next.

Listing 17.7 Tuning an RBF support vector machine

> set.seed(1234)
> tuned <- tune.svm(class~., data=df.train,                   #b Varies the parameters
                    gamma=10^(-6:1),
                    cost=10^(-10:10))
> tuned                                                        #c Prints the best model

- sampling method: 10-fold cross validation

- best parameters:
 gamma cost
  0.01    1

- best performance: 0.02904

> fit.svm <- svm(class~., data=df.train, gamma=.01, cost=1)   #d Fits the model with these parameters
> svm.pred <- predict(fit.svm, na.omit(df.validate))          #e Evaluates the cross-validation performance
> svm.perf <- table(na.omit(df.validate)$class,
                    svm.pred, dnn=c("Actual", "Predicted"))
> svm.perf

           Predicted
Actual      benign malignant
  benign       117         3
  malignant      3        77

First, an SVM model is fit with an RBF kernel and varying values of gamma and cost b. Eight values of gamma (ranging from 0.000001 to 10) and 21 values of cost (ranging from 0.0000000001 to 10,000,000,000) are specified. In all, 168 models (8 × 21) are fit and compared. The model with the fewest 10-fold cross-validated errors in the training sample has gamma = 0.01 and cost = 1.
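
Incidentally, you don’t have to retype the winning values: the object returned by tune.svm() stores them, along with a model already refit using them. A brief sketch, using the same tuned object as in listing 17.7:

tuned$best.parameters    # data frame containing the best gamma and cost
tuned$best.performance   # the corresponding cross-validated error rate
pred.best <- predict(tuned$best.model, na.omit(df.validate))   # best.model is refit on the training data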

Using these parameter values, a new SVM is fit to the training sample d. The model is then used to predict outcomes in the validation sample e, and the number of errors is displayed. Tuning the model c decreased the number of errors slightly (from seven to six). In many cases, tuning the SVM parameters will lead to greater gains.
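
If you’d rather compute those error counts than read them off the tables, the off-diagonal cells of each confusion matrix give the misclassifications directly; for example, for the tuned model:

sum(svm.perf) - sum(diag(svm.perf))   # 6 misclassifications in the validation sample
sum(diag(svm.perf)) / sum(svm.perf)   # overall accuracy (0.97 here)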
