p359

Support Vector Classifier

We begin by generating the observations, which belong to two classes, and checking whether the classes are linearly separable.

set.seed(1)
x=matrix(rnorm(20*2), ncol=2)
y=c(rep(-1,10), rep(1,10))
x[y==1,]=x[y==1,] + 1
plot(x, col=(3-y))

They are not.

We fit the support vector classifier. Note that in order for the svm() function to perform classification (as opposed to SVM-based regression), we must encode the response as a factor variable. We now create a data frame with the response coded as a factor.

The argument scale=FALSE tells the svm() function not to scale each feature to have mean zero or standard deviation one; depending on the application, one might prefer to use scale=TRUE.

dat=data.frame(x=x, y=as.factor(y))
library(e1071)
svmfit=svm(y~., data=dat, kernel="linear", cost=10, scale=FALSE)

Plot the support vector classifier

Note that here the second feature is plotted on the x-axis and the first feature is plotted on the y-axis, in contrast to the behavior of the usual plot() function in R.

plot(svmfit, dat)

7 support vectors

svmfit$index
## [1]  1  2  5  7 14 16 17
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10, scale = FALSE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

There were seven support vectors, four in one class and three in the other.

Observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors.

The cost Argument

Notes

A cost argument allows us to specify the cost of a violation to the margin.

When the cost argument is small, then the margins will be wide and many support vectors will be on the margin or will violate the margin. When the cost argument is large, then the margins will be narrow and there will be few support vectors on the margin or violating the margin.
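
As a quick illustration of this behavior (a sketch, assuming the dat object created above is still in the workspace), we can refit the classifier over a grid of cost values and count the resulting support vectors:

for (cost.val in c(0.01, 0.1, 1, 10, 100)) {
  # refit with the same linear kernel, varying only cost
  fit=svm(y~., data=dat, kernel="linear", cost=cost.val, scale=FALSE)
  cat("cost =", cost.val, "-> support vectors:", length(fit$index), "\n")
}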

Change cost to 0.1

svmfit=svm(y~., data=dat, kernel="linear", cost=0.1, scale=FALSE)
plot(svmfit, dat)

16 support vectors

Now that a smaller value of the cost parameter is being used, we obtain a larger number of support vectors, because the margin is now wider.

If a training observation is not a support vector, then its \(\alpha_i\) equals zero.

svmfit$index
##  [1]  1  2  3  4  5  7  9 10 12 13 14 15 16 17 18 20
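
The fitted object reflects this. A brief sketch, assuming the svmfit object from the cost=0.1 fit above: coefficients (the training labels times the \(\alpha_i\) values) are stored only for the support vectors listed in svmfit$index; every other training observation has \(\alpha_i = 0\) and is not stored.

svmfit$coefs   # one nonzero coefficient per support vector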

Tune/Cross Validate

Use tune() to perform cross-validation. By default, tune() performs ten-fold cross-validation on a set of models of interest.

set.seed(1)
tune.out=tune(svm, y~., data=dat, kernel="linear",
              ranges=list(cost=c(0.001, 0.01, 0.1, 1,5,10,100)))

Summary

We can easily access the cross-validation errors for each of these models using the summary() command:

summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##   0.1
## 
## - best performance: 0.05 
## 
## - Detailed performance results:
##    cost error dispersion
## 1 1e-03  0.55  0.4377975
## 2 1e-02  0.55  0.4377975
## 3 1e-01  0.05  0.1581139
## 4 1e+00  0.15  0.2415229
## 5 5e+00  0.15  0.2415229
## 6 1e+01  0.15  0.2415229
## 7 1e+02  0.15  0.2415229

We see that cost=0.1 results in the lowest cross-validation error rate.

Best Model

bestmodel=tune.out$best.model
summary(bestmodel)
## 
## Call:
## best.tune(method = svm, train.x = y ~ ., data = dat, ranges = list(cost = c(0.001, 
##     0.01, 0.1, 1, 5, 10, 100)), kernel = "linear")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  0.1 
## 
## Number of Support Vectors:  16
## 
##  ( 8 8 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1

Predict

Generate test data

xtest=matrix(rnorm(20*2), ncol=2)
ytest=sample(c(-1,1), 20, replace = TRUE)
xtest[ytest==1,]=xtest[ytest==1,]+1
testdat=data.frame(x=xtest, y=as.factor(ytest))

Predict and Generate Confusion Matrix

ypred=predict(bestmodel, testdat)
table(predict=ypred, truth=testdat$y)
##        truth
## predict -1 1
##      -1  9 1
##      1   2 8
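
The overall test error rate can be read off the confusion matrix directly; a minimal sketch using the ypred object from above:

mean(ypred != testdat$y)   # (1 + 2) / 20 = 0.15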

Change cost to 0.01

svmfit=svm(y~., data=dat, kernel="linear", cost=.01, scale=FALSE)
ypred=predict(svmfit, testdat)
table(predict=ypred, truth=testdat$y)
##        truth
## predict -1  1
##      -1 11  6
##      1   0  3

Linearly Separable Example

p362

Consider a situation in which the two classes are linearly separable.

x[y==1,]=x[y==1,]+0.5
plot(x, col=(y+5)/2, pch=19)

The observations are just barely linearly separable.

We fit the support vector classifier and plot the resulting hyperplane, using a very large value of cost so that no observations are misclassified.

dat=data.frame(x=x,y=as.factor(y))
svmfit=svm(y~., data=dat, kernel="linear", cost=1e5)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1e+05)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1e+05 
## 
## Number of Support Vectors:  3
## 
##  ( 1 2 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, dat)

No training errors were made and only three support vectors were used. However, we can see from the figure that the margin is very narrow.

Try a smaller value of cost=1

svmfit=svm(y~., data=dat, kernel="linear", cost=1)
summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  1 
## 
## Number of Support Vectors:  7
## 
##  ( 4 3 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  -1 1
plot(svmfit, dat)

Using cost=1, we misclassify a training observation, but we also obtain a much wider margin and make use of seven support vectors. It seems likely that this model will perform better on test data than the model with cost=1e5.

A higher value of cost produces a narrower margin that fits the training data more closely, and therefore carries a greater risk of overfitting.
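
A minimal sketch of that point, under the assumption that we draw fresh test data from the same shifted distribution as the training data above (the exact error counts will vary with the random draw):

xte=matrix(rnorm(20*2), ncol=2)
yte=c(rep(-1,10), rep(1,10))
xte[yte==1,]=xte[yte==1,]+1.5   # same total shift (+1, then +0.5) as the training data
te=data.frame(x=xte, y=as.factor(yte))
mean(predict(svm(y~., data=dat, kernel="linear", cost=1e5), te) != te$y)  # narrow-margin fit
mean(predict(svm(y~., data=dat, kernel="linear", cost=1), te) != te$y)    # wide-margin fit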

Support Vector Machine

p363

Notes

The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. p350

When the support vector classifier is combined with a non-linear kernel such as (9.22), the resulting classifier is known as a support vector machine.

To fit an SVM we use a different value of the kernel argument: kernel="polynomial" fits an SVM with a polynomial kernel, and kernel="radial" fits an SVM with a radial kernel (see the illustrative calls after the list below).

  • In the former case we also use the degree argument to specify a degree for the polynomial kernel (this is d in (9.22)),
  • in the latter case we use gamma to specify a value of \(\gamma\) for the radial basis kernel (9.24).
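
For concreteness, two illustrative calls, assuming a data frame dat with a factor response y; the degree and gamma values here are arbitrary examples, not tuned choices:

svm.poly=svm(y~., data=dat, kernel="polynomial", degree=3, cost=1)  # polynomial kernel of degree d=3
svm.rad=svm(y~., data=dat, kernel="radial", gamma=1, cost=1)        # radial kernel with gamma=1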

Lab

We first generate some data with a non-linear class boundary.

set.seed(1)
x=matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2        # shift the first 100 observations up and to the right
x[101:150,]=x[101:150,]-2    # shift the next 50 down and to the left
y=c(rep(1,150),rep(2,50))    # the remaining 50 (unshifted) observations form class 2
dat=data.frame(x=x,y=as.factor(y))

Plotting the data makes it clear that the class boundary is indeed non-linear.

plot(x, col=y)

The data is randomly split into training and testing groups. We then fit the training data using the svm() function with a radial kernel and \(\gamma = 1\)

train=sample(200,100)
svmfit=svm(y~., data=dat[train,], kernel="radial", gamma=1,
           cost=1)
plot(svmfit, dat[train,])

summary(svmfit)
## 
## Call:
## svm(formula = y ~ ., data = dat[train, ], kernel = "radial", gamma = 1, 
##     cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  31
## 
##  ( 16 15 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  1 2

Increase cost to 1e5 to reduce training errors.

If we increase the value of cost, we can reduce the number of training errors. However, this comes at the price of a more irregular decision boundary that seems to be at risk of overfitting the data.

svmfit=svm(y~., data=dat[train,], kernel="radial", gamma=1, cost=1e5)
plot(svmfit, dat[train,])

Perform cross-validation using tune() to select the best choice of \(\gamma\) and cost for an SVM with a radial kernel.

set.seed(1)
tune.out=tune(svm, y~., data=dat[train,],
              kernel="radial",
              ranges=list(cost=c(0.1,1,10,100,1000),
                          gamma=c(0.5,1,2,3,4) ))
summary(tune.out)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost gamma
##     1   0.5
## 
## - best performance: 0.07 
## 
## - Detailed performance results:
##     cost gamma error dispersion
## 1  1e-01   0.5  0.26 0.15776213
## 2  1e+00   0.5  0.07 0.08232726
## 3  1e+01   0.5  0.07 0.08232726
## 4  1e+02   0.5  0.14 0.15055453
## 5  1e+03   0.5  0.11 0.07378648
## 6  1e-01   1.0  0.22 0.16193277
## 7  1e+00   1.0  0.07 0.08232726
## 8  1e+01   1.0  0.09 0.07378648
## 9  1e+02   1.0  0.12 0.12292726
## 10 1e+03   1.0  0.11 0.11005049
## 11 1e-01   2.0  0.27 0.15670212
## 12 1e+00   2.0  0.07 0.08232726
## 13 1e+01   2.0  0.11 0.07378648
## 14 1e+02   2.0  0.12 0.13165612
## 15 1e+03   2.0  0.16 0.13498971
## 16 1e-01   3.0  0.27 0.15670212
## 17 1e+00   3.0  0.07 0.08232726
## 18 1e+01   3.0  0.08 0.07888106
## 19 1e+02   3.0  0.13 0.14181365
## 20 1e+03   3.0  0.15 0.13540064
## 21 1e-01   4.0  0.27 0.15670212
## 22 1e+00   4.0  0.07 0.08232726
## 23 1e+01   4.0  0.09 0.07378648
## 24 1e+02   4.0  0.13 0.14181365
## 25 1e+03   4.0  0.15 0.13540064

The best choice of parameters involves cost=1 and gamma=0.5.

table(true=dat[-train,"y"], pred=predict(tune.out$best.model,
                                         newdata=dat[-train,]))
##     pred
## true  1  2
##    1 67 10
##    2  2 21

12% of test observations are misclassified by this SVM.
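
Equivalently, a quick sketch computing this error rate directly from the best model selected above:

mean(predict(tune.out$best.model, newdata=dat[-train,]) != dat[-train,"y"])   # (10 + 2) / 100 = 0.12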

ROC Curves

p365

library(ROCR)
rocplot=function(pred, truth, ...) {
  predob = prediction(pred, truth)
  perf = performance(predob, "tpr", "fpr")
  plot(perf, ...)
}

In order to obtain the fitted values for a given SVM model fit, we use decision.values=TRUE when fitting svm(); the predict() function will then output the fitted values.

svmfit.opt=svm(y~., data=dat[train,], kernel="radial",
               gamma=2, 
               cost=1,
               decision.values=T)
fitted.training=attributes(predict(svmfit.opt, dat[train,],
                          decision.values=TRUE))$decision.values

ROC plot

By increasing \(\gamma\) we can produce a more flexible fit and generate further improvements in accuracy.

svmfit.flex=svm(y~., data=dat[train,], kernel="radial",
                gamma=50, 
                cost=1, 
                decision.values=T)
fitted.gamma50=attributes(predict(svmfit.flex,dat[train,],decision.values=T))$decision.values
{
par(mfrow=c(1,2))
rocplot(fitted.training, dat[train,"y"], main="Training Data")
rocplot(fitted.gamma50, dat[train,"y"], add=T, col="red")
}

When we compute the ROC curves on the test data, the model with \(\gamma=2\) appears to provide the most accurate results.

{fitted.test=attributes(predict(svmfit.opt,dat[-train,],
                                decision.values=T))$decision.values
rocplot(fitted.test, dat[-train,"y"], main="Test Data")
fitted.test2=attributes(predict(svmfit.flex,dat[-train,],decision.values=T))$decision.values
rocplot(fitted.test2, dat[-train,"y"], add=T,col="red")
}
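
ROCR can also report the area under the curve numerically. A brief sketch, assuming the fitted.test decision values from above; note that if the sign convention of the decision values is flipped relative to the factor levels, the AUC will come out as one minus this value:

predob=prediction(fitted.test, dat[-train,"y"])     # same prediction object as used for the ROC curve
unlist(performance(predob, "auc")@y.values)          # area under the test-data ROC curve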

SVM with Multiple Classes

p366

If the response is a factor containing more than two levels, then the svm() function will perform multi-class classification using the one-versus-one approach. We explore that setting here by generating a third class of observations.

set.seed(1)
x=rbind(x, matrix(rnorm(50*2), ncol=2))
y=c(y, rep(0,50))
x[y==0,2]=x[y==0,2]+2
dat=data.frame(x=x, y=as.factor(y))
par(mfrow=c(1,1))
plot(x,col=(y+1))

Fit an SVM to the data

svmfit=svm(y~., data=dat, kernel="radial", cost=10, gamma=1)
plot(svmfit, dat)
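
Since the response now has three levels, svm() carries out one-versus-one classification internally. A small sketch of the resulting training confusion matrix, using the svmfit object just fit:

table(predict(svmfit, dat), dat$y)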

Application to Gene Expression Data

We now examine the Khan data set, which consists of a number of tissue samples corresponding to four distinct types of small round blue cell tumors. For each tissue sample, gene expression measurements are available.

library(ISLR)
names(Khan)
## [1] "xtrain" "xtest"  "ytrain" "ytest"
dim(Khan$xtrain)
## [1]   63 2308
dim(Khan$xtest)
## [1]   20 2308
length(Khan$ytrain)
## [1] 63
length(Khan$ytest)
## [1] 20

This data set consists of expression measurements for 2,308 genes. The training and test sets consist of 63 and 20 observations respectively.

table(Khan$ytrain)
## 
##  1  2  3  4 
##  8 23 12 20
table(Khan$ytest)
## 
## 1 2 3 4 
## 3 6 6 5

We will use a support vector approach to predict cancer subtype using gene expression measurements. In this data set, there are a very large number of features relative to the number of observations. This suggests that we should use a linear kernel, because the additional flexibility that will result from using a polynomial or radial kernel is unnecessary.

dat=data.frame(x=Khan$xtrain, y=as.factor(Khan$ytrain))
out=svm(y~., data=dat, kernel="linear",cost=10)
summary(out)
## 
## Call:
## svm(formula = y ~ ., data = dat, kernel = "linear", cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  linear 
##        cost:  10 
## 
## Number of Support Vectors:  58
## 
##  ( 20 20 11 7 )
## 
## 
## Number of Classes:  4 
## 
## Levels: 
##  1 2 3 4
table(out$fitted, dat$y)
##    
##      1  2  3  4
##   1  8  0  0  0
##   2  0 23  0  0
##   3  0  0 12  0
##   4  0  0  0 20

We see that there are no training errors. In fact, this is not surprising, because the large number of variables relative to the number of observations implies that it is easy to find hyperplanes that fully separate the classes.

dat.te=data.frame(x=Khan$xtest, y=as.factor(Khan$ytest))
pred.te=predict(out, newdata=dat.te)
table(pred.te, dat.te$y)
##        
## pred.te 1 2 3 4
##       1 3 0 0 0
##       2 0 6 2 0
##       3 0 0 4 0
##       4 0 0 0 5

We see that using cost=10 yields two test set errors on this data.
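
Expressed as an error rate, a one-line sketch using the objects above:

mean(pred.te != dat.te$y)   # 2 / 20 = 0.10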