Chapter 9: Support Vector Machines
A simple linear classifier
In this chapter we learn (yet) another algorithm for classification. We start with a case where the boundary is linear, and then make tweaks to accommodate more complex cases. Above we see a simple linear classifier: everything to the left of the painted line is road & everything to the right is grass.
We have estimated both linear & non-linear boundaries before using logistic regression & decision trees. Support vector machines arrive at the decision boundary by optimizing a “hyperplane”.
Classification using a hyperplane
In p-dimensional space, a hyperplane is a flat subspace of dimension p − 1. In 2-dimensional space a hyperplane is a line (1-dimensional); in 3-dimensional space it is a plane (2-dimensional). We stop the geometrical interpretation there. Think of a hyperplane as an object that divides the current space into two distinct regions.
Before proceeding it’s important to note that classification using a hyperplane works only when the classes are linearly separable. Below we see a graphic depicting this situation.
(A) is linearly separable whilst (B) is not
The equation of a hyperplane is quite similar to linear regression, albeit with a slight tweak. We classify each point to a given class based on which side of the hyperplane it falls on. In the above example, all points above the hyperplane are classified as black while all points below are classified as white.
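To make the comparison concrete: in ISL’s notation, a hyperplane in p dimensions is the set of points X = (X₁, …, X_p) satisfying

\[
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0 .
\]

Just like a fitted regression equation it is linear in the features; the tweak is that we set it equal to zero and care only about which side of it a point falls on.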
Currently we formulate the classification rule for 2 classes, say (−1, +1). That is, each observation X has a corresponding response value Y of −1 or +1. We construct a hyperplane that has the following property.
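In ISL’s notation the property reads: writing f(x) = β₀ + β₁x₁ + ⋯ + β_p x_p, the hyperplane separates the training data when

\[
y_i\,(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) > 0 \quad \text{for all } i = 1, \dots, n,
\]

i.e. every training observation lies strictly on its own class’s side of the hyperplane.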
A test observation x* is assigned a class depending on which side of the hyperplane it lies. If f(x*) > 0, then x* is assigned to class +1; if f(x*) < 0, then x* is assigned to class −1.
3 possible hyperplanes (left) & the optimal hyperplane (right). (Source : ISL Python)
But here we run into our first problem. If 2 classes are linearly separable, then there exist infinitely many separating hyperplanes between them. How do we pick one among all these options? Which one is best?
Maximal Margin Classifier
A natural choice for the best hyperplane is the one that is farthest from the training observations on either side, because it leaves more breathing room when working with test data. This is called the maximal margin classifier. In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.
We compute the perpendicular distance of each training observation to a given hyperplane. The smallest distance of this set is called the margin. We pick the hyperplane with the largest margin. In one (weird) sentence, we pick the hyperplane that has the farthest minimum distance to the training observations.
Optimal hyperplane & margin. (Source : ISL Python)
In the picture above, there are 3 points closest to the hyperplane. The margin is the distance from the hyperplane to either of the dashed lines. The points that lie on the dashed lines are called the support vectors. They are called so because moving them even a little will move the hyperplane as well.
Interestingly, the optimal hyperplane depends only on the support vectors. Moving any of the other observations (provided they do not cross into the margin) would not affect the optimal hyperplane. This means that the maximal margin classifier depends only on a small subset of the training observations! Another thing to notice is that the maximal margin classifier, by construction, will have 100% training accuracy!
Let’s now flesh out the algorithm in a manner that a computer can follow.
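Following ISL, the maximal margin hyperplane is the solution to the optimization problem

\[
\begin{aligned}
&\underset{\beta_0, \beta_1, \dots, \beta_p,\, M}{\text{maximize}} \; M \\
&\text{subject to} \;\; \sum_{j=1}^{p} \beta_j^2 = 1, \\
&\;\; y_i\,(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M \quad \text{for all } i = 1, \dots, n.
\end{aligned}
\]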
Here M represents the margin. The only thing the algorithm has to do is maximize that margin (the specifics of how that is done are beyond the scope of this book). On top of this we add 2 constraints to make our life simpler.
The first constraint ensures that we are not considering any scalar multiples of the hyperplane; if f(x) = 0 defines the hyperplane, then so does k·f(x) = 0 for any non-zero k. It can be thought of as a normalization condition.
The second constraint is stricter. Instead of requiring the points to be merely on the correct side of the hyperplane (> 0), we want them to lie beyond the margin (on the right side). In other words, each point must lie at least a distance M from the hyperplane, in either direction.
What about non-linearity?
2 classes with a non-linear separation. (Source : ISL Python)
Non-linearity is a problem we have come back to time and again. We have learnt both parametric and non-parametric methods to tackle non-linearity before. But remember: our optimal hyperplane is linear & insists on 100% training accuracy, so if the true boundary is non-linear such a hyperplane may not even exist. How then do we tackle non-linearity?
This shortcoming leaves us with two choices: (1) give up some training accuracy by allowing misclassifications, or (2) transform the predictors so that the decision boundary becomes non-linear. We will look at both approaches in the order listed above.
Support Vector Classifier
So we now take a hit in training accuracy by allowing for some misclassifications. We do this by slightly modifying the maximal margin classifier.
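Following ISL, the support vector classifier solves

\[
\begin{aligned}
&\underset{\beta_0, \dots, \beta_p,\, \epsilon_1, \dots, \epsilon_n,\, M}{\text{maximize}} \; M \\
&\text{subject to} \;\; \sum_{j=1}^{p} \beta_j^2 = 1, \\
&\;\; y_i\,(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) \ge M(1 - \epsilon_i), \\
&\;\; \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C.
\end{aligned}
\]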
Here ε_i is a slack variable that measures how far the ith observation falls on the wrong side of the margin, relative to the margin width. If ε_i = 0, the observation is on the correct side of the margin. If 0 < ε_i ≤ 1, it has violated the margin but is still on the correct side of the hyperplane. If ε_i > 1, it is on the wrong side of the hyperplane, i.e. misclassified. By replacing M with M(1 − ε_i) in the constraint, we allow individual observations to violate the margin by a corresponding amount (the bigger the violation, the smaller the distance that observation is required to keep from the hyperplane).
The variation in ε. (Source : Springer)
C is a budget for the total amount of margin violation (the sum of the ε_i) that we will tolerate. If C = 0, we are simply back to the maximal margin classifier. As the value of C increases we allow more violations, widening the margin; as C decreases the margin becomes narrower. Increasing C therefore increases the bias of the model & reduces its variance. In practice the optimal value of C is found by cross-validation.
Variation in support vector classifier as a function of C. (Source : ISL Python)
In the above plot the largest value of C was used in the top-left panel and the smallest value in the bottom-right panel; C decreases across the panels in between.
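As a rough sketch of what that cross-validation step can look like in practice, here is a scikit-learn version on toy data. One caveat (and an assumption of this sketch): scikit-learn’s C parameter penalizes margin violations, so it behaves roughly like the inverse of the budget C described above.

```python
# A minimal sketch: choosing the cost parameter of a linear SVM by cross-validation.
# Caveat: scikit-learn's C penalizes margin violations, so it acts roughly like the
# inverse of the "budget" C used in this chapter (large C here -> narrower margin).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy, nearly separable two-class data standing in for the figure above.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
y = np.where(y == 0, -1, 1)  # code the classes as -1 / +1

search = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.001, 0.01, 0.1, 1, 10, 100]},
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```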
Support Vector Machines
We can deal with non-linear decision boundaries by adding polynomial transformations of the features. This means we will include X², X³,… and other higher powers of X in the feature space. The algorithm will slightly change to accommodate these additions. The exact formulation can be found on page 379 of the textbook.
It is easy to get carried away when expanding the feature space, and unless we are careful we can end up with a huge number of features. This can make it computationally infeasible to fit a support vector classifier. The Support Vector Machine (SVM) is a computationally efficient way to enlarge the feature space, using kernels.
A kernel K is a function that quantifies the similarity between two observations. We can then represent the hyperplane in terms of K. When the support vector classifier is combined with a non-linear kernel, we get an SVM.
Some popular kernels are shown below.
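For reference, these are (in ISL’s notation, for observations x_i and x_{i′} with p features):

\[
\begin{aligned}
\text{Linear:} \quad & K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij}\, x_{i'j} \\
\text{Polynomial (degree } d\text{):} \quad & K(x_i, x_{i'}) = \Bigl(1 + \sum_{j=1}^{p} x_{ij}\, x_{i'j}\Bigr)^{d} \\
\text{Radial:} \quad & K(x_i, x_{i'}) = \exp\Bigl(-\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2\Bigr), \quad \gamma > 0.
\end{aligned}
\]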
An SVM with a linear kernel is essentially the support vector classifier. The polynomial kernel can be used for more flexible decision boundaries. The radial kernel is very local: training observations that are far away (in Euclidean distance) from a test point have essentially no influence on its predicted class.
SVM for different kernels. (Source : ISL Python)
In the left-hand panel an SVM with a polynomial kernel of degree 3 is applied; in the right-hand panel an SVM with a radial kernel. In this case, either kernel captures the decision boundary accurately.
The advantage of kernels is that we only need to compute K(x_i,x_i’) for n(n-1)/2 distinct pairs i, i’. This can be done without working in the enlarged feature space. This is especially useful since in many applications of the SVM, the enlarged feature space is so large that computations are impossible.
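To make this concrete, here is a minimal scikit-learn sketch (on simulated two-moon data, not the dataset from the figure) that fits an SVM with a degree-3 polynomial kernel and one with a radial kernel; the kernel and its parameters would normally be tuned by cross-validation.

```python
# A minimal sketch: SVMs with non-linear kernels, mirroring the two panels above.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: a classic non-linear decision boundary.
X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm_poly = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)  # polynomial kernel
svm_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)   # radial kernel

print("polynomial kernel accuracy:", svm_poly.score(X_test, y_test))
print("radial kernel accuracy:", svm_rbf.score(X_test, y_test))
```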
Working with multiple classes
So far we have only worked with binary classification. But how can we extend the SVM to accommodate multiple classes? It turns out that the idea of a separating hyperplane does not extend naturally beyond two classes. We briefly discuss the two most popular workarounds; a short code sketch of both follows the lists below. We will assume that there are K > 2 classes and let x* be the test observation.
(a) One-versus-One
- Construct K(K-1)/2 SVMs, each comparing one pair of classes. For example, one SVM compares class k (coded +1) with class k′ (coded -1)
- Classify x* using each of the classifiers and then tally the number of times it is assigned to a particular class
- Final classification is done to the majority class
(b) One-versus-All
- Fit K SVMs comparing one of the K classes to the remaining K-1
- Compute f(x*) for each of the K models
- Assign x* to the class for which f(x*) is largest
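As promised above, here is a minimal scikit-learn sketch of the two strategies. Note that SVC already performs one-versus-one internally, while OneVsRestClassifier wraps a binary SVM to implement one-versus-all (scikit-learn calls it one-versus-rest).

```python
# A minimal sketch: one-versus-one vs. one-versus-all multi-class SVMs.
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # K = 3 classes

ovo = SVC(kernel="linear").fit(X, y)                       # one-versus-one (SVC's default)
ova = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # one-versus-all

x_star = X[:1]  # treat the first observation as a "test" point
print("one-versus-one prediction:", ovo.predict(x_star))
print("one-versus-all prediction:", ova.predict(x_star))
```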
Relationship to ridge regression & bigger picture
It turns out that there is another way the support vector classifier optimization can be rewritten, making its form quite similar to ridge regression.
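In ISL’s notation, this rewritten form is

\[
\underset{\beta_0, \beta_1, \dots, \beta_p}{\text{minimize}} \;
\Biggl\{ \sum_{i=1}^{n} \max\bigl[0,\; 1 - y_i f(x_i)\bigr] \;+\; \lambda \sum_{j=1}^{p} \beta_j^2 \Biggr\},
\]

where f(x_i) = β₀ + β₁x_{i1} + ⋯ + β_p x_{ip}. The first term is known as the hinge loss; the second is a ridge-style penalty on the coefficients.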
Here λ is a non-negative tuning parameter that plays the same role as the budget C. A small value of λ results in a support vector classifier with a narrow margin: high variance and low bias. This is very similar to the formulation of lasso and ridge regression; in fact the second term is exactly the penalty used in ridge regression!
While the specific functions may change, the bigger picture is always some objective of the form Loss + Penalty that we aim to minimize.
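Schematically, each of these methods minimizes something of the form

\[
\underset{f}{\text{minimize}} \;\bigl\{ L(\mathbf{X}, \mathbf{y}, f) \;+\; \lambda\, P(f) \bigr\},
\]

where L measures how badly f fits the training data and P penalizes the complexity of f.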
This, then, is the essence of machine learning: the bias-variance trade-off. Finding an optimal solution to a problem without undershooting or overshooting.
SVMs work well in many settings and are often touted as the best “out-of-the-box” method. With another classification algorithm under our belt, we finally move to the chapter everybody & their dog wants to read: neural networks 🧠