<< Kernels | Lecture List | Generalization Bounds >>
\newcommand{\bx}{\mathbf{x}} \newcommand{\bw}{\mathbf{w}}
Historically, Support Vector Machines (SVMs) were motivated by the notion of the maximum margin separating hyperplane. Before we explore this motivation, which is a bit of a MacGuffin, let's relate it to the other linear classification algorithms: we have seen that Logistic Regression and Boosting minimize a convex upper-bound on the 0/1 loss for binary classification: logistic in one case and exponential in the other. Support Vector Machines minimize the hinge loss -- also an upper-bound on the 0/1 loss.
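To make the comparison concrete, here is a small Python sketch (my own illustration, not part of the notes) that checks, at a few values of the margin z = y(\bw^\top\bx + b), that all three convex losses upper-bound the 0/1 loss. The logistic loss is scaled by 1/\log 2 so that it passes through 1 at z = 0, which is the usual normalization that makes it an upper bound.

```python
import math

def zero_one(z):
    """0/1 loss as a function of the margin z = y * f(x)."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    """Hinge loss max(0, 1 - z), minimized by SVMs."""
    return max(0.0, 1.0 - z)

def logistic(z):
    """Logistic loss, scaled by 1/log(2) so it equals 1 at z = 0."""
    return math.log(1.0 + math.exp(-z)) / math.log(2.0)

def exponential(z):
    """Exponential loss, minimized by boosting."""
    return math.exp(-z)

# All three convex surrogates upper-bound the 0/1 loss at every margin value.
for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert hinge(z) >= zero_one(z)
    assert logistic(z) >= zero_one(z)
    assert exponential(z) >= zero_one(z)
```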
A few things to note about the hinge loss, \max(0,\, 1 - y(\bw^\top\bx + b)).
Why all the excitement about this loss? First, it has a nice geometric interpretation (at least in the linearly separable case) as leading to a maximum margin hyperplane separating positive from negative examples. Second, more importantly, it fits particularly well with kernels because many examples will not be a part of the solution, leading to a much more efficient dual representation and algorithms.
Consider the case of linearly separable data. In the toy 2D example below, there are many hyperplanes (lines) that separate the negative examples (blue) from positive (green). If we define the margin of a hyperplane as the distance of the closest point to the decision boundary, the maximum margin hyperplane is shown below.
We will denote our linear classifier as h(\bx) = \text{sign}(\bw^\top\bx + b) (note the constant term b is separated out).
Since rescaling the classifier by multiplying it by a positive constant does not change the decision boundary, sign(\bw^\top\bx + b) = sign(\gamma (\bw^\top\bx + b)), we will set the scale so that at the points closest to the boundary, \bw^\top\bx + b = \pm 1, as shown in the figure below.
Now we will compute the margin, which is the distance of the points closest to the boundary, using one point from each class: \bx_1 (negative) and \bx_2 (positive), for which \bw^\top\bx_1 + b = -1 and \bw^\top\bx_2 + b = +1. Subtracting the two equations gives \bw^\top(\bx_2 - \bx_1) = 2. Hence,
\frac{\bw^\top}{2||\bw||_2}(\bx_2-\bx_1) = \frac{1}{||\bw||_2}.
Note that \frac{\bw^\top}{2||\bw||_2}(\bx_2-\bx_1) is precisely the distance to the hyperplane: the vector \frac{1}{2}(\bx_2-\bx_1) projected on the unit vector \frac{\bw}{||\bw||_2}. Let's restate this more precisely. A rescaled hyperplane classifier satisfies:
y_i(\bw^\top\bx_i + b) \ge 1 \quad \text{for all } i
(with equality for the closest points) and has margin \frac{1}{||\bw||_2}.
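As a quick sanity check, the following sketch (my own, with an arbitrarily chosen w and b) verifies numerically that for a rescaled hyperplane the points with \bw^\top\bx + b = \pm 1 sit at distance exactly \frac{1}{||\bw||_2} from the decision boundary.

```python
import math

def distance_to_hyperplane(w, b, x):
    """Euclidean distance from point x to the hyperplane w.x + b = 0."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(dot + b) / norm

# Rescaled hyperplane: w = (2, 0), b = -2, so w.x + b = +/-1 at the
# closest points x2 = (1.5, 0) (positive) and x1 = (0.5, 0) (negative).
w, b = (2.0, 0.0), -2.0
x1, x2 = (0.5, 0.0), (1.5, 0.0)
margin = 1.0 / math.sqrt(sum(wi * wi for wi in w))  # 1 / ||w||_2 = 0.5

assert abs(distance_to_hyperplane(w, b, x1) - margin) < 1e-12
assert abs(distance_to_hyperplane(w, b, x2) - margin) < 1e-12
```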
To find the maximum margin hyperplane, we can simply maximize \frac{1}{||\bw||_2}, which is equivalent to minimizing \frac{1}{2}\bw^\top\bw subject to the (scaled) correct classification constraints. The SVM optimization problem (in the separable case) is:
\min_{\bw,b}\; \frac{1}{2}\bw^\top\bw \quad \text{s.t.}\quad y_i(\bw^\top\bx_i + b) \ge 1,\; i=1,\ldots,n.
Note on notation: I will use \min/\max instead of \inf/\sup since the continuous nature of the optimization will be obvious from context. This optimization problem has a quadratic objective and linear inequality constraints. Note that only the points closest to the boundary participate in defining the SVM hyperplane. These points are called the support vectors, and we will see that the solution can be expressed only using these points.
In order to get more intuition about the problem, we can look at it from the dual perspective. Lagrangian duality, reviewed in recitation, provides the right tool here. We formulate the Lagrangian by introducing a non-negative multiplier \alpha_i for each inequality constraint:
L(\bw,b,\alpha) = \frac{1}{2}\bw^\top\bw + \sum_i \alpha_i\left(1 - y_i(\bw^\top\bx_i + b)\right).
Note that if \bw,b is feasible (i.e., satisfies all the constraints), then \max_{\alpha\ge 0} L(\bw,b,\alpha) = \frac{1}{2}\bw^\top\bw since all the coefficients (1-y_i(\bw^\top\bx_i + b)) are negative or zero. When \bw,b are not feasible, \max_{\alpha\ge 0} L(\bw,b,\alpha)= \infty. Assuming that a feasible \bw,b exists, \min_{\bw,b} (\max_{\alpha\ge 0} L(\bw,b,\alpha)) solves the original problem, and since the original objective is convex, we can change the order of \min/\max to obtain a dual problem that has the same optimal value: \max_{\alpha\ge 0} (\min_{\bw,b} L(\bw,b,\alpha)). Taking derivatives with respect to \bw and b and setting them to zero, we get:
\bw = \sum_i \alpha_i y_i \bx_i
and
\sum_i \alpha_i y_i = 0.
Note the dual representation of \bw: it is a linear combination of the examples, weighted by the dual variables \alpha_i. Plugging in the expression for \bw, we get
L = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \bx_i^\top\bx_j - b\sum_i \alpha_i y_i.
Since \sum_i \alpha_i y_i = 0, the dual problem is:
\max_{\alpha\ge 0}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \bx_i^\top\bx_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0.
This dual optimization problem, like the primal, has a quadratic objective and linear constraints (a single equality constraint, plus non-negativity of \alpha). Because of the (KKT) optimality conditions, when \bx_i is not among the points closest to the boundary, \alpha_i is zero. So only the points that support the hyperplane, called support vectors, participate in the solution. Given a solution \alpha to the problem, we can recover the primal via:
\bw = \sum_i \alpha_i y_i \bx_i, \qquad b = y_i - \bw^\top\bx_i \;\text{ for any support vector } \bx_i \text{ (i.e., } \alpha_i > 0).
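As a worked illustration (a hand-solvable toy problem of my own, not from the notes), take two points in 2D: \bx_1 = (0,0) with y_1 = -1 and \bx_2 = (2,0) with y_2 = +1. Maximizing the dual by hand gives \alpha_1 = \alpha_2 = 1/2, and both points are support vectors. The sketch below recovers \bw and b from this dual solution:

```python
# Hand-solved toy problem: x1 = (0,0), y1 = -1 and x2 = (2,0), y2 = +1,
# with dual solution alpha1 = alpha2 = 1/2 (both points are support vectors).
xs = [(0.0, 0.0), (2.0, 0.0)]
ys = [-1.0, 1.0]
alphas = [0.5, 0.5]

# Recover w = sum_i alpha_i y_i x_i from the dual solution.
w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(2)]

# Recover b from any support vector i (alpha_i > 0): y_i (w.x_i + b) = 1,
# and since y_i = +/-1 this means b = y_i - w.x_i.
i = 1
b = ys[i] - sum(wd * xd for wd, xd in zip(w, xs[i]))

assert w == [1.0, 0.0]
assert b == -1.0
# Both training points sit exactly on the margin: y_i (w.x_i + b) = 1.
for x, y in zip(xs, ys):
    assert y * (sum(wd * xd for wd, xd in zip(w, x)) + b) == 1.0
```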
We can use the kernel trick again, by assuming a feature map \phi(\bx) such that k(\bx_i,\bx_j) = \phi(\bx_i)^\top\phi(\bx_j). The kernelized dual optimization problem is
\max_{\alpha\ge 0}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j k(\bx_i,\bx_j) \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0.
Given a solution \alpha to the problem, we can recover
b = y_i - \sum_j \alpha_j y_j k(\bx_j,\bx_i) \;\text{ for any support vector } \bx_i,
and the prediction:
\hat{y} = \text{sign}\left(\sum_i \alpha_i y_i k(\bx_i,\bx) + b\right).
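The prediction rule is easy to sketch in code. The following is my own minimal illustration (function names and the toy dual solution are mine, not from the notes); note that only the support vectors, with \alpha_i > 0, need to be stored.

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def linear_kernel(x, z):
    """Ordinary inner product, recovering the linear SVM."""
    return sum(xi * zi for xi, zi in zip(x, z))

def svm_predict(x, support_xs, support_ys, alphas, b, kernel):
    """Kernel SVM prediction: sign(sum_i alpha_i y_i k(x_i, x) + b)."""
    score = sum(a * y * kernel(sx, x)
                for a, y, sx in zip(alphas, support_ys, support_xs)) + b
    return 1.0 if score >= 0.0 else -1.0

# Toy dual solution (hand-solved): x1 = (0,0), y1 = -1; x2 = (2,0), y2 = +1;
# alpha = (1/2, 1/2), b = -1.
support_xs = [(0.0, 0.0), (2.0, 0.0)]
support_ys = [-1.0, 1.0]
alphas = [0.5, 0.5]
b = -1.0
assert svm_predict((3.0, 1.0), support_xs, support_ys, alphas, b, linear_kernel) == 1.0
assert svm_predict((-1.0, 0.0), support_xs, support_ys, alphas, b, linear_kernel) == -1.0
```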
When the data is not separable, the above optimization will fail. To handle points on the wrong side of the fence, SVMs introduce a penalty. Although we might like to minimize the number of misclassified points, that is an NP-hard problem. Instead, SVMs penalize points where y_i(\bw^\top \bx_i + b) < 1 using a linear cost, \max(0,\, 1-y_i(\bw^\top \bx_i + b)), the hinge function. In the primal optimization problem above, this can be accomplished by adding a slack variable \xi_i for each example and penalizing their sum, weighted by a positive slack penalty constant C:
\min_{\bw,b,\xi}\; \frac{1}{2}\bw^\top\bw + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(\bw^\top\bx_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0.
For any given \bw,b, since we're minimizing over \xi, we will have \xi_i = \max(0,\, 1-y_i(\bw^\top \bx_i + b)) for each i.
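This identity between the optimal slacks and the hinge loss is easy to check numerically. A minimal sketch (hyperplane and points chosen by me for illustration):

```python
def optimal_slack(w, b, x, y):
    """At the optimum of the soft-margin SVM, the slack xi for example
    (x, y) equals the hinge loss max(0, 1 - y * (w.x + b))."""
    margin = y * (sum(wd * xd for wd, xd in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

# Hypothetical hyperplane w = (1, 0), b = -1.
w, b = (1.0, 0.0), -1.0
assert optimal_slack(w, b, (3.0, 0.0), 1.0) == 0.0   # outside the margin
assert optimal_slack(w, b, (1.5, 0.0), 1.0) == 0.5   # inside the margin
assert optimal_slack(w, b, (0.0, 0.0), 1.0) == 2.0   # misclassified
```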
Taking the dual, we get:
\max_{0\le\alpha_i\le C}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \bx_i^\top\bx_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0.
Here's a short derivation. In addition to the \alpha_i's, we introduce a non-negative Lagrange multiplier \lambda_i for each \xi_i \ge 0 constraint:
L(\bw,b,\xi,\alpha,\lambda) = \frac{1}{2}\bw^\top\bw + C\sum_i \xi_i + \sum_i \alpha_i\left(1 - \xi_i - y_i(\bw^\top\bx_i + b)\right) - \sum_i \lambda_i \xi_i.
Setting the derivative with respect to \xi_i to zero gives C - \alpha_i - \lambda_i = 0.
Plugging in the expression for \bw and using the condition \sum_i \alpha_i y_i = 0 as before, we have
L = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \bx_i^\top\bx_j + \sum_i (C - \alpha_i - \lambda_i)\xi_i.
Now using the condition C - \alpha_i - \lambda_i = 0, the last sum vanishes, so we are left with the dual:
\max_{\alpha\ge 0,\,\lambda\ge 0}\; \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j \bx_i^\top\bx_j \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\; C - \alpha_i - \lambda_i = 0.
We can get rid of the \lambda_i's by noting that they only appear in the constraint C - \alpha_i - \lambda_i = 0, and since they must be non-negative, we can just replace the constraint with \alpha_i \le C, as in the dual stated above.
Note that the same dual representation properties hold for this version:
\bw = \sum_i \alpha_i y_i \bx_i.
Given a solution \alpha to the problem, we can recover
b = y_i - \sum_j \alpha_j y_j k(\bx_j,\bx_i) \;\text{ for any support vector } \bx_i \text{ with } 0 < \alpha_i < C,
and the prediction:
\hat{y} = \text{sign}\left(\sum_i \alpha_i y_i k(\bx_i,\bx) + b\right).
We will not spend too much time on how to optimize quadratic programs with linear constraints. There is a very large literature on the topic of general convex optimization and SVM optimization in particular, and many off-the-shelf efficient tools have been developed for both. In recent years, simple algorithms (e.g., SMO, subgradient) that find approximate answers to the SVM problem have been shown to work quite well. Here is an applet for kernel SVMs in 2D.
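To give a flavor of the subgradient approach, here is a rough pure-Python sketch (my own, not the notes' implementation) of batch subgradient descent on the regularized hinge-loss objective \frac{\lambda}{2}||\bw||_2^2 + \frac{1}{n}\sum_i \max(0,\, 1 - y_i(\bw^\top\bx_i + b)); the toy data, step size, and iteration count are arbitrary choices for illustration.

```python
def svm_subgradient(xs, ys, lam=0.01, lr=0.1, steps=2000):
    """Batch subgradient descent on the soft-margin SVM objective
    (lam/2) ||w||^2 + (1/n) sum_i max(0, 1 - y_i (w.x_i + b)).
    A rough sketch; real solvers (SMO, Pegasos, ...) are far more refined."""
    n, dim = len(xs), len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        # Subgradient: the regularizer contributes lam * w; each example
        # with margin below 1 contributes -(1/n) y_i x_i (and -(1/n) y_i to b).
        gw = [lam * wd for wd in w]
        gb = 0.0
        for x, y in zip(xs, ys):
            if y * (sum(wd * xd for wd, xd in zip(w, x)) + b) < 1.0:
                gw = [gd - y * xd / n for gd, xd in zip(gw, x)]
                gb -= y / n
        w = [wd - lr * gd for wd, gd in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy linearly separable data: the label is the sign of the first coordinate.
xs = [(-2.0, 1.0), (-1.0, -1.0), (1.0, 0.5), (2.0, -0.5)]
ys = [-1.0, -1.0, 1.0, 1.0]
w, b = svm_subgradient(xs, ys)
assert all(y * (sum(wd * xd for wd, xd in zip(w, x)) + b) > 0
           for x, y in zip(xs, ys))  # all training points correctly classified
```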