Support Vector Machine

Classification · Support Vector Machines (SVMs) · Kernel Methods · Machine Learning · Nonlinear Classification
Support vector machine graphical representation

Image credit: García-Gonzalo et al. (2016)

Support Vector Machines (SVMs) are a powerful and versatile tool in machine learning, designed to tackle both linear and nonlinear classification problems. At their core, SVMs aim to find the optimal hyperplane that not only separates different classes but does so with the maximum margin. This margin, representing the smallest distance between the hyperplane and any data point, is key to SVM’s robustness and generalization capabilities.

In this blog, we'll delve into the essential components of SVMs, including:

  • Calculating the distance from a point to a hyperplane and defining the margin.
  • Formulating the hard-margin SVM and understanding how it finds the optimal separating hyperplane.
  • Extending to soft-margin SVM to handle misclassifications and margin violations.
  • Exploring the kernel trick, a powerful technique that enables SVMs to:
        🔹 Efficiently classify complex, nonlinear data.
        🔹 Map data into higher-dimensional spaces.
        🔹 Avoid explicitly computing high-dimensional transformations.

Model Specification

Let $\{(x_i, y_i)\}_{i=1}^n$ be our training set, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$. We define:

  • $w \in \mathbb{R}^p$ as the normal vector,
  • $b \in \mathbb{R}$ as the bias,
  • the margin $\gamma$ as the minimum distance from the hyperplane to any data point.

In an SVM, the decision rule is given by the sign of $w^\top x + b$. If $w^\top x + b > 0$, then $y = +1$; otherwise $y = -1$. This simple yet powerful model underlies the classification process.
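As a quick sketch in plain Python (the weights $w = (1, -1)$ and bias $b = 0.5$ below are made-up values for illustration, not a trained model), the decision rule looks like:

```python
# Decision rule: predict the sign of w^T x + b.
def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score > 0 else -1

# Illustrative hyperplane w = (1, -1), b = 0.5:
w, b = [1.0, -1.0], 0.5
print(predict(w, b, [2.0, 0.0]))   # score 2.5 > 0  -> +1
print(predict(w, b, [0.0, 2.0]))   # score -1.5 < 0 -> -1
```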

Margin and Perfect Separation

Background: Distance from a Point to a Hyperplane

In three-dimensional space, the perpendicular distance $d$ from a point $(x_1, y_1, z_1)$ to a plane

$$Ax + By + Cz + D = 0$$

is given by

$$d = \frac{|Ax_1 + By_1 + Cz_1 + D|}{\sqrt{A^2 + B^2 + C^2}}.$$

Generalizing this to $\mathbb{R}^p$, consider the hyperplane

$$w^\top x + b = 0,$$

where $w \in \mathbb{R}^p$ is the normal vector and $b \in \mathbb{R}$ is the bias term. For any point $x_0$, its perpendicular distance to this hyperplane is

$$\frac{|w^\top x_0 + b|}{\|w\|}.$$
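This distance formula translates directly to code. The plane and points below are arbitrary examples chosen for easy mental checking:

```python
import math

# Perpendicular distance from point x0 to the hyperplane w^T x + b = 0.
def distance_to_hyperplane(w, b, x0):
    score = sum(wj * xj for wj, xj in zip(w, x0)) + b
    return abs(score) / math.sqrt(sum(wj * wj for wj in w))

# The plane x + y + z - 3 = 0 and the point (1, 1, 1), which lies on it:
print(distance_to_hyperplane([1.0, 1.0, 1.0], -3.0, [1.0, 1.0, 1.0]))  # 0.0
# The origin is |-3| / sqrt(3) = sqrt(3) away:
print(distance_to_hyperplane([1.0, 1.0, 1.0], -3.0, [0.0, 0.0, 0.0]))
```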

Margin Derivation

For a point $x_i$ with label $y_i$, its signed distance to the hyperplane is

$$\frac{y_i\bigl(w^\top x_i + b\bigr)}{\|w\|}.$$

To ensure correct classification with a buffer (the margin), we impose

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Under this scaling, the margin is

$$\gamma = \min_{i} \frac{y_i\bigl(w^\top x_i + b\bigr)}{\|w\|} = \frac{1}{\|w\|}.$$

Thus, maximizing the margin reduces to minimizing $\|w\|$ (or $\tfrac{1}{2}\|w\|^2$ for mathematical convenience).

Hard-Margin SVM

If the training data are perfectly separable, we can write the constraints as

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Then the hard-margin SVM optimization problem is

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Derivation of the Hinge Loss

In practice, data are rarely perfectly separable. To accommodate misclassifications or margin violations, we introduce slack variables $\xi_i \ge 0$. The modified constraint is

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n.$$

The slack variable $\xi_i$ quantifies how much the constraint is violated. If the constraint is met exactly or exceeded, then $\xi_i = 0$; if not, $\xi_i$ is the shortfall. Then, this violation can be written as

$$\xi_i = \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

This expression is known as the hinge loss:

$$L_{\text{hinge}}\bigl(y_i, f(x_i)\bigr) = \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

It is zero when $y_i(w^\top x_i + b) \ge 1$ (i.e., when the data point is correctly classified with a sufficient margin) and increases linearly otherwise.
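The three regimes of the hinge loss (safely classified, inside the margin, misclassified) are easy to see numerically. The hyperplane $w = (1, 0)$, $b = 0$ below is an arbitrary illustration:

```python
# Hinge loss for a single example: max(0, 1 - y * (w^T x + b)).
def hinge_loss(w, b, x, y):
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = [1.0, 0.0], 0.0
print(hinge_loss(w, b, [2.0, 0.0], +1))   # margin 2 >= 1   -> loss 0.0
print(hinge_loss(w, b, [0.5, 0.0], +1))   # inside margin   -> loss 0.5
print(hinge_loss(w, b, [-1.0, 0.0], +1))  # misclassified   -> loss 2.0
```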

Soft-Margin SVM (Convex Optimization)

To build a formulation that penalizes margin violations, we incorporate the slack variables directly into the objective function, combining the margin term and the penalty term into a single convex optimization problem. The resulting soft-margin SVM optimization problem is:

$$\min_{w,b,\{\xi_i\}} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad \xi_i \ge 0, \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n,$$

where $C > 0$ is a parameter that balances the trade-off between maximizing the margin and minimizing the misclassification error.

Using our derivation of the hinge loss, we can equivalently express the penalty term as

$$C \sum_{i=1}^{n} \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

Thus, the soft-margin SVM optimization problem becomes:

$$\min_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

Because the objective is convex, it can be solved via methods such as quadratic programming or gradient-based algorithms.
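As an illustrative sketch of a gradient-based approach, the unconstrained objective above can be minimized with plain subgradient descent. The toy dataset, learning rate, and iteration count below are arbitrary choices, not prescriptions:

```python
# Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)).
def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=500):
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0           # gradient of (1/2)||w||^2 is w
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:              # hinge active: subgradient -C y_i x_i
                for j in range(p):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy linearly separable data in 2-D (illustrative only):
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [+1, +1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
print(preds)  # on this toy data, all four points end up correctly classified
```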

The Kernel Trick: Handling Nonlinear Boundaries

Some datasets cannot be well separated by a hyperplane in the original feature space. To address this, we map data into a (possibly high-dimensional) feature space using a transformation $\phi$:

$$\phi: \mathbb{R}^p \to \mathcal{H}, \quad x \mapsto \phi(x).$$

This notation is read as: “The function $\phi$ takes an input $x$ from the original $p$-dimensional space and transforms it into a new representation $\phi(x)$ in a higher-dimensional Hilbert space $\mathcal{H}$.” Here:

  • $\mathbb{R}^p$ is the original input space where our data lives.
  • $\mathcal{H}$ is a Hilbert space (often high-dimensional or even infinite-dimensional) in which the data may become linearly separable.
  • The arrow $x \mapsto \phi(x)$ indicates that each data point $x$ is transformed into $\phi(x)$.

In the feature space $\mathcal{H}$, the SVM decision boundary becomes

$$w^\top \phi(x) + b = 0.$$

Directly computing $\phi(x)$ might be impractical if $\mathcal{H}$ is very large. Instead, we define a kernel function $K$ such that

$$K(x, x') = \phi(x)^\top \phi(x').$$

This allows us to compute inner products in $\mathcal{H}$ using only the original input vectors, avoiding explicit computation of $\phi(x)$. The kernel $K(x, x')$ implicitly defines a Reproducing Kernel Hilbert Space (RKHS), a Hilbert space of functions in which point evaluation is continuous. The RKHS provides the theoretical foundation for the kernel trick: it guarantees that $K(x, x')$ acts as a valid inner product, i.e., that $K(x, x') = \phi(x)^\top \phi(x')$ for some feature map $\phi$ into a high-dimensional space.
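To make the identity $K(x, x') = \phi(x)^\top \phi(x')$ concrete, here is a small check using the standard textbook example (not from the text above) of the homogeneous degree-2 polynomial kernel in $\mathbb{R}^2$, whose explicit feature map is $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:

```python
import math

# K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def K(x, z):
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = K(x, z)                                     # kernel: no phi needed
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # same value via explicit phi
print(lhs, rhs)
```

Both routes give the same number, but the kernel route never materializes the 3-dimensional feature vectors.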

Primal and Dual Formulations of SVM

In optimization, the primal problem is the original formulation where we directly minimize an objective function over the decision variables subject to constraints. For example, a general constrained problem can be written as:

$$\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$

The dual problem is derived by introducing Lagrange multipliers to form a lower bound on the primal objective: one takes the infimum of the Lagrangian over the primal variables. Under conditions such as convexity and Slater's condition, strong duality holds, meaning the optimal values of the primal and dual problems are equal. In the case of a soft-margin SVM, the primal formulation seeks the weight vector $w$, bias $b$, and slack variables $\xi$ that balance maximizing the margin against minimizing classification errors. Its objective is:

$$\min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,$$

subject to

$$1 - \xi_i - y_i\bigl(w^\top x_i + b\bigr) \le 0 \quad \text{and} \quad -\xi_i \le 0, \quad i = 1, \dots, n.$$

By introducing Lagrange multipliers $\alpha_i \ge 0$ for the margin constraints and $\beta_i \ge 0$ for the slack constraints, we form the Lagrangian:

$$\begin{aligned} L(w, b, \xi, \alpha, \beta) = {} & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ & + \sum_{i=1}^{n} \alpha_i \bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) - \sum_{i=1}^{n} \beta_i \xi_i. \end{aligned}$$

By differentiating the Lagrangian $L$ with respect to the primal variables and setting the derivatives to zero, we obtain the stationarity part of the Karush-Kuhn-Tucker (KKT) conditions, the necessary optimality conditions that must hold at the optimal solution of the constrained problem. For the SVM problem, these conditions are as follows:

Differentiating with respect to $w$:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Differentiating with respect to $b$:

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Differentiating with respect to each $\xi_i$:

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \quad \Longrightarrow \quad \beta_i = C - \alpha_i.$$

Since $\beta_i \ge 0$, it follows that $0 \le \alpha_i \le C$. Substituting these conditions back into the Lagrangian eliminates the primal variables, yielding a dual formulation expressed solely in terms of the Lagrange multipliers and inner products between data points, which paves the way for the kernel trick.
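A small numeric sketch of the stationarity condition $w = \sum_i \alpha_i y_i x_i$, using made-up multipliers (note that $\alpha_3 = 0$, so $x_3$ contributes nothing, mirroring the role of non-support vectors):

```python
# Recover w from the stationarity condition w = sum_i alpha_i y_i x_i,
# using illustrative (not optimized) multipliers alpha.
def weight_from_duals(alphas, y, X):
    p = len(X[0])
    w = [0.0] * p
    for a, yi, xi in zip(alphas, y, X):
        for j in range(p):
            w[j] += a * yi * xi[j]
    return w

X = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.0]]
y = [+1, -1, +1]
alphas = [0.5, 0.5, 0.0]   # alpha_3 = 0: x_3 is not a support vector
print(weight_from_duals(alphas, y, X))  # [1.0, 1.0]
```

These multipliers also satisfy the other stationarity condition, $\sum_i \alpha_i y_i = 0.5 - 0.5 + 0 = 0$.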


Objective Function with a Kernel

To handle nonlinear boundaries, we map data into a high-dimensional feature space using a function $\phi$, and define the kernel function $K(x, x') = \phi(x)^\top \phi(x')$. In this space, the soft-margin SVM primal objective becomes:

$$\min_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl\{0,\, 1 - y_i\bigl(w^\top \phi(x_i) + b\bigr)\bigr\}.$$

Here, the term $\|w\|^2$ controls the margin's width. Following the dual derivation, by enforcing the constraints with Lagrange multipliers and applying the KKT conditions, the weight vector in the feature space is expressed as:

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i).$$

Substituting this expression back into the norm term yields:

$$\|w\|^2 = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \phi(x_i)^\top \phi(x_j) = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$

Thus, the kernelized dual formulation is:

$$\max_{\alpha} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. This derivation shows how the $L_2$ regularization in the primal problem is transformed into kernel evaluations in the dual, enabling SVMs to efficiently handle nonlinear boundaries without explicitly computing $\phi(x)$.
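Once the dual is solved, prediction likewise needs only kernel evaluations, via $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$. The multipliers and bias below are illustrative placeholders, not an actual dual solution:

```python
import math

# Kernelized decision function: f(x) = sum_i alpha_i y_i K(x_i, x) + b.
def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))

def decision(x, X, y, alphas, b, kernel=rbf):
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y, X)) + b

X_train = [[0.0, 0.0], [1.0, 1.0]]
y_train = [+1, -1]
alphas, b = [1.0, 1.0], 0.0        # placeholder values for illustration
print(decision([0.1, 0.0], X_train, y_train, alphas, b) > 0)  # near x_1 -> positive
```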

Examples of Common Kernels

  • Polynomial Kernel: K(x,xβ€²)=(α x⊀xβ€²+c)d,K(x, x') = \bigl(\alpha\, x^\top x' + c\bigr)^d, where Ξ±\alpha, cc, and dd are hyperparameters.
  • Radial Basis Function (RBF) Kernel: K(x,xβ€²)=exp⁑(βˆ’Ξ³β€‰βˆ₯xβˆ’xβ€²βˆ₯2),K(x, x') = \exp\Bigl(-\gamma\,\|x - x'\|^2\Bigr), with Ξ³>0\gamma > 0 controlling the spread.
  • What happens:
    • In training, the dual formulation of the SVM uses only dot products, which are computed as K(xi,xj)K(x_i, x_j).
    • The final classifier depends solely on these kernel evaluations rather than on explicit high-dimensional feature vectors.
  • What does not happen:
    • We do not build or store the high-dimensional vectors Ο•(x)\phi(x).
    • We do not directly compute large dot products Ο•(xi)β‹…Ο•(xj)\phi(x_i) \cdot \phi(x_j) in the transformed space.

Recap

  • Distance to Hyperplane: d=∣w⊀x+b∣βˆ₯wβˆ₯.d = \frac{|w^\top x + b|}{\|w\|}.
  • Margin:
    Imposing yi(w⊀xi+b)β‰₯1y_i\bigl(w^\top x_i + b\bigr) \ge 1 leads to Ξ³=1βˆ₯wβˆ₯\gamma = \frac{1}{\|w\|}.
  • Hard-Margin SVM: min⁑w,bβ€…β€Š12βˆ₯wβˆ₯2s.t.yi(w⊀xi+b)β‰₯1.\min_{w,b} \; \frac12 \|w\|^2 \quad \text{s.t.} \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1.
  • Hinge Loss: max⁑{0, 1βˆ’yi(w⊀xi+b)}.\max\{0,\,1 - y_i\bigl(w^\top x_i + b\bigr)\}.
  • Soft-Margin SVM: min⁑w,bβ€…β€Š12βˆ₯wβˆ₯2+Cβˆ‘i=1nmax⁑{0, 1βˆ’yi(w⊀xi+b)}.\min_{w,b} \; \frac12 \|w\|^2 + C \sum_{i=1}^n \max\{0,\,1 - y_i\bigl(w^\top x_i + b\bigr)\}.
  • Kernel Trick: K(x,xβ€²)=Ο•(x)βŠ€Ο•(xβ€²),K(x,x') = \phi(x)^\top \phi(x'), which allows us to implicitly map data to a high-dimensional space without explicitly computing Ο•(x)\phi(x).
  • Kernelized Dual Form: maxβ‘Ξ±βˆ‘i=1nΞ±iβˆ’12βˆ‘i,j=1nΞ±iΞ±jyiyjK(xi,xj),\max_{\alpha}\quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i,x_j), subject to 0≀αi≀C0 \le \alpha_i \le C and βˆ‘i=1nΞ±iyi=0\sum_{i=1}^{n}\alpha_i y_i = 0.