Support Vector Machine

Classification · Support Vector Machines (SVMs) · Kernel Methods · Machine Learning · Nonlinear Classification
Support vector machine graphical representation

Image credit: García-Gonzalo et al. (2016)

Support Vector Machines (SVMs) are a powerful and versatile tool in machine learning, designed to tackle both linear and nonlinear classification problems. At their core, SVMs aim to find the optimal hyperplane that not only separates different classes but does so with the maximum margin. This margin, representing the smallest distance between the hyperplane and any data point, is key to SVM’s robustness and generalization capabilities.

In this blog, we'll delve into the essential components of SVMs, including:

  • Calculating the distance from a point to a hyperplane and defining the margin.
  • Formulating the hard-margin SVM and understanding how it finds the optimal separating hyperplane.
  • Extending to soft-margin SVM to handle misclassifications and margin violations.
  • Exploring the kernel trick, a powerful technique that enables SVMs to:
        🔹 Efficiently classify complex, nonlinear data.
        🔹 Map data into higher-dimensional spaces.
        🔹 Avoid explicitly computing high-dimensional transformations.

Model Specification

Let $\{(x_i, y_i)\}_{i=1}^n$ be our training set, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$. We define:

  • $w \in \mathbb{R}^p$ as the normal vector,
  • $b \in \mathbb{R}$ as the bias,
  • the margin $\gamma$ as the minimum distance from the hyperplane to any data point.

In an SVM, the decision rule is given by the sign of $w^\top x + b$. If $w^\top x + b > 0$, then $y = +1$; otherwise $y = -1$. This simple yet powerful model underlies the classification process.
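As a quick sketch in plain Python (the weights $w = (1, -1)$ and bias $b = 0.5$ below are made-up values for illustration, not a trained model), the decision rule looks like:

```python
# Decision rule: predict the sign of w^T x + b.
def predict(w, b, x):
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return +1 if score > 0 else -1

# Illustrative hyperplane w = (1, -1), b = 0.5:
w, b = [1.0, -1.0], 0.5
print(predict(w, b, [2.0, 0.0]))   # score 2.5 > 0  -> +1
print(predict(w, b, [0.0, 2.0]))   # score -1.5 < 0 -> -1
```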

Margin and Perfect Separation

Background: Distance from a Point to a Hyperplane

In three-dimensional space, the perpendicular distance $d$ from a point $(x_1, y_1, z_1)$ to a plane

$$Ax + By + Cz + D = 0$$

is given by

$$d = \frac{|Ax_1 + By_1 + Cz_1 + D|}{\sqrt{A^2 + B^2 + C^2}}.$$

Generalizing this to $\mathbb{R}^p$, consider the hyperplane

$$w^\top x + b = 0,$$

where $w \in \mathbb{R}^p$ is the normal vector and $b \in \mathbb{R}$ is the bias term. For any point $x_0$, its perpendicular distance to this hyperplane is

$$\frac{|w^\top x_0 + b|}{\|w\|}.$$
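This distance formula translates directly to code. The plane and points below are arbitrary examples chosen for easy mental checking:

```python
import math

# Perpendicular distance from point x0 to the hyperplane w^T x + b = 0.
def distance_to_hyperplane(w, b, x0):
    score = sum(wj * xj for wj, xj in zip(w, x0)) + b
    return abs(score) / math.sqrt(sum(wj * wj for wj in w))

# The plane x + y + z - 3 = 0 and the point (1, 1, 1), which lies on it:
print(distance_to_hyperplane([1.0, 1.0, 1.0], -3.0, [1.0, 1.0, 1.0]))  # 0.0
# The origin is |-3| / sqrt(3) = sqrt(3) away:
print(distance_to_hyperplane([1.0, 1.0, 1.0], -3.0, [0.0, 0.0, 0.0]))
```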

Margin Derivation

For a point $x_i$ with label $y_i$, its signed distance to the hyperplane is

$$\frac{y_i\bigl(w^\top x_i + b\bigr)}{\|w\|}.$$

To ensure correct classification with a buffer (the margin), we impose

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Under this scaling, the margin is

$$\gamma = \min_{i} \frac{y_i\bigl(w^\top x_i + b\bigr)}{\|w\|} = \frac{1}{\|w\|}.$$

Thus, maximizing the margin reduces to minimizing $\|w\|$ (or $\tfrac{1}{2}\|w\|^2$ for mathematical convenience).

Hard-Margin SVM

If the training data are perfectly separable, we can write the constraints as

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Then the hard-margin SVM optimization problem is

$$\min_{w,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1, \quad i = 1, \dots, n.$$

Derivation of the Hinge Loss

In practice, data are rarely perfectly separable. To accommodate misclassifications or margin violations, we introduce slack variables $\xi_i \ge 0$. The modified constraint is

$$y_i\bigl(w^\top x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n.$$

The slack variable $\xi_i$ quantifies how much the constraint is violated. If the constraint is met exactly or exceeded, then $\xi_i = 0$; if not, $\xi_i$ is the shortfall. Then, this violation can be written as

$$\xi_i = \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

This expression is known as the hinge loss:

$$L_{\text{hinge}}\bigl(y_i, f(x_i)\bigr) = \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

It is zero when $y_i(w^\top x_i + b) \ge 1$ (i.e., when the data point is correctly classified with a sufficient margin) and increases linearly otherwise.
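The three regimes of the hinge loss (safely classified, inside the margin, misclassified) are easy to see numerically. The hyperplane $w = (1, 0)$, $b = 0$ below is an arbitrary illustration:

```python
# Hinge loss for a single example: max(0, 1 - y * (w^T x + b)).
def hinge_loss(w, b, x, y):
    margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
    return max(0.0, 1.0 - margin)

w, b = [1.0, 0.0], 0.0
print(hinge_loss(w, b, [2.0, 0.0], +1))   # margin 2 >= 1   -> loss 0.0
print(hinge_loss(w, b, [0.5, 0.0], +1))   # inside margin   -> loss 0.5
print(hinge_loss(w, b, [-1.0, 0.0], +1))  # misclassified   -> loss 2.0
```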

Soft-Margin SVM (Convex Optimization)

To build a formulation that penalizes margin violations, we incorporate the slack variables directly into the objective function, combining the margin term and the penalty term into a single convex optimization problem. The resulting soft-margin SVM optimization problem is:

$$\min_{w,b,\{\xi_i\}} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad \xi_i \ge 0, \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1 - \xi_i, \quad i = 1, \dots, n,$$

where $C > 0$ is a parameter that balances the trade-off between maximizing the margin and minimizing the misclassification error.

Using our derivation of the hinge loss, we can equivalently express the penalty term as

$$C \sum_{i=1}^{n} \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

Thus, the soft-margin SVM optimization problem becomes:

$$\min_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\bigl\{0,\, 1 - y_i\bigl(w^\top x_i + b\bigr)\bigr\}.$$

Because the objective is convex, it can be solved via methods such as quadratic programming or gradient-based algorithms.
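As an illustrative sketch of a gradient-based approach, the unconstrained objective above can be minimized with plain subgradient descent. The toy dataset, learning rate, and iteration count below are arbitrary choices, not prescriptions:

```python
# Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)).
def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=500):
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0           # gradient of (1/2)||w||^2 is w
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:              # hinge active: subgradient -C y_i x_i
                for j in range(p):
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy linearly separable data in 2-D (illustrative only):
X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
y = [+1, +1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b > 0 else -1 for xi in X]
print(preds)  # on this toy data, all four points end up correctly classified
```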

The Kernel Trick: Handling Nonlinear Boundaries

Some datasets cannot be well separated by a hyperplane in the original feature space. To address this, we map data into a (possibly high-dimensional) feature space using a transformation $\phi$:

$$\phi: \mathbb{R}^p \to \mathcal{H}, \quad x \mapsto \phi(x).$$

This notation is read as: “The function $\phi$ takes an input $x$ from the original $p$-dimensional space and transforms it into a new representation $\phi(x)$ in a higher-dimensional Hilbert space $\mathcal{H}$.” Here:

  • $\mathbb{R}^p$ is the original input space where our data lives.
  • $\mathcal{H}$ is a Hilbert space (often high-dimensional or even infinite-dimensional) in which the data may become linearly separable.
  • The arrow $x \mapsto \phi(x)$ indicates that each data point $x$ is transformed into $\phi(x)$.

In the feature space $\mathcal{H}$, the SVM decision boundary becomes

$$w^\top \phi(x) + b = 0.$$

Directly computing $\phi(x)$ might be impractical if $\mathcal{H}$ is very large. Instead, we define a kernel function $K$ such that

$$K(x, x') = \phi(x)^\top \phi(x').$$

This allows us to compute inner products in $\mathcal{H}$ using only the original input vectors, avoiding explicit computation of $\phi(x)$. The kernel $K(x, x')$ implicitly defines a Reproducing Kernel Hilbert Space (RKHS), a Hilbert space of functions in which point evaluation is continuous. The RKHS provides the theoretical foundation for the kernel trick: it guarantees that $K(x, x')$ acts as a valid inner product, i.e., that $K(x, x') = \phi(x)^\top \phi(x')$ for some feature map $\phi$ into a high-dimensional space.
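To make the identity $K(x, x') = \phi(x)^\top \phi(x')$ concrete, here is a small check using the standard textbook example (not from the text above) of the homogeneous degree-2 polynomial kernel in $\mathbb{R}^2$, whose explicit feature map is $\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)$:

```python
import math

# K(x, z) = (x^T z)^2 equals phi(x)^T phi(z) for
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2).
def phi(x):
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

def K(x, z):
    return sum(xi * zi for xi, zi in zip(x, z)) ** 2

x, z = [1.0, 2.0], [3.0, -1.0]
lhs = K(x, z)                                     # kernel: no phi needed
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))  # same value via explicit phi
print(lhs, rhs)
```

Both routes give the same number, but the kernel route never materializes the 3-dimensional feature vectors.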

Primal and Dual Formulations of SVM

In optimization, the primal problem is the original formulation where we directly minimize an objective function over the decision variables subject to constraints. For example, a general constrained problem can be written as:

$$\begin{aligned} \text{minimize } \quad & f(x) \\ \text{subject to } \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$

The dual problem is derived by introducing Lagrange multipliers to form a lower bound on the primal objective: one takes the infimum of the Lagrangian over the primal variables. Under conditions such as convexity and Slater's condition, strong duality holds, meaning the optimal values of the primal and dual problems are equal. In the case of a soft-margin SVM, the primal formulation seeks the weight vector $w$, bias $b$, and slack variables $\xi$ that balance maximizing the margin against minimizing classification errors. Its objective is:

$$\min_{w,b,\xi} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i,$$

subject to

$$1 - \xi_i - y_i\bigl(w^\top x_i + b\bigr) \le 0 \quad \text{and} \quad -\xi_i \le 0, \quad i = 1, \dots, n.$$

By introducing Lagrange multipliers $\alpha_i \ge 0$ for the margin constraints and $\beta_i \ge 0$ for the slack constraints, we form the Lagrangian:

$$\begin{aligned} L(w, b, \xi, \alpha, \beta) = {} & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ & + \sum_{i=1}^{n} \alpha_i \bigl(1 - \xi_i - y_i(w^\top x_i + b)\bigr) - \sum_{i=1}^{n} \beta_i \xi_i. \end{aligned}$$

By differentiating the Lagrangian $L$ with respect to the primal variables and setting the derivatives to zero, we obtain the stationarity part of the Karush-Kuhn-Tucker (KKT) conditions, the necessary optimality conditions that must hold at the optimal solution of the constrained problem. For the SVM problem, these conditions are as follows:

Differentiating with respect to $w$:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0 \quad \Longrightarrow \quad w = \sum_{i=1}^{n} \alpha_i y_i x_i.$$

Differentiating with respect to $b$:

$$\frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0 \quad \Longrightarrow \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

Differentiating with respect to each $\xi_i$:

$$\frac{\partial L}{\partial \xi_i} = C - \alpha_i - \beta_i = 0 \quad \Longrightarrow \quad \beta_i = C - \alpha_i.$$

Since $\beta_i \ge 0$, it follows that $0 \le \alpha_i \le C$. Substituting these conditions back into the Lagrangian eliminates the primal variables, yielding a dual formulation expressed solely in terms of the Lagrange multipliers and inner products between data points, which paves the way for the kernel trick.
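A small numeric sketch of the stationarity condition $w = \sum_i \alpha_i y_i x_i$, using made-up multipliers (note that $\alpha_3 = 0$, so $x_3$ contributes nothing, mirroring the role of non-support vectors):

```python
# Recover w from the stationarity condition w = sum_i alpha_i y_i x_i,
# using illustrative (not optimized) multipliers alpha.
def weight_from_duals(alphas, y, X):
    p = len(X[0])
    w = [0.0] * p
    for a, yi, xi in zip(alphas, y, X):
        for j in range(p):
            w[j] += a * yi * xi[j]
    return w

X = [[1.0, 1.0], [-1.0, -1.0], [2.0, 0.0]]
y = [+1, -1, +1]
alphas = [0.5, 0.5, 0.0]   # alpha_3 = 0: x_3 is not a support vector
print(weight_from_duals(alphas, y, X))  # [1.0, 1.0]
```

These multipliers also satisfy the other stationarity condition, $\sum_i \alpha_i y_i = 0.5 - 0.5 + 0 = 0$.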


Objective Function with a Kernel

To handle nonlinear boundaries, we map data into a high-dimensional feature space using a function $\phi$, and define the kernel function $K(x, x') = \phi(x)^\top \phi(x')$. In this space, the soft-margin SVM primal objective becomes:

$$\min_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \max\bigl\{0,\, 1 - y_i\bigl(w^\top \phi(x_i) + b\bigr)\bigr\}.$$

Here, the term $\|w\|^2$ controls the margin's width. Following the dual derivation, by enforcing the constraints with Lagrange multipliers and applying the KKT conditions, the weight vector in the feature space is expressed as:

$$w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i).$$

Substituting this expression back into the norm term yields:

$$\|w\|^2 = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \phi(x_i)^\top \phi(x_j) = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j).$$

Thus, the kernelized dual formulation is:

$$\max_{\alpha} \quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j),$$

subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. This derivation shows how the $L_2$ regularization in the primal problem is transformed into kernel evaluations in the dual, enabling SVMs to efficiently handle nonlinear boundaries without explicitly computing $\phi(x)$.
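Once the dual is solved, prediction likewise needs only kernel evaluations, via $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$. The multipliers and bias below are illustrative placeholders, not an actual dual solution:

```python
import math

# Kernelized decision function: f(x) = sum_i alpha_i y_i K(x_i, x) + b.
def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, z)))

def decision(x, X, y, alphas, b, kernel=rbf):
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alphas, y, X)) + b

X_train = [[0.0, 0.0], [1.0, 1.0]]
y_train = [+1, -1]
alphas, b = [1.0, 1.0], 0.0        # placeholder values for illustration
print(decision([0.1, 0.0], X_train, y_train, alphas, b) > 0)  # near x_1 -> positive
```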

Examples of Common Kernels

  • Polynomial Kernel: K(x,xβ€²)=(α x⊀xβ€²+c)d,K(x, x') = \bigl(\alpha\, x^\top x' + c\bigr)^d, where Ξ±\alpha, cc, and dd are hyperparameters.
  • Radial Basis Function (RBF) Kernel: K(x,xβ€²)=exp⁑(βˆ’Ξ³β€‰βˆ₯xβˆ’xβ€²βˆ₯2),K(x, x') = \exp\Bigl(-\gamma\,\|x - x'\|^2\Bigr), with Ξ³>0\gamma > 0 controlling the spread.
  • What happens:
    • In training, the dual formulation of the SVM uses only dot products, which are computed as K(xi,xj)K(x_i, x_j).
    • The final classifier depends solely on these kernel evaluations rather than on explicit high-dimensional feature vectors.
  • What does not happen:
    • We do not build or store the high-dimensional vectors Ο•(x)\phi(x).
    • We do not directly compute large dot products Ο•(xi)β‹…Ο•(xj)\phi(x_i) \cdot \phi(x_j) in the transformed space.

Recap

  • Distance to Hyperplane: d=∣w⊀x+b∣βˆ₯wβˆ₯.d = \frac{|w^\top x + b|}{\|w\|}.
  • Margin:
    Imposing yi(w⊀xi+b)β‰₯1y_i\bigl(w^\top x_i + b\bigr) \ge 1 leads to Ξ³=1βˆ₯wβˆ₯\gamma = \frac{1}{\|w\|}.
  • Hard-Margin SVM: min⁑w,bβ€…β€Š12βˆ₯wβˆ₯2s.t.yi(w⊀xi+b)β‰₯1.\min_{w,b} \; \frac12 \|w\|^2 \quad \text{s.t.} \quad y_i\bigl(w^\top x_i + b\bigr) \ge 1.
  • Hinge Loss: max⁑{0, 1βˆ’yi(w⊀xi+b)}.\max\{0,\,1 - y_i\bigl(w^\top x_i + b\bigr)\}.
  • Soft-Margin SVM: min⁑w,bβ€…β€Š12βˆ₯wβˆ₯2+Cβˆ‘i=1nmax⁑{0, 1βˆ’yi(w⊀xi+b)}.\min_{w,b} \; \frac12 \|w\|^2 + C \sum_{i=1}^n \max\{0,\,1 - y_i\bigl(w^\top x_i + b\bigr)\}.
  • Kernel Trick: K(x,xβ€²)=Ο•(x)βŠ€Ο•(xβ€²),K(x,x') = \phi(x)^\top \phi(x'), which allows us to implicitly map data to a high-dimensional space without explicitly computing Ο•(x)\phi(x).
  • Kernelized Dual Form: maxβ‘Ξ±βˆ‘i=1nΞ±iβˆ’12βˆ‘i,j=1nΞ±iΞ±jyiyjK(xi,xj),\max_{\alpha}\quad \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i,x_j), subject to 0≀αi≀C0 \le \alpha_i \le C and βˆ‘i=1nΞ±iyi=0\sum_{i=1}^{n}\alpha_i y_i = 0.