Support Vector Machines (SVMs) are a powerful and versatile tool in machine learning, designed to tackle both linear and nonlinear classification problems. At their core, SVMs aim to find the optimal hyperplane that not only separates different classes but does so with the maximum margin. This margin, representing the smallest distance between the hyperplane and any data point, is key to the SVM's robustness and generalization capabilities.
In this blog, we'll delve into the essential components of SVMs, including:
Calculating the distance from a point to a hyperplane and defining the margin.
Formulating the hard-margin SVM and understanding how it finds the optimal separating hyperplane.
Extending to soft-margin SVM to handle misclassifications and margin violations.
Exploring the kernel trick, a powerful technique that enables SVMs to:
Efficiently classify complex, nonlinear data.
Map data into higher-dimensional spaces.
Avoid explicitly computing high-dimensional transformations.
Model Specification
Let $\{(x_i, y_i)\}_{i=1}^n$ be our training set, with $x_i \in \mathbb{R}^p$ and $y_i \in \{-1, +1\}$. We define:
$w \in \mathbb{R}^p$ as the normal vector,
$b \in \mathbb{R}$ as the bias,
the margin $\gamma$ as the minimum distance from the hyperplane to any data point.
In an SVM, the decision rule is given by the sign of $w^\top x + b$: if $w^\top x + b > 0$, then $y = +1$; otherwise $y = -1$. This simple yet powerful model underlies the classification process.
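As a minimal illustration, the decision rule is a dot product, a bias, and a sign. The weight vector and bias below are hypothetical values chosen for the example, not trained parameters:

```python
def decide(w, b, x):
    """Classify x as +1 or -1 from the sign of w.x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

w = [2.0, -1.0]   # hypothetical normal vector
b = -0.5          # hypothetical bias
print(decide(w, b, [1.0, 0.0]))   # score = 1.5  -> +1
print(decide(w, b, [0.0, 2.0]))   # score = -2.5 -> -1
```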
Margin and Perfect Separation
Background: Distance from a Point to a Hyperplane
In three-dimensional space, the perpendicular distance $d$ from a point $(x_1, y_1, z_1)$ to a plane
$$Ax + By + Cz + D = 0$$
is given by
$$d = \frac{|Ax_1 + By_1 + Cz_1 + D|}{\sqrt{A^2 + B^2 + C^2}}.$$
Generalizing this to $\mathbb{R}^p$, consider the hyperplane
$$w^\top x + b = 0,$$
where $w \in \mathbb{R}^p$ is the normal vector and $b \in \mathbb{R}$ is the bias term. For any point $x_0$, its perpendicular distance to this hyperplane is
$$\frac{|w^\top x_0 + b|}{\|w\|}.$$
Margin Derivation
For a point $x_i$ with label $y_i$, its signed distance to the hyperplane is
$$\frac{y_i(w^\top x_i + b)}{\|w\|}.$$
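Both distances are one-line computations. As a quick sanity check, the snippet below uses an arbitrary example plane $x + 2y + 2z - 3 = 0$ and the origin as the test point:

```python
import math

def distance(w, b, x0):
    """Perpendicular distance |w.x0 + b| / ||w||."""
    dot = sum(wi * xi for wi, xi in zip(w, x0)) + b
    return abs(dot) / math.sqrt(sum(wi * wi for wi in w))

def signed_distance(w, b, x, y):
    """Signed distance y * (w.x + b) / ||w||; positive iff x lies on its label's side."""
    dot = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y * dot / math.sqrt(sum(wi * wi for wi in w))

# Plane x + 2y + 2z - 3 = 0, i.e. w = [1, 2, 2], b = -3; the origin is |0 - 3| / 3 = 1 away.
print(distance([1.0, 2.0, 2.0], -3.0, [0.0, 0.0, 0.0]))             # 1.0
print(signed_distance([1.0, 2.0, 2.0], -3.0, [0.0, 0.0, 0.0], -1))  # +1.0 (correct side)
```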
To ensure correct classification with a buffer (the margin), we impose
$$y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n,$$
which fixes the scale of $w$ and gives a margin of $\gamma = \frac{1}{\|w\|}$. Maximizing the margin is then equivalent to minimizing $\|w\|$, so the hard-margin SVM solves
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.$$
In practice, data are rarely perfectly separable. To accommodate misclassifications or margin violations, we introduce slack variables $\xi_i \ge 0$. The modified constraint is
$$y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad i = 1, \dots, n.$$
The slack variable $\xi_i$ quantifies how much the constraint is violated: if the constraint is met exactly or exceeded, then $\xi_i = 0$; if not, $\xi_i$ is the shortfall. This violation can therefore be written as the hinge loss
$$\xi_i = \max\{0,\; 1 - y_i(w^\top x_i + b)\}.$$
It is zero when $y_i(w^\top x_i + b) \ge 1$ (i.e., when the data point is correctly classified with a sufficient margin) and increases linearly otherwise.
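The hinge loss is straightforward to compute directly. The weight vector below is a hypothetical example, not a trained model:

```python
def hinge_loss(w, b, x, y):
    """max(0, 1 - y*(w.x + b)): zero iff the point clears the margin."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

w, b = [1.0, 0.0], 0.0                   # illustrative, untrained parameters
print(hinge_loss(w, b, [2.0, 0.0], 1))   # margin 2.0 >= 1 -> loss 0.0
print(hinge_loss(w, b, [0.5, 0.0], 1))   # margin 0.5, shortfall -> loss 0.5
print(hinge_loss(w, b, [1.0, 0.0], -1))  # misclassified -> loss 2.0
```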
Soft-Margin SVM (Convex Optimization)
To build a formulation that penalizes margin violations, we incorporate the slack variables directly into the objective function, adding their total as a penalty term while keeping the relaxed constraints. The resulting soft-margin SVM optimization problem is:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n,$$
where $C > 0$ is a parameter that balances the trade-off between maximizing the margin and minimizing the misclassification error.
Using our derivation of the hinge loss, we can equivalently express the penalty term as $C \sum_{i=1}^n \max\{0,\; 1 - y_i(w^\top x_i + b)\}$.
Thus, the soft-margin SVM optimization problem becomes the unconstrained convex problem
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \max\{0,\; 1 - y_i(w^\top x_i + b)\}.$$
Because the objective is convex, it can be solved via methods such as quadratic programming or gradient-based algorithms.
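As a gradient-based illustration, a toy sketch (not a production solver) of plain subgradient descent on this unconstrained objective already separates a tiny 1-D dataset; the data and hyperparameters are arbitrary examples:

```python
def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=200):
    """Subgradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))."""
    p = len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        for x, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if yi * score < 1:  # hinge term active: margin violated or too small
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, x)]
                b += lr * C * yi
            else:               # only the regularizer (1/2)||w||^2 contributes
                w = [wj - lr * wj for wj in w]
    return w, b

# Toy 1-D data: negatives on the left, positives on the right
X = [[-2.0], [-1.5], [1.5], [2.0]]
y = [-1, -1, 1, 1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)  # [-1, -1, 1, 1]
```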
The Kernel Trick: Handling Nonlinear Boundaries
Some datasets cannot be well separated by a hyperplane in the original feature space. To address this, we map data into a (possibly high-dimensional) feature space using a transformation $\phi$:
$$\phi : \mathbb{R}^p \to \mathcal{H}, \qquad x \mapsto \phi(x).$$
This notation is read as: "The function $\phi$ takes an input $x$ from the original $p$-dimensional space and transforms it into a new representation $\phi(x)$ in a higher-dimensional Hilbert space $\mathcal{H}$." Here:
$\mathbb{R}^p$ is the original input space where our data lives.
$\mathcal{H}$ is a Hilbert space (often high-dimensional or even infinite-dimensional) in which the data may become linearly separable.
The arrow $x \mapsto \phi(x)$ indicates that each data point $x$ is transformed into $\phi(x)$.
In the feature space $\mathcal{H}$, the SVM decision boundary becomes
$$w^\top \phi(x) + b = 0.$$
Directly computing $\phi(x)$ might be impractical if $\mathcal{H}$ is very large. Instead, we define a kernel function $K$ such that
$$K(x, x') = \phi(x)^\top \phi(x').$$
This allows us to compute dot products in $\mathcal{H}$ using only the original input vectors, avoiding explicit computation of $\phi(x)$.
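To see this in miniature, take the homogeneous degree-2 polynomial kernel $K(x, x') = (x^\top x')^2$ on $\mathbb{R}^2$. It equals the inner product under the explicit feature map $\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2)$, which the snippet below verifies numerically on an arbitrary pair of points:

```python
import math

def poly_kernel(x, xp):
    """Homogeneous degree-2 polynomial kernel K(x, x') = (x . x')^2."""
    return sum(a * b for a, b in zip(x, xp)) ** 2

def phi(x):
    """Explicit degree-2 feature map for 2-D input: (x1^2, x2^2, sqrt(2)*x1*x2)."""
    x1, x2 = x
    return [x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2]

x, xp = [1.0, 2.0], [3.0, 0.5]
lhs = poly_kernel(x, xp)                               # kernel in the input space
rhs = sum(a * b for a, b in zip(phi(x), phi(xp)))      # inner product in feature space
print(lhs, abs(lhs - rhs) < 1e-9)  # 16.0 True
```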
An RKHS (Reproducing Kernel Hilbert Space) is a Hilbert space of functions in which point evaluation is continuous. The kernel $K(x, x')$ implicitly defines such a space, where it acts as an inner product; this provides the theoretical foundation for the kernel trick, since a valid kernel is guaranteed to satisfy $K(x, x') = \phi(x)^\top \phi(x')$ for some feature map $\phi$ into a (possibly infinite-dimensional) Hilbert space.
Primal and Dual Formulations of SVM
In optimization, the primal problem is the original formulation where we directly minimize an objective function over the decision variables subject to constraints. For example, a general constrained problem can be written as:
$$\begin{aligned} \text{minimize} \quad & f(x) \\ \text{subject to} \quad & g_i(x) \le 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$
The dual problem is derived by introducing Lagrange multipliers to form a lower bound on the primal objective; this is done by taking the infimum of the Lagrangian over the primal variables. Under conditions such as convexity and Slater's condition, strong duality holds, meaning the optimal values of the primal and dual problems are equal. In the case of a soft-margin SVM, the primal formulation seeks the weight vector $w$, bias $b$, and slack variables $\xi$ that balance maximizing the margin against minimizing classification errors. Its objective is:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
The Lagrangian for this problem, with multipliers $\alpha_i \ge 0$ for the margin constraints and $\beta_i \ge 0$ for the constraints $\xi_i \ge 0$, is
$$\mathcal{L}(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i \big[ y_i(w^\top x_i + b) - 1 + \xi_i \big] - \sum_{i=1}^n \beta_i \xi_i.$$
By differentiating the Lagrangian $\mathcal{L}$ with respect to the primal variables and setting the derivatives to zero, we obtain the necessary optimality conditions known as the Karush-Kuhn-Tucker (KKT) conditions that must be satisfied at the optimal solution of the constrained problem. For the SVM problem, the stationarity conditions are:
$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0, \qquad \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; C - \alpha_i - \beta_i = 0.$$
Since $\beta_i = C - \alpha_i \ge 0$, it follows that $0 \le \alpha_i \le C$. Substituting these conditions back into the Lagrangian yields the dual problem
$$\max_\alpha \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, x_i^\top x_j, \quad \text{subject to} \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^n \alpha_i y_i = 0.$$
This dual formulation expresses the optimization solely in terms of the Lagrange multipliers and inner products between data points, paving the way for the kernel trick.
Objective Function with a Kernel
To handle nonlinear boundaries, we map data into a high-dimensional feature space using a function $\phi$, and define the kernel function $K(x, x') = \phi(x)^\top \phi(x')$. In this space, the soft-margin SVM primal objective becomes:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(w^\top \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.$$
Here, the term $\|w\|^2$ controls the margin's width. Following the dual derivation, by enforcing the constraints with Lagrange multipliers and applying the KKT conditions, the weight vector in the feature space is expressed as:
$$w = \sum_{i=1}^n \alpha_i y_i \phi(x_i).$$
Substituting this expression back into the norm term yields
$$\|w\|^2 = \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, \phi(x_i)^\top \phi(x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
so the dual problem becomes
$$\max_\alpha \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$. This derivation shows how the $L_2$ regularization in the primal problem is transformed into kernel evaluations in the dual, enabling SVMs to efficiently handle nonlinear boundaries without explicitly computing $\phi(x)$.
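To make the dual concrete, here is a toy sketch (not a production solver) that runs projected gradient ascent on the kernelized dual; for simplicity it drops the bias term, and with it the equality constraint $\sum_i \alpha_i y_i = 0$:

```python
def solve_dual(X, y, kernel, C=1.0, lr=0.01, steps=500):
    """Projected gradient ascent on max_a sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K_ij,
    keeping each a_i in the box [0, C] (bias and equality constraint omitted)."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            grad = 1.0 - y[i] * sum(alpha[j] * y[j] * K[i][j] for j in range(n))
            alpha[i] = min(C, max(0.0, alpha[i] + lr * grad))  # project onto [0, C]
    return alpha

def predict(X, y, alpha, kernel, x_new):
    """Classify via sign(sum_i alpha_i y_i K(x_i, x_new)): kernel evaluations only."""
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    return 1 if score > 0 else -1

linear = lambda u, v: sum(p * q for p, q in zip(u, v))  # K(x, x') = x.x'
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [-1, -1, 1, 1]
alpha = solve_dual(X, y, linear)
print([predict(X, y, alpha, linear, x) for x in X])  # [-1, -1, 1, 1]
```

Note how `predict` never forms a weight vector in feature space: swapping `linear` for any other kernel changes the decision boundary without touching the solver.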
Examples of Common Kernels
Polynomial Kernel: $K(x, x') = (\alpha\, x^\top x' + c)^d$,
where $\alpha$, $c$, and $d$ are hyperparameters.
Radial Basis Function (RBF) Kernel: $K(x, x') = \exp(-\gamma \|x - x'\|^2)$,
with $\gamma > 0$ controlling the spread.
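Both kernels are one-liners. The sketch below uses illustrative defaults ($\alpha = 1$, $c = 1$, $d = 2$, $\gamma = 0.5$), which are example values, not recommended settings:

```python
import math

def polynomial_kernel(x, xp, alpha=1.0, c=1.0, d=2):
    """K(x, x') = (alpha * x.x' + c)^d."""
    return (alpha * sum(a * b for a, b in zip(x, xp)) + c) ** d

def rbf_kernel(x, xp, gamma=0.5):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xp))
    return math.exp(-gamma * sq)

x, xp = [1.0, 0.0], [0.0, 1.0]
print(polynomial_kernel(x, xp))     # (1*0 + 1)^2 = 1.0
print(rbf_kernel(x, x))             # identical points -> exp(0) = 1.0
print(round(rbf_kernel(x, xp), 4))  # exp(-0.5 * 2) = exp(-1) ~ 0.3679
```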
What happens:
In training, the dual formulation of the SVM uses only dot products, which are computed as $K(x_i, x_j)$.
The final classifier depends solely on these kernel evaluations rather than on explicit high-dimensional feature vectors.
What does not happen:
We do not build or store the high-dimensional vectors $\phi(x)$.
We do not directly compute large dot products $\phi(x_i) \cdot \phi(x_j)$ in the transformed space.
Recap
Distance to Hyperplane: $d = \frac{|w^\top x + b|}{\|w\|}$.
Margin: imposing $y_i(w^\top x_i + b) \ge 1$ leads to $\gamma = \frac{1}{\|w\|}$.
Kernel Trick: $K(x, x') = \phi(x)^\top \phi(x')$,
which allows us to implicitly map data to a high-dimensional space without explicitly computing $\phi(x)$.
Kernelized Dual Form:
$$\max_\alpha \; \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n \alpha_i \alpha_j y_i y_j \, K(x_i, x_j),$$
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^n \alpha_i y_i = 0$.