31 December 2022

Support Vector Machines: Geometry, Kernels, and Practical Tradeoffs

SVMs from first principles — the margin maximization objective, soft margins, the kernel trick, and the practical cases where SVMs outperform and where they don't.

The Geometric Intuition

SVMs find a separating hyperplane $\pi$ that maximizes the margin — the distance between two parallel planes $\pi^+$ and $\pi^-$, where:

The hyperplane is defined as $\pi: w^T x + b = 0$, with $\pi^+: w^T x + b = 1$ and $\pi^-: w^T x + b = -1$. The margin width is $\frac{2}{\|w\|}$.

Why maximize margin? A classifier with the maximum margin is the one with the most “room” on either side of the decision boundary — it is least sensitive to small perturbations in the data.

A useful geometric construction: compute the convex hull (the smallest convex set containing all points) for each class separately. Find the shortest line segment connecting the two convex hulls. The perpendicular bisector of this segment is the margin-maximizing hyperplane.
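
A minimal sketch of that construction on 2-D toy data, assuming numpy and scipy are available. Note that it only searches over hull vertices, which is an approximation: the true closest pair can lie on hull edges.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
A = rng.normal(loc=[0, 0], scale=0.5, size=(40, 2))   # class -1
B = rng.normal(loc=[3, 3], scale=0.5, size=(40, 2))   # class +1

# Vertices of each class's convex hull
hull_a = A[ConvexHull(A).vertices]
hull_b = B[ConvexHull(B).vertices]

# Closest pair of hull vertices (approximation: the true closest points
# may lie on hull edges rather than at vertices)
d = np.linalg.norm(hull_a[:, None, :] - hull_b[None, :, :], axis=-1)
i, j = np.unravel_index(d.argmin(), d.shape)
p, q = hull_a[i], hull_b[j]

# The separating hyperplane is the perpendicular bisector of segment pq:
# normal vector w points along q - p and the plane passes through the midpoint
w = q - p
b = -w @ ((p + q) / 2)
print("w =", w, "b =", b)
```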

Hard Margin SVM

The optimization problem for linearly separable data:

$$\min_{w, b} \frac{\|w\|}{2} \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 \quad \forall i$$

This requires all points to be correctly classified with margin ≥ 1 — which fails if the classes overlap or there are outliers.
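
In practice a hard margin can be approximated with a soft-margin solver and a very large $C$. A rough sketch with scikit-learn on two well-separated blobs (the data here is illustrative), which also recovers the $\frac{2}{\|w\|}$ margin width from above:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs, so a hard margin exists
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

# A very large C makes margin violations prohibitively expensive,
# approximating the hard-margin problem
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin_width = 2 / np.linalg.norm(w)          # the 2 / ||w|| from above
print("margin width:", margin_width)
print("number of support vectors:", len(clf.support_vectors_))
```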

Soft Margin SVM

The practical version introduces slack variables $\zeta_i \ge 0$ that allow points to violate the margin:

$$\min_{w, b} \frac{\|w\|}{2} + C \cdot \frac{1}{n} \sum_i \zeta_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0$$

At the optimum each slack equals the hinge loss, $\zeta_i = \max(0, 1 - y_i(w^T x_i + b))$, so the term $\frac{1}{n}\sum_i \zeta_i$ is the average hinge loss.
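
The connection to hinge loss can be checked directly. A small numpy sketch of this objective for a fixed $(w, b)$, with illustrative toy values:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """||w||/2 plus C times the average hinge loss, as in the formula above."""
    margins = y * (X @ w + b)                     # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)        # zeta_i = max(0, 1 - margin)
    return np.linalg.norm(w) / 2 + C * hinge.mean()

# Toy check: the first point has margin >= 1 and contributes zero slack,
# the second violates the margin and contributes positive hinge loss
X = np.array([[2.0, 0.0], [0.2, 0.1]])
y = np.array([1, -1])
print(soft_margin_objective(np.array([1.0, 0.0]), 0.0, X, y, C=1.0))
```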

The Dual Form and the Kernel Trick

Converting the primal form to its Lagrangian dual gives a formulation that only involves dot products:

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^T x_j \quad \text{s.t.} \quad 0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0$$

The key insight: anywhere $x_i^T x_j$ appears (a dot product / similarity), we can substitute a kernel function $K(x_i, x_j)$ that computes similarity in a high-dimensional feature space without explicitly computing the transformation.

This is the kernel trick: implicit feature mapping into high-dimensional spaces at the cost of computing $K$ — often much cheaper than explicit transformation.
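
One way to see the dual in action: the fitted decision function only needs kernel evaluations against the support vectors, $f(x) = \sum_{i \in SV} \alpha_i y_i K(x_i, x) + b$. A sketch verifying this against scikit-learn's own output (RBF kernel, toy data; `dual_coef_` stores the products $\alpha_i y_i$):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)           # a non-linear target

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Manual decision function using only the support vectors
K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)    # K(x, x_i)
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.allclose(manual, clf.decision_function(X)))  # True
```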

Common Kernels

Polynomial kernel: $$K(x_i, x_j) = (x_i^T x_j + c)^d$$

Corresponds to all polynomial feature interactions up to degree $d$.
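
For a concrete check, take $d = 2$, $c = 1$ in two dimensions: the kernel value equals a dot product in an explicit six-dimensional feature space. A numpy sketch (the feature map below is one valid choice, not the only one):

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=2):
    return (x @ z + c) ** d

def phi(x):
    # Explicit degree-2 feature map for 2-D input with c = 1
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(poly_kernel(x, z), phi(x) @ phi(z))   # both equal 4.0
```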

Radial Basis Function (RBF) kernel: $$K_{\text{RBF}}(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$

The RBF kernel behaves somewhat like KNN: as the distance $\|x_i - x_j\|$ grows, the similarity decays toward zero. $\sigma$ controls the width: larger $\sigma$ means a point's “influence” extends further, analogous to a larger $K$ in KNN. In that sense an RBF SVM resembles a smoothed KNN that, at inference time, only compares against the support vectors rather than the whole training set.
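
A quick numeric illustration of the width parameter (values chosen arbitrarily): the same pair of points looks similar under a large $\sigma$ and dissimilar under a small one.

```python
import numpy as np

def rbf(x, z, sigma):
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([0.0, 0.0])
z = np.array([3.0, 0.0])          # distance 3

for sigma in (0.5, 1.0, 5.0):
    print(f"sigma={sigma}: K = {rbf(x, z, sigma):.4f}")
# sigma=0.5: K ~ 0.0000, sigma=1.0: K ~ 0.0111, sigma=5.0: K ~ 0.8353
```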

Domain-specific kernels: string kernels for text classification, graph kernels for molecular or network data, genomic sequence kernels — the SVM framework applies to any domain where a valid similarity function (satisfying Mercer’s theorem) can be defined.

Complexity

Phase / Complexity
Training: $O(n^2)$ — dominated by computing the $n \times n$ kernel matrix
Inference: $O(k \cdot d)$ — only support vectors participate ($k$ is typically $\ll n$)

This makes SVMs expensive to train on large datasets but fast at inference — particularly when $k$ (number of support vectors) is small.
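
A back-of-the-envelope way to feel the $O(n^2)$ cost: the kernel matrix alone becomes unmanageable well before training time does.

```python
# Kernel matrix memory at float64 (8 bytes per entry) -- rough arithmetic
for n in (10_000, 100_000, 1_000_000):
    gib = n * n * 8 / 2**30
    print(f"n = {n:>9,}: kernel matrix ~ {gib:,.1f} GiB")
# n =    10,000: ~ 0.7 GiB
# n =   100,000: ~ 74.5 GiB
# n = 1,000,000: ~ 7,450.6 GiB
```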

Nu-SVM

An alternative parameterization replaces $C$ with $\nu \in (0, 1]$: $\nu$ is an upper bound on the fraction of margin violations and a lower bound on the fraction of support vectors.

This makes $\nu$ more interpretable than $C$ for controlling sparsity and the error budget.
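
A short sketch with scikit-learn's `NuSVC` on noisy, overlapping toy classes, where the lower bound on the support-vector fraction is easiest to see:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1, 1.0, (200, 2)), rng.normal(1, 1.0, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)   # overlapping classes

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    frac_sv = len(clf.support_) / len(X)
    # nu acts as a lower bound on the support-vector fraction
    print(f"nu={nu}: support-vector fraction = {frac_sv:.2f}")
```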

When SVMs Shine

High-dimensional data with relatively few samples, problems where a domain-specific kernel captures similarity better than raw features, and settings where fast inference matters, since only the support vectors are needed at prediction time.

When SVMs Struggle

Large datasets, where the roughly $O(n^2)$ kernel computation makes training impractical, and applications that need calibrated probability outputs, which SVMs only provide through an extra step such as Platt scaling.

Practical Decision Guide

Situation / Recommendation
$n$ small, $d$ large: SVM with linear kernel
$n$ small/medium, $d$ small: SVM with RBF kernel
$n$ large: logistic regression or gradient boosting
Need probability outputs: add Platt scaling or switch to logistic regression
Custom similarity defined: SVM with custom kernel
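
For the probability-output row: scikit-learn's `SVC(probability=True)` fits a Platt-style sigmoid on top of the decision values via internal cross-validation, at extra training cost. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# probability=True enables Platt-style calibration of the decision values;
# it runs an internal cross-validation, so training is noticeably slower
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
print(clf.predict_proba(X[:3]))    # calibrated class probabilities
```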

Note: on linearly separable data, logistic regression and a linear SVM typically learn very similar decision boundaries; the differences lie in the loss function (logistic vs. hinge) and in how the optimization is carried out. The practical advantage of logistic regression on large datasets is that stochastic gradient descent scales to millions of samples in a way that standard SVM solvers do not.
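
That scaling difference is visible in scikit-learn's `SGDClassifier`, where the same stochastic solver yields a linear SVM or logistic regression depending only on the loss (loss names as of recent scikit-learn versions; older releases used `'log'` instead of `'log_loss'`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Same stochastic solver, two losses:
# hinge -> linear SVM, log_loss -> logistic regression
for loss in ("hinge", "log_loss"):
    clf = SGDClassifier(loss=loss, alpha=1e-4, max_iter=1000, random_state=0)
    clf.fit(X, y)   # O(n) per epoch; partial_fit also supports streaming data
    print(loss, clf.score(X, y))
```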
