The Geometric Intuition
SVMs find a separating hyperplane $\pi$ that maximizes the margin — the distance between two parallel planes $\pi^+$ and $\pi^-$, where:
- $\pi^+$ passes through the nearest positive point(s) to $\pi$
- $\pi^-$ passes through the nearest negative point(s) to $\pi$
- The points on $\pi^+$ and $\pi^-$ are the support vectors
The hyperplane is defined as $\pi: w^T x + b = 0$, with $\pi^+: w^T x + b = 1$ and $\pi^-: w^T x + b = -1$. The margin width is $\frac{2}{\|w\|}$.
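To make the margin formula concrete, here is a minimal sketch (assuming scikit-learn and a made-up toy dataset) that reads the width $\frac{2}{\|w\|}$ off a fitted linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, cleanly separated classes along the first axis (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                     # the weight vector w defining pi
margin_width = 2.0 / np.linalg.norm(w)
print(margin_width)                  # ~2.0 here: the classes sit 2 units apart
```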
Why maximize margin? A classifier with the maximum margin is the one with the most “room” on either side of the decision boundary — it is least sensitive to small perturbations in the data.
A useful geometric construction: compute the convex hull (the smallest convex region containing all of a class's points) for each class separately, then find the shortest line segment connecting the two hulls. The perpendicular bisector of this segment is the margin-maximizing hyperplane.
Hard Margin SVM
The optimization problem for linearly separable data:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 \quad \forall i$$
This requires all points to be correctly classified with margin ≥ 1 — which fails if the classes overlap or there are outliers.
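A sketch of that failure mode (scikit-learn and illustrative data assumed): a single outlier forces a much narrower margin when $C$ is huge, which approximates the hard-margin problem, while a moderate $C$ keeps the margin wide.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0],
              [0.9, 0.5]])            # last point: a positive-class outlier near the negatives
y = np.array([-1, -1, 1, 1, 1])

for C in (1e6, 1.0):                  # ~hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C = {C:g}: margin width = {width:.2f}")   # narrow for huge C, wider for C = 1
```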
Soft Margin SVM
The practical version introduces slack variables $\zeta_i \ge 0$ that allow points to violate the margin:
$$\min_{w, b} \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \zeta_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0 \quad \forall i$$
- $\zeta_i = 0$: point is correctly classified outside the margin
- $0 < \zeta_i \le 1$: point is inside the margin but correctly classified
- $\zeta_i > 1$: point is misclassified
- C controls the tradeoff: high C → penalize violations heavily → smaller margin, risk of overfitting; low C → allow more violations → wider margin, risk of underfitting
At the optimum each slack equals the hinge loss, $\zeta_i = \max(0, 1 - y_i(w^T x_i + b))$, so the term $\frac{1}{n}\sum_i \zeta_i$ is the average hinge loss over the training set.
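A small sketch (scikit-learn and synthetic data assumed) that recovers the slacks from the decision function and buckets points into the three cases above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)               # f(x_i) = w^T x_i + b
zeta = np.maximum(0.0, 1.0 - y * f)        # slack / hinge loss per point

print("average hinge loss:", zeta.mean())
print("outside margin (zeta == 0):", int((zeta == 0).sum()))
print("inside margin (0 < zeta <= 1):", int(((zeta > 0) & (zeta <= 1)).sum()))
print("misclassified (zeta > 1):", int((zeta > 1).sum()))
```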
The Dual Form and the Kernel Trick
Converting the primal form to its Lagrangian dual gives a formulation that only involves dot products:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$
The key insight: anywhere $x_i^T x_j$ appears (a dot product / similarity), we can substitute a kernel function $K(x_i, x_j)$ that computes similarity in a high-dimensional feature space without explicitly computing the transformation.
This is the kernel trick: implicit feature mapping into high-dimensional spaces at the cost of computing $K$ — often much cheaper than explicit transformation.
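The following sketch, assuming scikit-learn, makes the dual concrete: `SVC` stores the signed dual coefficients $\alpha_i y_i$ in `dual_coef_` and the support vectors in `support_vectors_`, so the decision function can be rebuilt as $\sum_i \alpha_i y_i K(x_i, x) + b$ using only kernel evaluations (the toy data and RBF kernel choice are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

X_new = rng.normal(0, 1, (5, 2))
K = rbf_kernel(X_new, clf.support_vectors_, gamma=0.5)   # K(x, x_i) for each support vector
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]     # sum_i alpha_i y_i K(x_i, x) + b

print(np.allclose(f_manual, clf.decision_function(X_new)))  # True
```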
Common Kernels
Polynomial kernel: $$K(x_i, x_j) = (x_i^T x_j + c)^d$$
Corresponds to all polynomial feature interactions up to degree $d$.
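A quick numeric check (an illustrative example, not from the text) that the degree-2 polynomial kernel $(x^T z + c)^2$ equals an ordinary dot product in an explicitly expanded feature space:

```python
import numpy as np

def phi(x, c):
    """Explicit degree-2 feature map for K(x, z) = (x.z + c)^2."""
    pairwise = np.outer(x, x).ravel()      # all products x_i * x_j
    linear = np.sqrt(2 * c) * x            # cross terms with the constant
    return np.concatenate([pairwise, linear, [c]])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
c = 1.0

print((x @ z + c) ** 2)          # kernel value, computed in the original space
print(phi(x, c) @ phi(z, c))     # identical value via the explicit feature map
```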
Radial Basis Function (RBF) kernel: $$K_{\text{RBF}}(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
The RBF kernel behaves like a distance-weighted similarity, reminiscent of KNN: as the distance $\|x_i - x_j\|$ grows, similarity decays. $\sigma$ controls the width: larger $\sigma$ means a point's “influence” extends further, analogous to a larger $K$ in KNN. Unlike KNN, though, inference only touches the support vectors, so an RBF SVM can serve as a much cheaper stand-in for KNN-style decisions on large datasets.
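One practical note, sketched under the assumption of scikit-learn's parameterization: `SVC(kernel="rbf")` uses $\exp(-\gamma\|x_i - x_j\|^2)$, so the $\sigma$ above corresponds to $\gamma = \frac{1}{2\sigma^2}$ (the chosen $\sigma$ value is arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

sigma = 2.0
gamma = 1.0 / (2.0 * sigma ** 2)      # translate sigma into scikit-learn's gamma

x = np.array([[0.0, 0.0]])
others = np.array([[0.5, 0.0], [2.0, 0.0], [5.0, 0.0]])

# Similarity falls off with distance; larger sigma (smaller gamma) decays more slowly.
print(rbf_kernel(x, others, gamma=gamma))
```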
Domain-specific kernels: string kernels for text classification, graph kernels for molecular or network data, genomic sequence kernels — the SVM framework applies to any domain where a valid similarity function (satisfying Mercer’s theorem) can be defined.
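As a sketch of how a domain-specific similarity plugs in (assuming scikit-learn), `SVC(kernel="precomputed")` accepts a Gram matrix of pairwise similarities; here an RBF Gram matrix stands in for whatever string or graph kernel your domain defines:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
y_train = np.array([0] * 15 + [1] * 15)
X_test = rng.normal(size=(5, 4))

K_train = rbf_kernel(X_train, X_train)   # n_train x n_train Gram matrix (stand-in for a custom kernel)
K_test = rbf_kernel(X_test, X_train)     # n_test x n_train similarities to the training set

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```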
Complexity
| Phase | Complexity |
|---|---|
| Training | $O(n^2)$ to $O(n^3)$ — building the $n \times n$ kernel matrix and solving the quadratic program |
| Inference | $O(k \cdot d)$ — only support vectors participate ($k$ is typically $\ll n$) |
This makes SVMs expensive to train on large datasets but fast at inference — particularly when $k$ (number of support vectors) is small.
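A quick sketch (toy data assumed) of inspecting $k$, the number of support vectors, which is what inference cost actually depends on:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 2000
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.array([-1] * (n // 2) + [1] * (n // 2))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# When the classes separate cleanly, k stays far below n.
print(f"n = {n}, support vectors k = {len(clf.support_)}")
```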
Nu-SVM
An alternative parameterization replaces $C$ with $\nu \in (0, 1]$:
- $\nu$ is an upper bound on the fraction of margin errors
- $\nu$ is a lower bound on the fraction of support vectors
This makes $\nu$ more interpretable than $C$ for controlling sparsity and error budget.
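A sketch of the $\nu$ parameterization using scikit-learn's `NuSVC` (the value $\nu = 0.1$ and the synthetic data are illustrative choices): the fitted fraction of support vectors should come out at or above $\nu$, while the fraction of margin errors stays at or below it.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)

clf = NuSVC(nu=0.1, kernel="rbf").fit(X, y)
frac_sv = len(clf.support_) / len(X)
print(f"fraction of support vectors = {frac_sv:.2f} (>= nu = 0.1)")
```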
When SVMs Shine
- High-dimensional data: SVMs work well when the number of features exceeds the number of samples — they regularize via margin maximization even in very high-dimensional spaces
- Clear margin of separation: when classes are well-separated, SVMs find the global optimum margin exactly
- Custom similarity: when you have a domain-specific similarity function, the kernel trick makes SVMs the natural choice
When SVMs Struggle
- Large $n$: training complexity is $O(n^2)$ or worse — impractical above ~100K samples
- Heavy class overlap: the soft margin helps, but many points become support vectors and training converges slowly
- Output probabilities needed: SVMs do not natively output calibrated probabilities; Platt scaling or isotonic regression must be added post-training
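A hedged sketch of both calibration options, assuming scikit-learn and synthetic data: `SVC(probability=True)` fits Platt scaling internally, while `CalibratedClassifierCV` offers sigmoid (Platt) or isotonic calibration on top of any SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

platt = SVC(kernel="rbf", probability=True).fit(X, y)                        # Platt scaling
iso = CalibratedClassifierCV(SVC(kernel="rbf"), method="isotonic", cv=3).fit(X, y)

x_new = np.array([[0.2, -0.1]])
print(platt.predict_proba(x_new))
print(iso.predict_proba(x_new))
```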
Practical Decision Guide
| Situation | Recommendation |
|---|---|
| $n$ small, $d$ large | SVM with linear kernel |
| $n$ small/medium, $d$ small | SVM with RBF kernel |
| $n$ large | Logistic regression or gradient boosting |
| Need probability outputs | Add Platt scaling or switch to logistic regression |
| Custom similarity defined | SVM with custom kernel |
Note: logistic regression and a linear SVM optimize closely related surrogate losses (log loss vs. hinge loss) and usually yield similar decision boundaries on linearly separable data; the practical difference is implementation. On large datasets, logistic regression (or a linear SVM trained the same way) can be fit with stochastic gradient descent, which scales to millions of samples in a way that standard kernel SVM solvers do not.
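A closing sketch of that large-$n$ path (scikit-learn assumed; the loss names follow recent versions): `SGDClassifier` trains a linear SVM with `loss="hinge"` or logistic regression with `loss="log_loss"`, scaling roughly linearly in the number of samples.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n = 100_000                                  # synthetic data, sized to show the SGD route
X = np.vstack([rng.normal(-1, 1, (n // 2, 20)), rng.normal(1, 1, (n // 2, 20))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

linear_svm = SGDClassifier(loss="hinge").fit(X, y)      # linear SVM objective via SGD
logreg = SGDClassifier(loss="log_loss").fit(X, y)       # logistic regression via SGD
print(linear_svm.score(X, y), logreg.score(X, y))
```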