The Geometric Intuition
SVMs find a separating hyperplane $\pi$ that maximizes the margin — the distance between two parallel planes $\pi^+$ and $\pi^-$, where:
- $\pi^+$ passes through the nearest positive point(s) to $\pi$
- $\pi^-$ passes through the nearest negative point(s) to $\pi$
- The points on $\pi^+$ and $\pi^-$ are the support vectors
The hyperplane is defined as $\pi: w^T x + b = 0$, with $\pi^+: w^T x + b = 1$ and $\pi^-: w^T x + b = -1$. The margin width is $\frac{2}{\|w\|}$.
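To make the margin formula concrete, here is a minimal sketch (assuming scikit-learn and a made-up toy dataset) that reads the width $\frac{2}{\|w\|}$ off a fitted linear SVM:

```python
import numpy as np
from sklearn.svm import SVC

# Two tiny, cleanly separated classes along the first axis (illustrative data).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6)    # very large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]                     # the weight vector w defining pi
margin_width = 2.0 / np.linalg.norm(w)
print(margin_width)                  # ~2.0 here: the classes sit 2 units apart
```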
Why maximize margin? A classifier with the maximum margin is the one with the most “room” on either side of the decision boundary — it is least sensitive to small perturbations in the data.
A useful geometric construction: compute the convex hull (the smallest convex region containing all of a class's points) for each class separately, then find the shortest line segment connecting the two hulls. The perpendicular bisector of this segment is the margin-maximizing hyperplane.
Hard Margin SVM
The optimization problem for linearly separable data:
$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 \quad \forall i$$
This requires all points to be correctly classified with margin ≥ 1 — which fails if the classes overlap or there are outliers.
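A sketch of that failure mode (scikit-learn and illustrative data assumed): a single outlier forces a much narrower margin when $C$ is huge, which approximates the hard-margin problem, while a moderate $C$ keeps the margin wide.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0],
              [0.9, 0.5]])            # last point: a positive-class outlier near the negatives
y = np.array([-1, -1, 1, 1, 1])

for C in (1e6, 1.0):                  # ~hard margin vs. soft margin
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C = {C:g}: margin width = {width:.2f}")   # narrow for huge C, wider for C = 1
```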
Soft Margin SVM
The practical version introduces slack variables $\zeta_i \ge 0$ that allow points to violate the margin:
$$\min_{w, b} \frac{1}{2}\|w\|^2 + \frac{C}{n} \sum_i \zeta_i \quad \text{s.t.} \quad y_i(w^T x_i + b) \ge 1 - \zeta_i, \quad \zeta_i \ge 0 \quad \forall i$$
- $\zeta_i = 0$: point is correctly classified outside the margin
- $0 < \zeta_i \le 1$: point is inside the margin but correctly classified
- $\zeta_i > 1$: point is misclassified
- C controls the tradeoff: high C → penalize violations heavily → smaller margin, risk of overfitting; low C → allow more violations → wider margin, risk of underfitting
At the optimum each slack equals the hinge loss, $\zeta_i = \max(0, 1 - y_i(w^T x_i + b))$, so the term $\frac{1}{n}\sum_i \zeta_i$ is the average hinge loss over the training set.
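A small sketch (scikit-learn and synthetic data assumed) that recovers the slacks from the decision function and buckets points into the three cases above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
f = clf.decision_function(X)               # f(x_i) = w^T x_i + b
zeta = np.maximum(0.0, 1.0 - y * f)        # slack / hinge loss per point

print("average hinge loss:", zeta.mean())
print("outside margin (zeta == 0):", int((zeta == 0).sum()))
print("inside margin (0 < zeta <= 1):", int(((zeta > 0) & (zeta <= 1)).sum()))
print("misclassified (zeta > 1):", int((zeta > 1).sum()))
```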
The Dual Form and the Kernel Trick
Converting the primal form to its Lagrangian dual gives a formulation that only involves dot products:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C$$
The key insight: anywhere $x_i^T x_j$ appears (a dot product / similarity), we can substitute a kernel function $K(x_i, x_j)$ that computes similarity in a high-dimensional feature space without explicitly computing the transformation.
This is the kernel trick: implicit feature mapping into high-dimensional spaces at the cost of computing $K$ — often much cheaper than explicit transformation.
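The following sketch, assuming scikit-learn, makes the dual concrete: `SVC` stores the signed dual coefficients $\alpha_i y_i$ in `dual_coef_` and the support vectors in `support_vectors_`, so the decision function can be rebuilt as $\sum_i \alpha_i y_i K(x_i, x) + b$ using only kernel evaluations (the toy data and RBF kernel choice are illustrative):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([-1] * 40 + [1] * 40)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

X_new = rng.normal(0, 1, (5, 2))
K = rbf_kernel(X_new, clf.support_vectors_, gamma=0.5)   # K(x, x_i) for each support vector
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]     # sum_i alpha_i y_i K(x_i, x) + b

print(np.allclose(f_manual, clf.decision_function(X_new)))  # True
```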
Common Kernels
Polynomial kernel: $$K(x_i, x_j) = (x_i^T x_j + c)^d$$
Corresponds to all polynomial feature interactions up to degree $d$.
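A quick numeric check (an illustrative example, not from the text) that the degree-2 polynomial kernel $(x^T z + c)^2$ equals an ordinary dot product in an explicitly expanded feature space:

```python
import numpy as np

def phi(x, c):
    """Explicit degree-2 feature map for K(x, z) = (x.z + c)^2."""
    pairwise = np.outer(x, x).ravel()      # all products x_i * x_j
    linear = np.sqrt(2 * c) * x            # cross terms with the constant
    return np.concatenate([pairwise, linear, [c]])

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
c = 1.0

print((x @ z + c) ** 2)          # kernel value, computed in the original space
print(phi(x, c) @ phi(z, c))     # identical value via the explicit feature map
```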
Radial Basis Function (RBF) kernel: $$K_{\text{RBF}}(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$$
The RBF kernel behaves like a distance-weighted similarity, reminiscent of KNN: as the distance $\|x_i - x_j\|$ grows, similarity decays. $\sigma$ controls the width: larger $\sigma$ means a point's “influence” extends further, analogous to a larger $K$ in KNN. Unlike KNN, though, inference only touches the support vectors, so an RBF SVM can serve as a much cheaper stand-in for KNN-style decisions on large datasets.
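One practical note, sketched under the assumption of scikit-learn's parameterization: `SVC(kernel="rbf")` uses $\exp(-\gamma\|x_i - x_j\|^2)$, so the $\sigma$ above corresponds to $\gamma = \frac{1}{2\sigma^2}$ (the chosen $\sigma$ value is arbitrary):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

sigma = 2.0
gamma = 1.0 / (2.0 * sigma ** 2)      # translate sigma into scikit-learn's gamma

x = np.array([[0.0, 0.0]])
others = np.array([[0.5, 0.0], [2.0, 0.0], [5.0, 0.0]])

# Similarity falls off with distance; larger sigma (smaller gamma) decays more slowly.
print(rbf_kernel(x, others, gamma=gamma))
```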
Domain-specific kernels: string kernels for text classification, graph kernels for molecular or network data, genomic sequence kernels — the SVM framework applies to any domain where a valid similarity function (satisfying Mercer’s theorem) can be defined.
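As a sketch of how a domain-specific similarity plugs in (assuming scikit-learn), `SVC(kernel="precomputed")` accepts a Gram matrix of pairwise similarities; here an RBF Gram matrix stands in for whatever string or graph kernel your domain defines:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X_train = rng.normal(size=(30, 4))
y_train = np.array([0] * 15 + [1] * 15)
X_test = rng.normal(size=(5, 4))

K_train = rbf_kernel(X_train, X_train)   # n_train x n_train Gram matrix (stand-in for a custom kernel)
K_test = rbf_kernel(X_test, X_train)     # n_test x n_train similarities to the training set

clf = SVC(kernel="precomputed").fit(K_train, y_train)
print(clf.predict(K_test))
```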
Complexity
| Phase | Complexity |
|---|---|
| Training | $O(n^2)$ to $O(n^3)$ — building the $n \times n$ kernel matrix and solving the quadratic program |
| Inference | $O(k \cdot d)$ — only support vectors participate ($k$ is typically $\ll n$) |
This makes SVMs expensive to train on large datasets but fast at inference — particularly when $k$ (number of support vectors) is small.
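A quick sketch (toy data assumed) of inspecting $k$, the number of support vectors, which is what inference cost actually depends on:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 2000
X = np.vstack([rng.normal(-2, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.array([-1] * (n // 2) + [1] * (n // 2))

clf = SVC(kernel="rbf", C=1.0).fit(X, y)
# When the classes separate cleanly, k stays far below n.
print(f"n = {n}, support vectors k = {len(clf.support_)}")
```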
Nu-SVM
An alternative parameterization replaces $C$ with $\nu \in (0, 1]$:
- $\nu$ is an upper bound on the fraction of margin errors
- $\nu$ is a lower bound on the fraction of support vectors
This makes $\nu$ more interpretable than $C$ for controlling sparsity and error budget.
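A sketch of the $\nu$ parameterization using scikit-learn's `NuSVC` (the value $\nu = 0.1$ and the synthetic data are illustrative choices): the fitted fraction of support vectors should come out at or above $\nu$, while the fraction of margin errors stays at or below it.

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([-1] * 200 + [1] * 200)

clf = NuSVC(nu=0.1, kernel="rbf").fit(X, y)
frac_sv = len(clf.support_) / len(X)
print(f"fraction of support vectors = {frac_sv:.2f} (>= nu = 0.1)")
```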
When SVMs Shine
- High-dimensional data: SVMs work well when the number of features exceeds the number of samples — they regularize via margin maximization even in very high-dimensional spaces
- Clear margin of separation: when classes are well-separated, SVMs find the global optimum margin exactly
- Custom similarity: when you have a domain-specific similarity function, the kernel trick makes SVMs the natural choice
When SVMs Struggle
- Large $n$: training complexity is $O(n^2)$ or worse — impractical above ~100K samples
- Heavy class overlap: the soft margin helps, but many points become support vectors and training converges slowly
- Output probabilities needed: SVMs do not natively output calibrated probabilities; Platt scaling or isotonic regression must be added post-training
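A hedged sketch of both calibration options, assuming scikit-learn and synthetic data: `SVC(probability=True)` fits Platt scaling internally, while `CalibratedClassifierCV` offers sigmoid (Platt) or isotonic calibration on top of any SVM.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)

platt = SVC(kernel="rbf", probability=True).fit(X, y)                        # Platt scaling
iso = CalibratedClassifierCV(SVC(kernel="rbf"), method="isotonic", cv=3).fit(X, y)

x_new = np.array([[0.2, -0.1]])
print(platt.predict_proba(x_new))
print(iso.predict_proba(x_new))
```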
Practical Decision Guide
| Situation | Recommendation |
|---|---|
| $n$ small, $d$ large | SVM with linear kernel |
| $n$ small/medium, $d$ small | SVM with RBF kernel |
| $n$ large | Logistic regression or gradient boosting |
| Need probability outputs | Add Platt scaling or switch to logistic regression |
| Custom similarity defined | SVM with custom kernel |
Note: logistic regression and a linear SVM optimize closely related surrogate losses (log loss vs. hinge loss) and usually yield similar decision boundaries on linearly separable data; the practical difference is implementation. On large datasets, logistic regression (or a linear SVM trained the same way) can be fit with stochastic gradient descent, which scales to millions of samples in a way that standard kernel SVM solvers do not.
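A closing sketch of that large-$n$ path (scikit-learn assumed; the loss names follow recent versions): `SGDClassifier` trains a linear SVM with `loss="hinge"` or logistic regression with `loss="log_loss"`, scaling roughly linearly in the number of samples.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
n = 100_000                                  # synthetic data, sized to show the SGD route
X = np.vstack([rng.normal(-1, 1, (n // 2, 20)), rng.normal(1, 1, (n // 2, 20))])
y = np.array([0] * (n // 2) + [1] * (n // 2))

linear_svm = SGDClassifier(loss="hinge").fit(X, y)      # linear SVM objective via SGD
logreg = SGDClassifier(loss="log_loss").fit(X, y)       # logistic regression via SGD
print(linear_svm.score(X, y), logreg.score(X, y))
```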