Why Clustering Is Hard
Clustering is the task of grouping data points such that points within a group are more similar to each other than to points outside. The difficulty: “similar” is not a single thing. Two points can be close in Euclidean space but belong to fundamentally different clusters (a dense ring vs. a dense center). Three major algorithm families handle this problem with different assumptions — and fail in different ways.
Evaluating Clustering Quality: The Dunn Index
Before choosing an algorithm, you need a way to measure whether the result is good. The Dunn Index formalizes the intuition of “tight clusters, far apart”:
$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d'(C_i)}$$
where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d'(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering — you want the closest inter-cluster gap to exceed the largest intra-cluster spread.
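As a concrete reference, here is a minimal NumPy/SciPy sketch that assumes the common convention of single-link (closest-pair) inter-cluster distance and diameter as the largest pairwise intra-cluster distance; the function name and those two choices are illustrative, not the only valid ones.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster diameter.

    Assumes closest-pair inter-cluster distance and
    diameter = largest pairwise distance within a cluster.
    """
    clusters = [X[labels == c] for c in np.unique(labels)]

    # Denominator: largest intra-cluster spread
    max_diameter = max(cdist(c, c).max() for c in clusters)

    # Numerator: smallest gap between any two clusters
    min_gap = min(
        cdist(ci, cj).min()
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
    )
    return min_gap / max_diameter
```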
DBSCAN: Density-Based Clustering
The Core Idea
DBSCAN defines clusters as dense regions separated by sparse regions. It classifies every point as one of three types:
- Core point: has ≥ MinPoints neighbors within radius Eps
- Border point: within Eps of a core point but doesn’t have enough neighbors to be core itself
- Noise point: neither core nor border — treated as an outlier
A cluster is all points reachable from a core point through density-connected paths.
Algorithm
- For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
- Remove all noise points
- For each unassigned core point, create a new cluster and add all density-connected points
- Assign border points to their nearest core point’s cluster
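A minimal usage sketch with scikit-learn's `DBSCAN`; the `eps` and `min_samples` values and the `make_moons` data are placeholders, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus noise -- a shape K-Means cannot separate
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps / min_samples are illustrative
labels = db.labels_                          # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {list(labels).count(-1)}")
```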
Hyperparameter Selection
- MinPoints: start at $2 \times \text{dimensionality}$. Increase for noisy data.
- Eps: for each point, compute $d_i$ = distance to its MinPoints-th nearest neighbor. Sort these values and plot them — the “elbow” in this plot is the right Eps (sketched below).
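A sketch of that k-distance heuristic using scikit-learn's `NearestNeighbors`; the helper name is illustrative, and reading the elbow off the sorted curve remains a judgment call.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, min_points):
    # +1 because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=min_points + 1).fit(X)
    distances, _ = nn.kneighbors(X)

    # d_i = distance to the MinPoints-th nearest neighbor, sorted ascending
    k_dist = np.sort(distances[:, -1])

    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {min_points}-th nearest neighbor")
    plt.show()
    return k_dist   # pick Eps near the elbow of this curve
```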
Complexity
- Time: $O(n \log n)$ with KD-Tree indexing
- Space: $O(n)$
When It Fails
DBSCAN struggles when clusters have varying density: because it uses a single global Eps threshold, points in a sparse cluster whose neighbors lie farther apart than Eps get labeled as noise, while raising Eps enough to capture them can merge the denser clusters together. The HDBSCAN variant addresses this by constructing a hierarchy of density levels and extracting clusters at different densities.
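If varying density is expected, a sketch along these lines may be a better starting point (assuming `sklearn.cluster.HDBSCAN`, available in recent scikit-learn releases; the standalone `hdbscan` package exposes a very similar interface, and the dataset and `min_cluster_size` below are placeholders):

```python
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# Two blobs with very different spreads -- no single Eps fits both well
X, _ = make_blobs(n_samples=[300, 300], centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.5], random_state=0)

# min_cluster_size replaces the global Eps; clusters are extracted from a
# hierarchy of density levels, so regions of different density can coexist
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)   # -1 still marks noise
```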
K-Means: Centroid-Based Clustering
The Optimization Problem
K-Means minimizes the total squared distance from each point to its cluster centroid:
$$\min_{\{C_i\}_{i=1}^{k}} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - C_i\|^2$$
where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.
Lloyd’s Algorithm
- Initialize: pick $k$ random points as centroids
- Assign: each point to its nearest centroid
- Update: recompute centroids as the mean of assigned points
- Repeat until centroids change by less than a threshold
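A compact NumPy sketch of Lloyd's algorithm with plain random initialization; the empty-cluster handling, tolerance, and iteration cap are simplifying assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)

    # Initialize: pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign: each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return centroids, labels
```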
K-Means++: Smart Initialization
The final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid — ensuring the initial centroids are spread out across the data:
- Pick $C_1$ randomly from the data
- For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
- Sample the next centroid with probability proportional to $d_i$
- Repeat until $k$ centroids are chosen
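A sketch of the seeding step on its own; it can replace the random-initialization line in the Lloyd sketch above.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: first centroid chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]

    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)

        # Sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])

    return np.array(centroids)
```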
Complexity
- Training: $O(nkdi)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $i$ = iterations
- In practice close to $O(nd)$, since $k$ and $i$ are small constants relative to $n$
Failure Modes
- Outlier sensitivity: a single outlier can pull a centroid far from the cluster center
- Assumes globular clusters: K-Means partitions space into Voronoi cells, which are convex by construction — it cannot find ring-shaped, crescent, or other non-convex clusters
- Assumes equal cluster size and density: nearest-centroid assignment tends to split large, diffuse clusters and carve off pieces of them for small, dense ones, so clusters of very different size or density are mishandled
Workaround: use a larger $k$ than needed, then merge similar clusters post-hoc.
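To see the convexity limitation concretely, a small comparison on ring-shaped data; the dataset and the `eps` value are illustrative, and exact scores will vary with them.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# A ring around a dense center: two non-convex clusters
X, y_true = make_circles(n_samples=600, factor=0.4, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# K-Means can only cut the plane into convex (Voronoi) cells;
# DBSCAN follows the density and recovers the two rings
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))
```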
K-Medoids (PAM)
K-Medoids replaces the centroid (which may not be a real data point) with a medoid — an actual point in the dataset that minimizes total distance to the other points in its cluster. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable (each cluster is represented by a real example), and since only pairwise distances are needed, it supports non-Euclidean distance metrics and kernels.
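A minimal PAM-style sketch over a precomputed distance matrix, which is what lets arbitrary metrics plug in; real implementations add smarter candidate selection and caching.

```python
import numpy as np

def pam(D, k, seed=0):
    """Greedy PAM sketch. D is an (n, n) precomputed distance matrix,
    so any metric (or kernel-induced distance) can be plugged in."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Every point pays the distance to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = candidate
                # Keep the swap only if it lowers the total loss
                if total_cost(trial) < total_cost(medoids):
                    medoids = trial
                    improved = True

    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```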
Choosing K: The Elbow Method
Plot the total within-cluster variance vs. $k$. As $k$ increases, variance decreases — but with diminishing returns. The “elbow” where the curve flattens is the appropriate $k$.
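A sketch of the elbow plot using scikit-learn's `inertia_` attribute (the total within-cluster sum of squares); the data and the range of k are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("total within-cluster variance (inertia)")
plt.show()   # look for the k where the curve bends and flattens
```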
Hierarchical Clustering
Agglomerative Approach
Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram — a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.
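A sketch with SciPy's hierarchy module: build the full merge history once, then cut it at whatever granularity is needed; Ward linkage and three clusters are arbitrary choices here.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Full agglomerative merge history (Ward linkage here)
Z = linkage(X, method="ward")

dendrogram(Z)       # the tree of all merges; cut height <=> number of clusters
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut so exactly 3 clusters remain
```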
Linkage Criteria
How you measure inter-cluster distance defines the resulting cluster shapes:
| Linkage | Formula | Behavior |
|---|---|---|
| Single-link (min) | $\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Chaining: merges based on closest pair; sensitive to outliers |
| Complete-link (max) | $\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Compact clusters; breaks with very different cluster sizes |
| Average-link | $\frac{\sum_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)}{\lvert C_1 \rvert \, \lvert C_2 \rvert}$ | Averages over all pairs; a compromise between single- and complete-link, less sensitive to outliers |
| Ward’s method | Minimizes total within-cluster variance after merge | Tends to produce similarly-sized, compact clusters |
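In practice the linkage criterion is just a parameter; a quick sketch comparing the four on the same data (the two-moons dataset is chosen precisely because single-link's chaining behaves differently from the compact criteria, though results depend on the noise level):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    # Single-link chains along each crescent and tends to recover the moons;
    # the compact criteria (complete, average, ward) tend to cut them in half.
    print(link, np.bincount(labels))
```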
Complexity
- Time: $O(n^2 \log n)$ to $O(n^3)$
- Space: $O(n^2)$ for the proximity matrix
This is why hierarchical clustering is rarely used on large datasets — the memory requirement alone makes it impractical above ~10,000 points.
Choosing the Right Algorithm
| Scenario | Algorithm |
|---|---|
| Unknown cluster shape, noisy data | DBSCAN |
| Large dataset, globular clusters expected | K-Means++ |
| Need interpretable medoid representatives | K-Medoids |
| Need full dendrogram / no predetermined K | Agglomerative |
| Very large dataset | K-Means (most scalable) |