Clustering looks deceptively simple. Group similar things together. Done. But “similar” is not a single concept. Two points can be close in Euclidean space and still belong to fundamentally different clusters. A dense ring surrounds a dense center. A crescent wraps around a blob. Three major algorithm families handle this problem with different assumptions - and fail in different ways. Know the failure modes before you pick the method.
Evaluating Clustering Quality: The Dunn Index
Before choosing an algorithm, you need a way to measure whether the result is actually good. The Dunn Index formalizes the intuition of tight clusters that are far apart:
$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d’(C_i)}$$
where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d’(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering. You want the closest inter-cluster gap to exceed the largest intra-cluster spread.
DBSCAN: Density-Based Clustering
The Core Idea
DBSCAN defines clusters as dense regions separated by sparse regions. Every point gets classified as one of three types:
- Core point: has at least
MinPointsneighbors within radiusEps - Border point: within
Epsof a core point but lacks enough neighbors to be core itself - Noise point: neither core nor border - treated as an outlier
A cluster is all points reachable from a core point through density-connected paths.
Algorithm
- For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
- Remove all noise points
- For each unassigned core point, create a new cluster and add all density-connected points
- Assign border points to their nearest core point’s cluster
Hyperparameter Selection
- MinPoints: start at $2 \times \text{dimensionality}$. Increase for noisy data.
- Eps: for each point, compute $d_i$ = distance to the
MinPoints-th nearest neighbor. Sort these values and plot them. The elbow in this plot is the rightEps.
Complexity
- Time: $O(n \log n)$ with KD-Tree indexing
- Space: $O(n)$
When It Fails
DBSCAN breaks down when clusters have varying density. It uses a global Eps threshold, so a cluster less dense than the background vanishes into the noise class. The HDBSCAN variant solves this by constructing a hierarchy of density levels.
K-Means: Centroid-Based Clustering
The Optimization Problem
K-Means minimizes the total squared distance from each point to its cluster centroid:
$$\min_{{C_j}k} \sum{i=1}^k \sum_{x \in S_i} |x - C_i|^2$$
where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.
Lloyd’s Algorithm
- Initialize: pick $k$ random points as centroids
- Assign: each point to its nearest centroid
- Update: recompute centroids as the mean of assigned points
- Repeat until centroids change by less than a threshold
K-Means++: Smart Initialization
Final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid - ensuring the initial centroids are spread out across the data.
- Pick $C_1$ randomly from the data
- For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
- Sample the next centroid with probability proportional to $d_i$
- Repeat until $k$ centroids are chosen
Complexity
- Training: $O(nkdi)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $i$ = iterations
- Practically $O(nd)$ since $k$ and $i$ are small
Failure Modes
- Outlier sensitivity: a single outlier pulls a centroid far from the true cluster center
- Assumes globular clusters: K-Means partitions space into Voronoi cells, which are convex by construction. It cannot find ring-shaped, crescent, or other nonconvex clusters
- Assumes equal cluster size and density: K-Means tries to create equal-sized clusters. Very different-sized clusters get mishandled
Workaround: use a larger $k$ than needed, then merge similar clusters after the fact.
K-Medoids (PAM)
K-Medoids replaces the centroid (which may not be a real data point) with a medoid - an actual point in the dataset that minimizes within-cluster distance. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable and supports non-Euclidean distance metrics and kernels.
Choosing K: The Elbow Method
Plot total within-cluster variance versus $k$. As $k$ increases, variance decreases - but with diminishing returns. The elbow where the curve flattens is your $k$.
Hierarchical Clustering
Agglomerative Approach
Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram - a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.
Linkage Criteria
How you measure inter-cluster distance defines the resulting cluster shapes:
| Linkage | Formula | Behavior |
|---|---|---|
| Single-link (min) | $\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Chaining: merges based on closest pair; sensitive to outliers |
| Complete-link (max) | $\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Compact clusters; breaks with very different cluster sizes |
| Average-link | $\frac{\sum \text{sim}(p_i, p_j)}{ | C_1 |
| Ward’s method | Minimizes total within-cluster variance after merge | Tends to produce similarly sized, compact clusters |
Complexity
- Time: $O(n^2 \log n)$ to $O(n^3)$
- Space: $O(n^2)$ for the proximity matrix
This is why hierarchical clustering is rarely used on large datasets. The memory requirement alone makes it impractical above roughly 10,000 points.
Choosing the Right Algorithm
| Scenario | Algorithm |
|---|---|
| Unknown cluster shape, noisy data | DBSCAN |
| Large dataset, globular clusters expected | K-Means++ |
| Need interpretable medoid representatives | K-Medoids |
| Need full dendrogram or no predetermined K | Agglomerative |
| Very large dataset | K-Means (most scalable) |