Clustering: Algorithms, Tradeoffs, and When to Use Each

Clustering looks deceptively simple. Group similar things together. Done. But “similar” is not a single concept. Two points can be close in Euclidean space and still belong to fundamentally different clusters. A dense ring surrounds a dense center. A crescent wraps around a blob. Three major algorithm families handle this problem with different assumptions - and fail in different ways. Know the failure modes before you pick the method.

Evaluating Clustering Quality: The Dunn Index

Before choosing an algorithm, you need a way to measure whether the result is actually good. The Dunn Index formalizes the intuition of tight clusters that are far apart:

$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d’(C_i)}$$

where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d’(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering. You want the closest inter-cluster gap to exceed the largest intra-cluster spread.

DBSCAN: Density-Based Clustering

The Core Idea

DBSCAN defines clusters as dense regions separated by sparse regions. Every point gets classified as one of three types:

Core point: has at least MinPoints neighbors within radius Eps
Border point: within Eps of a core point but lacks enough neighbors to be core itself
Noise point: neither core nor border - treated as an outlier

A cluster is all points reachable from a core point through density-connected paths.

Algorithm

For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
Remove all noise points
For each unassigned core point, create a new cluster and add all density-connected points
Assign border points to their nearest core point’s cluster

Hyperparameter Selection

MinPoints: start at $2 \times \text{dimensionality}$. Increase for noisy data.
Eps: for each point, compute $d_i$ = distance to the MinPoints-th nearest neighbor. Sort these values and plot them. The elbow in this plot is the right Eps.

Complexity

Time: $O(n \log n)$ with KD-Tree indexing
Space: $O(n)$

When It Fails

DBSCAN breaks down when clusters have varying density. It uses a global Eps threshold, so a cluster less dense than the background vanishes into the noise class. The HDBSCAN variant solves this by constructing a hierarchy of density levels.

K-Means: Centroid-Based Clustering

The Optimization Problem

K-Means minimizes the total squared distance from each point to its cluster centroid:

$$\min_{{C_j}k} \sum{i=1}^k \sum_{x \in S_i} |x - C_i|^2$$

where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.

Lloyd’s Algorithm

Initialize: pick $k$ random points as centroids
Assign: each point to its nearest centroid
Update: recompute centroids as the mean of assigned points
Repeat until centroids change by less than a threshold

K-Means++: Smart Initialization

Final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid - ensuring the initial centroids are spread out across the data.

Pick $C_1$ randomly from the data
For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
Sample the next centroid with probability proportional to $d_i$
Repeat until $k$ centroids are chosen

Complexity

Training: $O(nkdi)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $i$ = iterations
Practically $O(nd)$ since $k$ and $i$ are small

Failure Modes

Outlier sensitivity: a single outlier pulls a centroid far from the true cluster center
Assumes globular clusters: K-Means partitions space into Voronoi cells, which are convex by construction. It cannot find ring-shaped, crescent, or other nonconvex clusters
Assumes equal cluster size and density: K-Means tries to create equal-sized clusters. Very different-sized clusters get mishandled

Workaround: use a larger $k$ than needed, then merge similar clusters after the fact.

K-Medoids (PAM)

K-Medoids replaces the centroid (which may not be a real data point) with a medoid - an actual point in the dataset that minimizes within-cluster distance. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable and supports non-Euclidean distance metrics and kernels.

Choosing K: The Elbow Method

Plot total within-cluster variance versus $k$. As $k$ increases, variance decreases - but with diminishing returns. The elbow where the curve flattens is your $k$.

Hierarchical Clustering

Agglomerative Approach

Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram - a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.

Linkage Criteria

How you measure inter-cluster distance defines the resulting cluster shapes:

Linkage	Formula	Behavior
Single-link (min)	$\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$	Chaining: merges based on closest pair; sensitive to outliers
Complete-link (max)	$\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$	Compact clusters; breaks with very different cluster sizes
Average-link	$\frac{\sum \text{sim}(p_i, p_j)}{	C_1
Ward’s method	Minimizes total within-cluster variance after merge	Tends to produce similarly sized, compact clusters

Complexity

Time: $O(n^2 \log n)$ to $O(n^3)$
Space: $O(n^2)$ for the proximity matrix

This is why hierarchical clustering is rarely used on large datasets. The memory requirement alone makes it impractical above roughly 10,000 points.

Choosing the Right Algorithm

Scenario	Algorithm
Unknown cluster shape, noisy data	DBSCAN
Large dataset, globular clusters expected	K-Means++
Need interpretable medoid representatives	K-Medoids
Need full dendrogram or no predetermined K	Agglomerative
Very large dataset	K-Means (most scalable)

Clustering: Algorithms, Tradeoffs, and When to Use Each

Evaluating Clustering Quality: The Dunn Index

DBSCAN: Density-Based Clustering

The Core Idea

Algorithm

Hyperparameter Selection

Complexity

When It Fails

K-Means: Centroid-Based Clustering

The Optimization Problem

Lloyd’s Algorithm

K-Means++: Smart Initialization

Complexity

Failure Modes

K-Medoids (PAM)

Choosing K: The Elbow Method

Hierarchical Clustering

Agglomerative Approach

Linkage Criteria

Complexity

Choosing the Right Algorithm

Have a problem worth solving?