Why Clustering Is Hard
Clustering is the task of grouping data points such that points within a group are more similar to each other than to points outside. The difficulty: “similar” is not a single thing. Two points can be close in Euclidean space but belong to fundamentally different clusters (a dense ring vs. a dense center). Three major algorithm families handle this problem with different assumptions — and fail in different ways.
Evaluating Clustering Quality: The Dunn Index
Before choosing an algorithm, you need a way to measure whether the result is good. The Dunn Index formalizes the intuition of “tight clusters, far apart”:
$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d'(C_i)}$$
where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d'(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering — you want the closest inter-cluster gap to exceed the largest intra-cluster spread.
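As a concrete reference, here is a minimal NumPy/SciPy sketch that assumes the common convention of single-link (closest-pair) inter-cluster distance and diameter as the largest pairwise intra-cluster distance; the function name and those two choices are illustrative, not the only valid ones.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn Index = min inter-cluster distance / max intra-cluster diameter.

    Assumes closest-pair inter-cluster distance and
    diameter = largest pairwise distance within a cluster.
    """
    clusters = [X[labels == c] for c in np.unique(labels)]

    # Denominator: largest intra-cluster spread
    max_diameter = max(cdist(c, c).max() for c in clusters)

    # Numerator: smallest gap between any two clusters
    min_gap = min(
        cdist(ci, cj).min()
        for i, ci in enumerate(clusters)
        for cj in clusters[i + 1:]
    )
    return min_gap / max_diameter
```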
DBSCAN: Density-Based Clustering
The Core Idea
DBSCAN defines clusters as dense regions separated by sparse regions. It classifies every point as one of three types:
- Core point: has ≥ MinPoints neighbors within radius Eps
- Border point: within Eps of a core point but doesn’t have enough neighbors to be core itself
- Noise point: neither core nor border — treated as an outlier
A cluster is all points reachable from a core point through density-connected paths.
Algorithm
- For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
- Remove all noise points
- For each unassigned core point, create a new cluster and add all density-connected points
- Assign border points to their nearest core point’s cluster
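A minimal usage sketch with scikit-learn's `DBSCAN`; the `eps` and `min_samples` values and the `make_moons` data are placeholders, not recommendations.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus noise -- a shape K-Means cannot separate
X, _ = make_moons(n_samples=500, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps / min_samples are illustrative
labels = db.labels_                          # label -1 marks noise points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {list(labels).count(-1)}")
```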
Hyperparameter Selection
- MinPoints: start at $2 \times \text{dimensionality}$. Increase for noisy data.
- Eps: for each point, compute $d_i$ = distance to its MinPoints-th nearest neighbor. Sort these values and plot them — the “elbow” in this plot is the right Eps (sketched below).
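A sketch of that k-distance heuristic using scikit-learn's `NearestNeighbors`; the helper name is illustrative, and reading the elbow off the sorted curve remains a judgment call.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, min_points):
    # +1 because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=min_points + 1).fit(X)
    distances, _ = nn.kneighbors(X)

    # d_i = distance to the MinPoints-th nearest neighbor, sorted ascending
    k_dist = np.sort(distances[:, -1])

    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {min_points}-th nearest neighbor")
    plt.show()
    return k_dist   # pick Eps near the elbow of this curve
```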
Complexity
- Time: $O(n \log n)$ with KD-Tree indexing
- Space: $O(n)$
When It Fails
DBSCAN struggles when clusters have varying density: because it uses a single global Eps threshold, points in a sparse cluster whose neighbors lie farther apart than Eps get labeled as noise, while raising Eps enough to capture them can merge the denser clusters together. The HDBSCAN variant addresses this by constructing a hierarchy of density levels and extracting clusters at different densities.
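If varying density is expected, a sketch along these lines may be a better starting point (assuming `sklearn.cluster.HDBSCAN`, available in recent scikit-learn releases; the standalone `hdbscan` package exposes a very similar interface, and the dataset and `min_cluster_size` below are placeholders):

```python
from sklearn.cluster import HDBSCAN
from sklearn.datasets import make_blobs

# Two blobs with very different spreads -- no single Eps fits both well
X, _ = make_blobs(n_samples=[300, 300], centers=[[0, 0], [6, 6]],
                  cluster_std=[0.3, 1.5], random_state=0)

# min_cluster_size replaces the global Eps; clusters are extracted from a
# hierarchy of density levels, so regions of different density can coexist
labels = HDBSCAN(min_cluster_size=10).fit_predict(X)   # -1 still marks noise
```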
K-Means: Centroid-Based Clustering
The Optimization Problem
K-Means minimizes the total squared distance from each point to its cluster centroid:
$$\min_{\{C_i\}_{i=1}^{k}} \sum_{i=1}^{k} \sum_{x \in S_i} \|x - C_i\|^2$$
where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.
Lloyd’s Algorithm
- Initialize: pick $k$ random points as centroids
- Assign: each point to its nearest centroid
- Update: recompute centroids as the mean of assigned points
- Repeat until centroids change by less than a threshold
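A compact NumPy sketch of Lloyd's algorithm with plain random initialization; the empty-cluster handling, tolerance, and iteration cap are simplifying assumptions.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)

    # Initialize: pick k random data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Assign: each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Update: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Stop once centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return centroids, labels
```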
K-Means++: Smart Initialization
The final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid — ensuring the initial centroids are spread out across the data:
- Pick $C_1$ randomly from the data
- For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
- Sample the next centroid with probability proportional to $d_i$
- Repeat until $k$ centroids are chosen
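A sketch of the seeding step on its own; it can replace the random-initialization line in the Lloyd sketch above.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: first centroid chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]

    while len(centroids) < k:
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)

        # Sample the next centroid with probability proportional to d2
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])

    return np.array(centroids)
```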
Complexity
- Training: $O(nkdi)$ where $n$ = points, $k$ = clusters, $d$ = dimensions, $i$ = iterations
- In practice close to $O(nd)$, since $k$ and $i$ are small constants relative to $n$
Failure Modes
- Outlier sensitivity: a single outlier can pull a centroid far from the cluster center
- Assumes globular clusters: K-Means partitions space into Voronoi cells, which are convex by construction — it cannot find ring-shaped, crescent, or other non-convex clusters
- Assumes equal cluster size and density: nearest-centroid assignment tends to split large, diffuse clusters and carve off pieces of them for small, dense ones, so clusters of very different size or density are mishandled
Workaround: use a larger $k$ than needed, then merge similar clusters post-hoc.
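To see the convexity limitation concretely, a small comparison on ring-shaped data; the dataset and the `eps` value are illustrative, and exact scores will vary with them.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# A ring around a dense center: two non-convex clusters
X, y_true = make_circles(n_samples=600, factor=0.4, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)

# K-Means can only cut the plane into convex (Voronoi) cells;
# DBSCAN follows the density and recovers the two rings
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))
```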
K-Medoids (PAM)
K-Medoids replaces the centroid (which may not be a real data point) with a medoid — an actual point in the dataset that minimizes total distance to the other points in its cluster. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable (each cluster is represented by a real example), and since only pairwise distances are needed, it supports non-Euclidean distance metrics and kernels.
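A minimal PAM-style sketch over a precomputed distance matrix, which is what lets arbitrary metrics plug in; real implementations add smarter candidate selection and caching.

```python
import numpy as np

def pam(D, k, seed=0):
    """Greedy PAM sketch. D is an (n, n) precomputed distance matrix,
    so any metric (or kernel-induced distance) can be plugged in."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = list(rng.choice(n, size=k, replace=False))

    def total_cost(meds):
        # Every point pays the distance to its nearest medoid
        return D[:, meds].min(axis=1).sum()

    improved = True
    while improved:
        improved = False
        for mi in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = candidate
                # Keep the swap only if it lowers the total loss
                if total_cost(trial) < total_cost(medoids):
                    medoids = trial
                    improved = True

    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```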
Choosing K: The Elbow Method
Plot the total within-cluster variance vs. $k$. As $k$ increases, variance decreases — but with diminishing returns. The “elbow” where the curve flattens is the appropriate $k$.
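A sketch of the elbow plot using scikit-learn's `inertia_` attribute (the total within-cluster sum of squares); the data and the range of k are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=4, random_state=0)

ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k")
plt.ylabel("total within-cluster variance (inertia)")
plt.show()   # look for the k where the curve bends and flattens
```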
Hierarchical Clustering
Agglomerative Approach
Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram — a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.
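A sketch with SciPy's hierarchy module: build the full merge history once, then cut it at whatever granularity is needed; Ward linkage and three clusters are arbitrary choices here.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Full agglomerative merge history (Ward linkage here)
Z = linkage(X, method="ward")

dendrogram(Z)       # the tree of all merges; cut height <=> number of clusters
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut so exactly 3 clusters remain
```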
Linkage Criteria
How you measure inter-cluster distance defines the resulting cluster shapes:
| Linkage | Formula | Behavior |
|---|---|---|
| Single-link (min) | $\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Chaining: merges based on closest pair; sensitive to outliers |
| Complete-link (max) | $\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Compact clusters; breaks with very different cluster sizes |
| Average-link | $\frac{\sum_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)}{\lvert C_1 \rvert \, \lvert C_2 \rvert}$ | Averages over all pairs; a compromise between single- and complete-link, less sensitive to outliers |
| Ward’s method | Minimizes total within-cluster variance after merge | Tends to produce similarly-sized, compact clusters |
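In practice the linkage criterion is just a parameter; a quick sketch comparing the four on the same data (the two-moons dataset is chosen precisely because single-link's chaining behaves differently from the compact criteria, though results depend on the noise level):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

for link in ("single", "complete", "average", "ward"):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(X)
    # Single-link chains along each crescent and tends to recover the moons;
    # the compact criteria (complete, average, ward) tend to cut them in half.
    print(link, np.bincount(labels))
```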
Complexity
- Time: $O(n^2 \log n)$ to $O(n^3)$
- Space: $O(n^2)$ for the proximity matrix
This is why hierarchical clustering is rarely used on large datasets — the memory requirement alone makes it impractical above ~10,000 points.
Choosing the Right Algorithm
| Scenario | Algorithm |
|---|---|
| Unknown cluster shape, noisy data | DBSCAN |
| Large dataset, globular clusters expected | K-Means++ |
| Need interpretable medoid representatives | K-Medoids |
| Need full dendrogram / no predetermined K | Agglomerative |
| Very large dataset | K-Means (most scalable) |