Writing machine learning
machine-learning 10 min read 3 January 2023

Clustering: Algorithms, Tradeoffs, and When to Use Each

A technical reference for the three main clustering families - density-based (DBSCAN), centroid-based (K-Means), and hierarchical - covering their mathematical foundations, hyperparameter selection, and failure modes.

Clustering looks deceptively simple. Group similar things together. Done. But “similar” is not a single concept. Two points can be close in Euclidean space and still belong to fundamentally different clusters. A dense ring surrounds a dense center. A crescent wraps around a blob. Three major algorithm families handle this problem with different assumptions - and fail in different ways. Know the failure modes before you pick the method.

Evaluating Clustering Quality: The Dunn Index

Before choosing an algorithm, you need a way to measure whether the result is actually good. The Dunn Index formalizes the intuition of tight clusters that are far apart:

$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d’(C_i)}$$

where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d’(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering. You want the closest inter-cluster gap to exceed the largest intra-cluster spread.

DBSCAN: Density-Based Clustering

The Core Idea

DBSCAN defines clusters as dense regions separated by sparse regions. Every point gets classified as one of three types:

A cluster is all points reachable from a core point through density-connected paths.

Algorithm

  1. For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
  2. Remove all noise points
  3. For each unassigned core point, create a new cluster and add all density-connected points
  4. Assign border points to their nearest core point’s cluster

Hyperparameter Selection

Complexity

When It Fails

DBSCAN breaks down when clusters have varying density. It uses a global Eps threshold, so a cluster less dense than the background vanishes into the noise class. The HDBSCAN variant solves this by constructing a hierarchy of density levels.

K-Means: Centroid-Based Clustering

The Optimization Problem

K-Means minimizes the total squared distance from each point to its cluster centroid:

$$\min_{{C_j}k} \sum{i=1}^k \sum_{x \in S_i} |x - C_i|^2$$

where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.

Lloyd’s Algorithm

  1. Initialize: pick $k$ random points as centroids
  2. Assign: each point to its nearest centroid
  3. Update: recompute centroids as the mean of assigned points
  4. Repeat until centroids change by less than a threshold

K-Means++: Smart Initialization

Final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid - ensuring the initial centroids are spread out across the data.

  1. Pick $C_1$ randomly from the data
  2. For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
  3. Sample the next centroid with probability proportional to $d_i$
  4. Repeat until $k$ centroids are chosen

Complexity

Failure Modes

Workaround: use a larger $k$ than needed, then merge similar clusters after the fact.

K-Medoids (PAM)

K-Medoids replaces the centroid (which may not be a real data point) with a medoid - an actual point in the dataset that minimizes within-cluster distance. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable and supports non-Euclidean distance metrics and kernels.

Choosing K: The Elbow Method

Plot total within-cluster variance versus $k$. As $k$ increases, variance decreases - but with diminishing returns. The elbow where the curve flattens is your $k$.

Hierarchical Clustering

Agglomerative Approach

Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram - a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.

Linkage Criteria

How you measure inter-cluster distance defines the resulting cluster shapes:

LinkageFormulaBehavior
Single-link (min)$\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$Chaining: merges based on closest pair; sensitive to outliers
Complete-link (max)$\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$Compact clusters; breaks with very different cluster sizes
Average-link$\frac{\sum \text{sim}(p_i, p_j)}{C_1
Ward’s methodMinimizes total within-cluster variance after mergeTends to produce similarly sized, compact clusters

Complexity

This is why hierarchical clustering is rarely used on large datasets. The memory requirement alone makes it impractical above roughly 10,000 points.

Choosing the Right Algorithm

ScenarioAlgorithm
Unknown cluster shape, noisy dataDBSCAN
Large dataset, globular clusters expectedK-Means++
Need interpretable medoid representativesK-Medoids
Need full dendrogram or no predetermined KAgglomerative
Very large datasetK-Means (most scalable)
clustering unsupervised-learning dbscan k-means ml
← All articles More in Machine Learning →

Have a problem worth solving?

Whether you need a quantitative researcher, a Machine Learning systems builder, or a technical advisor, I take a small number of consulting engagements at a time.

Book a call →