
Clustering: Algorithms, Tradeoffs, and When to Use Each

A technical reference for the three main clustering families — density-based (DBSCAN), centroid-based (K-Means), and hierarchical — covering their mathematical foundations, hyperparameter selection, and failure modes.

Why Clustering Is Hard

Clustering is the task of grouping data points such that points within a group are more similar to each other than to points outside. The difficulty: “similar” is not a single thing. Two points can be close in Euclidean space but belong to fundamentally different clusters (a dense ring vs. a dense center). Three major algorithm families handle this problem with different assumptions — and fail in different ways.

Evaluating Clustering Quality: The Dunn Index

Before choosing an algorithm, you need a way to measure whether the result is good. The Dunn Index formalizes the intuition of “tight clusters, far apart”:

$$\text{Dunn Index} = \frac{\min_{1 \le i < j \le k} d(C_i, C_j)}{\max_{1 \le i \le k} d'(C_i)}$$

where $d(C_i, C_j)$ is the inter-cluster distance between clusters $i$ and $j$, and $d'(C_i)$ is the intra-cluster diameter of cluster $i$. Larger Dunn Index = better clustering — you want the closest inter-cluster gap to exceed the largest intra-cluster spread.
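
Both $d$ and $d'$ admit several concrete definitions. The sketch below is one choice, not canonical: the closest pair of points between clusters for $d$, the maximum pairwise distance within a cluster for $d'$:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Dunn index = min inter-cluster distance / max intra-cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Numerator: closest pair of points lying in two different clusters.
    min_inter = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    # Denominator: largest pairwise distance inside any single cluster.
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    return min_inter / max_diam
```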

DBSCAN: Density-Based Clustering

The Core Idea

DBSCAN defines clusters as dense regions separated by sparse regions. It classifies every point as one of three types:

  - Core point: has at least MinPts neighbors within radius Eps
  - Border point: lies within Eps of a core point but is not itself core
  - Noise point: neither core nor border

A cluster is all points reachable from a core point through density-connected paths.

Algorithm

  1. For each point $x_i$, label it as core, border, or noise using range queries built on a KD-Tree
  2. Remove all noise points
  3. For each unassigned core point, create a new cluster and add all density-connected points
  4. Assign border points to their nearest core point’s cluster
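
In practice you rarely hand-roll this. A minimal scikit-learn sketch on synthetic data (the blob layout and parameter values are arbitrary); noise points come back labeled -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two dense blobs plus sparse uniform background noise.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),
    rng.normal(4.0, 0.3, size=(100, 2)),
    rng.uniform(-2.0, 6.0, size=(30, 2)),
])

# eps is the Eps radius, min_samples the MinPts threshold.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(np.unique(labels, return_counts=True))  # -1 = noise
```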

Hyperparameter Selection

DBSCAN takes two hyperparameters: Eps, the neighborhood radius, and MinPts, the minimum neighborhood size for a point to count as core. A common heuristic is to fix MinPts first (small values such as 4–5, or roughly twice the dimensionality) and then read Eps off a k-distance plot: sort every point’s distance to its MinPts-th nearest neighbor and set Eps near the elbow of that curve.
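
A minimal sketch of that k-distance curve, assuming scikit-learn is available (the function name and defaults are my own):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distance_curve(X, min_pts=5):
    """Sorted distance of every point to its min_pts-th nearest neighbor."""
    # n_neighbors includes the query point itself (distance 0), hence the +1.
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
    dists, _ = nn.kneighbors(X)   # distances per row, ascending
    return np.sort(dists[:, -1])  # plot this; set Eps near the elbow
```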

Complexity

With a spatial index such as the KD-Tree above, each Eps range query averages $O(\log n)$, giving $O(n \log n)$ overall. Without an index, or in high dimensions where KD-Trees degrade, each query is $O(n)$ and the algorithm falls back to $O(n^2)$.

When It Fails

DBSCAN struggles when clusters have varying density — it uses a global Eps threshold, so a cluster that is less dense than the background will appear as noise. The HDBSCAN variant addresses this by constructing a hierarchy of density levels.

K-Means: Centroid-Based Clustering

The Optimization Problem

K-Means minimizes the total squared distance from each point to its cluster centroid:

$$\min_{S_1, \dots, S_k} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - C_i \rVert^2$$

where $C_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$ is the centroid of cluster $i$.

Lloyd’s Algorithm

  1. Initialize: pick $k$ random points as centroids
  2. Assign: each point to its nearest centroid
  3. Update: recompute centroids as the mean of assigned points
  4. Repeat until centroids change by less than a threshold
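
A minimal NumPy sketch of these four steps (a sketch only; the one production concern it handles is letting an empty cluster keep its previous centroid):

```python
import numpy as np

def lloyd_kmeans(X, k, tol=1e-6, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k random points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign: each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute centroids as the mean of assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop once the centroids barely move.
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return centroids, labels
```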

K-Means++: Smart Initialization

The final clusters depend heavily on initialization. K-Means++ selects initial centroids with probability proportional to their squared distance from the nearest existing centroid — ensuring the initial centroids are spread out across the data:

  1. Pick $C_1$ randomly from the data
  2. For each remaining point $x_i$, compute $d_i = \text{dist}(x_i, \text{nearest centroid})^2$
  3. Sample the next centroid with probability proportional to $d_i$
  4. Repeat until $k$ centroids are chosen
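
The seeding step, sketched in NumPy (the function name is my own); plugging its output into the Lloyd sketch above replaces the random initialization:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Pick the first centroid uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # 2. Squared distance from each point to its nearest chosen centroid.
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.asarray(centroids)[None, :, :],
                           axis=2) ** 2,
            axis=1,
        )
        # 3. Sample the next centroid with probability proportional to d2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.asarray(centroids)
```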

Complexity

Each Lloyd iteration costs $O(nkd)$ for $n$ points in $d$ dimensions (every point is compared against every centroid), so a run of $i$ iterations is $O(nkdi)$. Since $i$ is typically small, K-Means is by far the most scalable of the three families.

Failure Modes

K-Means assumes roughly spherical, similarly sized clusters: minimizing squared Euclidean distance partitions space into convex Voronoi cells around the centroids. It therefore fails on elongated or non-convex shapes (the dense ring around a dense center from the introduction), on clusters of very different sizes or densities, and in the presence of outliers, which drag the mean. Workaround: use a larger $k$ than needed, then merge similar clusters post-hoc.

K-Medoids (PAM)

K-Medoids replaces the centroid (which may not be a real data point) with a medoid — an actual point in the dataset that minimizes within-cluster distance. The update step swaps each medoid with a non-medoid point and keeps the swap if it reduces total loss. This makes K-Medoids more interpretable and supports non-Euclidean distance metrics and kernels.
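
The swap step is simple enough to sketch directly. This naive version (my own, roughly $O(kn^2)$ per pass) takes a precomputed distance matrix, which is exactly what lets it work with arbitrary metrics:

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """Naive PAM over a precomputed n-by-n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)

    def cost(meds):
        # Every point is charged its distance to the nearest medoid.
        return D[:, meds].min(axis=1).sum()

    best = cost(medoids)
    for _ in range(max_iter):
        improved = False
        # Try swapping each medoid with each non-medoid point.
        for i in range(k):
            for p in range(n):
                if p in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = p
                c = cost(trial)
                if c < best:
                    medoids, best, improved = trial, c, True
        if not improved:  # local optimum: no swap helps
            break
    return medoids, D[:, medoids].argmin(axis=1)
```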

Choosing K: The Elbow Method

Plot the total within-cluster variance vs. $k$. As $k$ increases, variance decreases — but with diminishing returns. The “elbow” where the curve flattens is the appropriate $k$.
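
A minimal scikit-learn sketch on placeholder data; `inertia_` is scikit-learn's name for the total within-cluster sum of squared distances:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 2))  # placeholder data

inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in range(1, 11)
]
# Plot inertias against k; pick the k where the curve stops dropping sharply.
```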

Hierarchical Clustering

Agglomerative Approach

Start with each point as its own cluster and iteratively merge the two most similar clusters until one remains. The result is a dendrogram — a tree structure showing the full merge history at every level of granularity. You choose the number of clusters by cutting the dendrogram at a desired height.
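
SciPy covers this whole workflow. A minimal sketch on placeholder data, with the linkage criterion (next subsection) passed as `method`:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.random.default_rng(0).normal(size=(50, 2))  # placeholder data

Z = linkage(X, method="ward")  # full merge history, one row per merge
dendrogram(Z)                  # draw the tree (needs matplotlib)
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
```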

Linkage Criteria

How you measure inter-cluster distance defines the resulting cluster shapes:

| Linkage | Formula | Behavior |
| --- | --- | --- |
| Single-link (min) | $\min_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Chaining: merges based on closest pair; sensitive to outliers |
| Complete-link (max) | $\max_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)$ | Compact clusters; breaks with very different cluster sizes |
| Average-link | $\frac{\sum_{p_i \in C_1, p_j \in C_2} \text{sim}(p_i, p_j)}{\lvert C_1 \rvert \cdot \lvert C_2 \rvert}$ | Compromise between single- and complete-link; less sensitive to outliers |
| Ward’s method | Minimizes total within-cluster variance after merge | Tends to produce similarly-sized, compact clusters |

Complexity

Naive agglomerative clustering costs $O(n^3)$ time (improvable to $O(n^2 \log n)$ with a priority queue) and needs the full $O(n^2)$ pairwise-distance matrix. This is why hierarchical clustering is rarely used on large datasets — the memory requirement alone makes it impractical above ~10,000 points.

Choosing the Right Algorithm

| Scenario | Algorithm |
| --- | --- |
| Unknown cluster shape, noisy data | DBSCAN |
| Large dataset, globular clusters expected | K-Means++ |
| Need interpretable medoid representatives | K-Medoids |
| Need full dendrogram / no predetermined $k$ | Agglomerative |
| Very large dataset | K-Means (most scalable) |