7 min read · 8 January 2023

Dimensionality Reduction: PCA, t-SNE, and UMAP

A practical guide to the three main dimensionality reduction techniques — when to use each, what they preserve, and how to avoid the common mistake of using t-SNE embeddings as features.

Why Dimensionality Reduction Matters

High-dimensional data suffers from the curse of dimensionality: as the number of dimensions grows, distances between points become increasingly similar (all points end up roughly equidistant), making nearest-neighbor search meaningless and density estimation unreliable. Beyond these numerical problems, humans can only inspect data visually in two or three dimensions, so exploration requires projection.
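A quick way to see this (a toy NumPy sketch on random Gaussian data; the sample size and dimensions are arbitrary): the ratio between the nearest and farthest distance from a point approaches 1 as the dimensionality grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.normal(size=(500, d))                 # 500 random points in d dimensions
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from the first point to all others
    print(f"d={d:4d}  nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
```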

Dimensionality reduction serves two distinct purposes:

  1. Visualization — project to 2D or 3D for exploratory analysis and cluster discovery
  2. Preprocessing — remove noise dimensions that dilute signal and hurt downstream model performance

These goals require different algorithms. The right choice depends on which structure you need to preserve.

PCA: Linear Variance Maximization

Principal Component Analysis finds the directions of maximum variance in the data. The principal components are the eigenvectors of the covariance matrix of the centered data, $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$; the corresponding eigenvalues measure how much variance each component captures.

Algorithm:

  1. Center the data: $\mathbf{X} \leftarrow \mathbf{X} - \bar{\mathbf{X}}$
  2. Compute the covariance matrix (or use SVD directly for numerical stability)
  3. Take the top $k$ eigenvectors as the projection basis
  4. Project: $\mathbf{Z} = \mathbf{X} \mathbf{W}_k$ where $\mathbf{W}_k$ is the $d \times k$ matrix of top eigenvectors
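As a concrete reference, here is a minimal NumPy sketch of these four steps (the function and variable names are illustrative; in practice scikit-learn's `PCA` does the same thing via SVD):

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                   # 1. center
    # 2. SVD of the centered data is numerically preferable to forming C explicitly;
    #    the rows of Vt are the eigenvectors of C = (1/n) X^T X
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_k = Vt[:k].T                                    # 3. d x k projection basis
    Z = X_centered @ W_k                              # 4. project
    eigenvalues = S ** 2 / len(X)                     # variance captured by each component
    return Z, W_k, eigenvalues

X = np.random.default_rng(0).normal(size=(200, 50))   # stand-in data: 200 points, 50 dims
Z, W, var = pca(X, k=5)
print(Z.shape, var[:5])                                # (200, 5) and the top-5 variances
```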

What PCA preserves: global structure — directions of maximum variance. Points that are far apart in high-dimensional space tend to remain far apart in the PCA projection.

When to use PCA:

  - As a preprocessing step to remove noise dimensions before a downstream model
  - When you need interpretable, linear features (each component is a weighted combination of the original variables)
  - When global structure — the directions of maximum variance — is what you want to preserve

Choosing $k$: plot the cumulative explained variance ratio vs. number of components. Choose $k$ where the curve flattens — typically 90–95% explained variance is a good target for preprocessing.
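For example, with scikit-learn's `PCA` you can read the cumulative explained variance ratio directly; the synthetic `X_train` below is just a stand-in for your training matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.default_rng(0).normal(size=(200, 50))   # stand-in for your training data

cumvar = np.cumsum(PCA().fit(X_train).explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.95)) + 1                   # smallest k with >= 95% explained variance
print(f"keep {k} components ({cumvar[k - 1]:.1%} explained)")

X_reduced = PCA(n_components=k).fit_transform(X_train)
```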

LDA (Linear Discriminant Analysis): the supervised variant of PCA. Instead of maximizing total variance, LDA maximizes the ratio of between-class variance to within-class variance. Requires class labels; produces at most $C-1$ components (where $C$ is the number of classes).
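A short scikit-learn sketch, using the Iris dataset only because it ships with labels:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                  # 3 classes, so at most 2 components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)   # supervised: needs labels
print(X_lda.shape)                                 # (150, 2)
```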

t-SNE: Non-linear Neighborhood Preservation

t-Distributed Stochastic Neighbor Embedding constructs a probability distribution over pairs of points in the high-dimensional space (using a Gaussian kernel) and a corresponding distribution in 2D (using a heavier-tailed Student-t kernel). It then minimizes the KL divergence between the two distributions via gradient descent.

The Student-t kernel in the low-dimensional space is key: it prevents the “crowding problem” where moderately distant points in high dimensions get crowded into a small region in 2D. The heavy tail allows distant points to be mapped further apart.
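The effect of the heavier tail is easy to see numerically; this toy comparison of the two (unnormalized) kernels uses arbitrary distances and a fixed bandwidth:

```python
import numpy as np

d = np.array([0.5, 1.0, 2.0, 4.0])       # embedding distances (arbitrary)
gaussian = np.exp(-d ** 2)                # Gaussian similarity (bandwidth fixed for simplicity)
student_t = 1.0 / (1.0 + d ** 2)          # Student-t similarity with one degree of freedom
print(np.round(gaussian, 4))   # [0.7788 0.3679 0.0183 0.    ]  -> vanishes quickly
print(np.round(student_t, 4))  # [0.8    0.5    0.2    0.0588]  -> heavy tail keeps moderate similarity
```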

What t-SNE preserves: local structure — nearby points in high dimensions tend to remain nearby in the projection. Clusters that exist in high-dimensional space appear as visible clusters in the 2D projection.

What t-SNE does NOT preserve:

  - Global structure: the relative positions of clusters in the 2D plot carry little meaning
  - Distances: inter-cluster distances and cluster sizes are artifacts of the optimization, not the data
  - Density: dense and sparse regions can appear equally spread out in the embedding

Hyperparameters:

  - Perplexity: roughly the effective number of neighbors each point considers; typical values are 5–50
  - Learning rate and number of iterations: too few iterations can leave clusters unformed
  - Random seed: different seeds give different layouts, so fix it for comparability
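A minimal usage sketch with scikit-learn's `TSNE`; the parameter values are just the common starting points mentioned above, and the input matrix is a synthetic stand-in:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.default_rng(0).normal(size=(500, 50))          # stand-in for your data
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X)
# X_2d is for plotting only -- do not feed it to a downstream model
```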

Critical warning: never use t-SNE embeddings as features for downstream models. The embedding is not stable across runs, the distances are not meaningful, and the mapping is not invertible. t-SNE is a visualization tool only.

UMAP: Manifold-Based Reduction

Uniform Manifold Approximation and Projection constructs a weighted graph in high-dimensional space (based on k-nearest neighbors) and optimizes a low-dimensional embedding that preserves this graph structure using a cross-entropy loss.

What UMAP preserves: local structure (like t-SNE), plus more of the global structure than t-SNE: the relative positions of clusters in the embedding are more meaningful.

Practical advantages over t-SNE:

  - Substantially faster, so it scales to much larger datasets
  - Reproducible embeddings when the random seed is fixed
  - Better preservation of global relationships between clusters
  - A fitted model can embed new, unseen points

When to use UMAP over t-SNE: when you care about the relationships between clusters (not just within-cluster structure), when the dataset is large (>50K points), or when you need a reproducible embedding across multiple runs.
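A sketch using the umap-learn package (assuming it is installed); the data matrices are synthetic stand-ins, and the parameters shown are the library's usual defaults:

```python
import numpy as np
import umap                                      # provided by the umap-learn package

rng = np.random.default_rng(0)
X_train, X_new = rng.normal(size=(2000, 50)), rng.normal(size=(100, 50))   # stand-in data

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
X_2d = reducer.fit_transform(X_train)            # embed the training data
X_new_2d = reducer.transform(X_new)              # unlike t-SNE, new points can be embedded later
```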

Decision Guide

| Goal | Algorithm |
| --- | --- |
| Preprocessing, interpretable features | PCA |
| Supervised reduction with class labels | LDA |
| Exploratory visualization, cluster discovery | t-SNE or UMAP |
| Large dataset visualization (>50K points) | UMAP |
| Stable, reproducible embedding | UMAP > t-SNE |
| Understanding global cluster relationships | UMAP |

Common Mistakes

Using t-SNE for preprocessing: t-SNE embeddings should not be fed into a classifier. The distances are non-metric and the embedding is not stable.

Interpreting cluster sizes in t-SNE: cluster sizes and inter-cluster distances in t-SNE are artifacts of the algorithm, not reflections of the actual data density.

Fitting PCA on test data: compute PCA on the training set only and apply the learned transformation to test data. Fitting PCA on test data leaks information.
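One way to make this leak impossible is a scikit-learn Pipeline, so the PCA fit only ever sees the training split (the Iris data and logistic regression here are just placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(*load_iris(return_X_y=True), random_state=0)

model = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)                 # PCA is fit on the training split only
print(model.score(X_test, y_test))          # the learned projection is merely applied to the test split
```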

Choosing perplexity carelessly: t-SNE results depend heavily on perplexity. Always try at least three values (5, 30, 50) and compare the results before drawing conclusions.

dimensionality-reduction pca t-sne umap unsupervised-learning