Why Dimensionality Reduction Matters
High-dimensional data suffers from the curse of dimensionality: as the number of dimensions grows, distances between points concentrate (all points become roughly equidistant), making nearest-neighbor search uninformative and density estimation unreliable. Beyond these numerical problems, people can only inspect data visually through 2D or 3D projections.
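A minimal sketch of this distance concentration, using uniform random points (the point count and dimensions are illustrative):

```python
# Distance concentration: as dimensionality grows, the gap between the nearest
# and farthest neighbor shrinks relative to the average distance.
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```

The printed contrast should drop steadily as $d$ grows, which is exactly what makes nearest-neighbor reasoning unreliable.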
Dimensionality reduction serves two distinct purposes:
- Visualization — project to 2D or 3D for exploratory analysis and cluster discovery
- Preprocessing — remove noise dimensions that dilute signal and hurt downstream model performance
These goals require different algorithms. The right choice depends on which structure you need to preserve.
PCA: Linear Variance Maximization
Principal Component Analysis finds the directions of maximum variance in the data. The principal components are the eigenvectors of the covariance matrix of the centered data, $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$; the corresponding eigenvalues measure how much variance each component captures.
Algorithm:
- Center the data: $\mathbf{X} \leftarrow \mathbf{X} - \bar{\mathbf{X}}$
- Compute the covariance matrix (or use SVD directly for numerical stability)
- Take the top $k$ eigenvectors as the projection basis
- Project: $\mathbf{Z} = \mathbf{X} \mathbf{W}_k$ where $\mathbf{W}_k$ is the $d \times k$ matrix of top eigenvectors
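These steps can be sketched with plain NumPy; taking the SVD of the centered data avoids forming $\mathbf{X}^T\mathbf{X}$ explicitly (function and variable names here are illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Project X (n x d) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                    # 1. center
    # 2. SVD of the centered data: right singular vectors = eigenvectors of C
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_k = Vt[:k].T                                     # 3. top-k eigenvectors (d x k)
    Z = X_centered @ W_k                               # 4. project
    eigvals = S**2 / X.shape[0]                        # eigenvalues of C = (1/n) X^T X
    return Z, W_k, eigvals[:k]
```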
What PCA preserves: global structure — directions of maximum variance. Points that are far apart in high-dimensional space tend to remain far apart in the PCA projection.
When to use PCA:
- As a preprocessing step before training (remove noise dimensions, decorrelate features)
- When you need interpretable components (each component is a linear combination of original features)
- When computational speed matters (PCA is $O(nd^2)$ or $O(n^2 d)$ via SVD — fast)
- When the data is approximately Gaussian (a Gaussian is fully described by its mean and covariance, so a variance-based linear projection captures all of the structure there is)
Choosing $k$: plot the cumulative explained variance ratio vs. number of components. Choose $k$ where the curve flattens — typically 90–95% explained variance is a good target for preprocessing.
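A sketch of this selection with scikit-learn, assuming a training matrix `X_train` and a 95% variance target (both illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)                           # fit all components on training data
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1     # smallest k reaching 95% variance
print(f"keep {k} components ({cumulative[k - 1]:.1%} of the variance)")
```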
LDA (Linear Discriminant Analysis): the supervised variant of PCA. Instead of maximizing total variance, LDA maximizes the ratio of between-class variance to within-class variance. Requires class labels; produces at most $C-1$ components (where $C$ is the number of classes).
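A minimal scikit-learn sketch of the difference in interface: LDA needs labels and caps the number of components at $C-1$ (`X_train`, `y_train`, and `X_test` are assumed to exist):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_classes = len(set(y_train))
lda = LinearDiscriminantAnalysis(n_components=min(2, n_classes - 1))
Z_train = lda.fit_transform(X_train, y_train)   # unlike PCA, fitting requires labels
Z_test = lda.transform(X_test)                  # new data reuses the learned projection
```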
t-SNE: Non-linear Neighborhood Preservation
t-Distributed Stochastic Neighbor Embedding constructs a probability distribution over pairs of points in the high-dimensional space (using a Gaussian kernel) and a corresponding distribution in 2D (using a heavier-tailed Student-t kernel). It then minimizes the KL divergence between the two distributions via gradient descent.
The Student-t kernel in the low-dimensional space is key: it prevents the “crowding problem” where moderately distant points in high dimensions get crowded into a small region in 2D. The heavy tail allows distant points to be mapped further apart.
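For reference, the standard definitions (with $\sigma_i$ chosen per point so that the effective neighborhood size matches the perplexity):

$$p_{j|i} = \frac{\exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$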
What t-SNE preserves: local structure — nearby points in high dimensions tend to remain nearby in the projection. Clusters that exist in high-dimensional space appear as visible clusters in the 2D projection.
What t-SNE does NOT preserve:
- Global distances: the distance between two clusters in a t-SNE plot is not meaningful
- Density: a large, dense cluster and a small, sparse cluster may appear the same size
- Stability: t-SNE embeddings are stochastic (different random seeds → different layouts)
Hyperparameters:
- Perplexity (5–50): roughly the number of effective neighbors per point. Low perplexity = tight local structure; high perplexity = more global. Run with multiple values and compare (see the sketch after this list).
- Learning rate (10–1000): controls convergence speed. The default usually works.
- n_iter (≥250): more iterations = more stable result
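A sketch of a perplexity sweep with scikit-learn's `TSNE` (the perplexity values mirror the ones recommended later; `X` is assumed to be an `(n_samples, n_features)` array):

```python
from sklearn.manifold import TSNE

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)   # one 2D layout per perplexity
# Plot each embeddings[perplexity] side by side and compare before drawing conclusions.
```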
Critical warning: never use t-SNE embeddings as features for downstream models. The embedding is not stable across runs, the distances are not meaningful, and the mapping is not invertible. t-SNE is a visualization tool only.
UMAP: Manifold-Based Reduction
Uniform Manifold Approximation and Projection constructs a weighted graph in high-dimensional space (based on k-nearest neighbors) and optimizes a low-dimensional embedding that preserves this graph structure using a cross-entropy loss.
What UMAP preserves: local structure (like t-SNE) plus, to a greater degree than t-SNE, global structure, so the relative positions of clusters are more meaningful.
Practical advantages over t-SNE:
- Faster: scales better to large datasets (roughly linear in practice, vs. $O(n^2)$ for exact t-SNE and $O(n \log n)$ for Barnes–Hut t-SNE)
- More stable: lower variance across runs with the same random seed
- Usable as preprocessing: the embedding is more stable and globally consistent, making UMAP features (cautiously) more useful than t-SNE features for downstream models
- Handles larger output dimensionality naturally: UMAP can embed into more than 2 or 3 components for general-purpose reduction, whereas t-SNE is designed for 2D/3D visualization
When to use UMAP over t-SNE: when you care about the relationships between clusters (not just within-cluster structure), when the dataset is large (>50K points), or when you need a reproducible embedding across multiple runs.
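A minimal sketch with the `umap-learn` package (parameter values are the library's common defaults; `X` and `X_new` are assumed arrays):

```python
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
Z = reducer.fit_transform(X)       # 2D embedding for visualization
Z_new = reducer.transform(X_new)   # unlike t-SNE, a fitted UMAP can project new points
```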
Decision Guide
| Goal | Algorithm |
|---|---|
| Preprocessing, interpretable features | PCA |
| Supervised reduction with class labels | LDA |
| Exploratory visualization, cluster discovery | t-SNE or UMAP |
| Large dataset visualization (>50K points) | UMAP |
| Stable, reproducible embedding | UMAP > t-SNE |
| Understanding global cluster relationships | UMAP |
Common Mistakes
Using t-SNE for preprocessing: t-SNE embeddings should not be fed into a classifier. The distances are non-metric and the embedding is not stable.
Interpreting cluster sizes in t-SNE: cluster sizes and inter-cluster distances in t-SNE are artifacts of the algorithm, not reflections of the actual data density.
Fitting PCA on test data: compute PCA on the training set only and apply the learned transformation to test data. Fitting PCA on test data leaks information.
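A sketch of the leak-free pattern with scikit-learn (`X_train`/`X_test` and the component count are illustrative):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
Z_train = pca.fit_transform(X_train)   # learn the components from training data only
Z_test = pca.transform(X_test)         # reuse them; never call fit on the test set
```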
Choosing perplexity carelessly: t-SNE results depend heavily on perplexity. Always try at least three values (5, 30, 50) and compare the results before drawing conclusions.