Dimensionality Reduction: PCA, t-SNE, and UMAP

High-dimensional data is a trap. As dimensions grow, distances between points become increasingly similar - all points are roughly equidistant - making nearest-neighbor search meaningless and density estimation unreliable. Beyond the numerical problems, human understanding requires 2D or 3D projections. Dimensionality reduction is how you get there without losing what matters.
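The equidistance effect is easy to see numerically. The sketch below is an illustration only, assuming points drawn uniformly from the unit hypercube; the helper name `relative_contrast` is made up for this example. It measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import numpy as np

def relative_contrast(n_points: int, dim: int, seed: int = 0) -> float:
    """Return (max - min) / min over distances from one query point to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform in the unit hypercube (assumption)
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast shrinks as the dimension grows: nearest and farthest
# neighbors end up at nearly the same distance.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(1000, dim), 3))
```

When the printed contrast is close to zero, "nearest neighbor" carries little information, which is the failure mode described above.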

But “what matters” depends entirely on your goal. Visualization and preprocessing require different algorithms. Choosing the wrong one produces misleading results and wasted effort.

Dimensionality Reduction Serves Two Distinct Purposes

Visualization - project to 2D or 3D for exploratory analysis and cluster discovery
Preprocessing - remove noise dimensions that dilute signal and hurt downstream model performance

The right choice depends on which structure you need to preserve.

PCA: Linear Variance Maximization

Principal Component Analysis finds the directions of maximum variance in the data. The principal components are the eigenvectors of the covariance matrix $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$, where $\mathbf{X}$ is the mean-centered $n \times d$ data matrix ($n$ samples, $d$ features); the corresponding eigenvalues measure how much variance each component captures.

Algorithm:

Center the data: $\mathbf{X} \leftarrow \mathbf{X} - \bar{\mathbf{X}}$
Compute the covariance matrix (or use SVD directly for numerical stability)
Take the top $k$ eigenvectors as the projection basis
Project: $\mathbf{Z} = \mathbf{X} \mathbf{W}_k$ where $\mathbf{W}_k$ is the $d \times k$ matrix of top eigenvectors
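The four steps above can be sketched in NumPy as follows. This is a minimal illustration, not a replacement for a library implementation: the function name `pca_project` and the toy data are assumptions made for the example, and the SVD route is used instead of forming the covariance matrix explicitly.

```python
import numpy as np

def pca_project(X: np.ndarray, k: int):
    """Project X (n samples x d features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                    # step 1: center
    # Step 2: SVD of the centered data. The rows of Vt are the eigenvectors
    # of C = X^T X / n, already sorted by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_k = Vt[:k].T                                     # step 3: d x k basis
    Z = X_centered @ W_k                               # step 4: project
    eigenvalues = singular_values**2 / X.shape[0]      # eigenvalues of C
    return Z, W_k, eigenvalues[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Z, W_k, eigenvalues = pca_project(X, k=2)

# Cross-check against an explicit eigendecomposition of the covariance matrix.
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / X.shape[0]
top = np.sort(np.linalg.eigvalsh(C))[::-1][:2]
print(np.allclose(eigenvalues, top))
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is the sense in which eigenvalues "measure how much variance each component captures".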

What PCA preserves: global structure - directions of maximum variance. Points that are far apart in high-dimensional space tend to remain far apart in the PCA projection.

When to use PCA:

As a preprocessing step before training (remove noise dimensions, decorrelate features)
When you need interpretable components (each component is a linear combination of original features)
When computational speed matters (PCA is fast)
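When the data is approximately Gaussian (PCA minimizes squared reconstruction error among linear projections, and for Gaussian data second-order statistics describe the whole distribution, so little is lost by looking only at variance)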

Choosing $k$: plot the cumulative explained variance ratio versus number of components. Choose $k$ where the curve flattens - typically 90 to 95% explained variance is a good target for preprocessing.
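One way to apply this rule with scikit-learn is sketched below. The digits dataset and the 95% target are stand-ins chosen for the example; the `cumulative` array is what you would plot against the number of components.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                     # 1797 samples, 64 pixel features
pca = PCA().fit(X)                         # keep every component
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the target.
target = 0.95
k = int(np.searchsorted(cumulative, target) + 1)
print(k, cumulative[k - 1])

# scikit-learn applies the same rule when n_components is a float in (0, 1).
pca_95 = PCA(n_components=target).fit(X)
print(pca_95.n_components_)
```

Treat the threshold as a starting point: if the downstream model is cheap to retrain, validating a few values of $k$ directly is more reliable than any fixed percentage.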

LDA (Linear Discriminant Analysis): the supervised variant of PCA. Instead of maximizing total variance, LDA maximizes the ratio of between-class variance to within-class variance. Requires class labels; produces at most $C-1$ components (where $C$ is the number of classes).
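A short scikit-learn sketch of the $C-1$ limit, using the iris dataset ($C = 3$) purely as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes

# With C = 3 classes, LDA can produce at most C - 1 = 2 components.
lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)                # labels are required, unlike PCA
print(Z.shape)                             # (150, 2)

# Asking for more than C - 1 components is rejected at fit time.
rejected = False
try:
    LinearDiscriminantAnalysis(n_components=3).fit(X, y)
except ValueError as err:
    rejected = True
    print("rejected:", err)
```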

t-SNE: Nonlinear Neighborhood Preservation

t-Distributed Stochastic Neighbor Embedding constructs a probability distribution over pairs of points in the high-dimensional space (using a Gaussian kernel) and a corresponding distribution in 2D (using a heavier-tailed Student-t kernel). It then minimizes the KL divergence between the two distributions via gradient descent.
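Written out in the standard formulation, the quantities look as follows. The symbols are introduced here for illustration and are not used elsewhere in this article: $x_i$ are the high-dimensional points, $y_i$ their low-dimensional images, and $\sigma_i$ a per-point bandwidth chosen so that the distribution around $x_i$ has the requested perplexity.

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}$$

$$\mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Gradient descent moves the $y_i$ to reduce this divergence. Because the loss weights each term by $p_{ij}$, placing true neighbors far apart is penalized heavily, while placing distant points close together costs little.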

The Student-t kernel in the low-dimensional space is the key insight: it prevents the “crowding problem” where moderately distant points in high dimensions get crowded into a small region in 2D. The heavy tail allows distant points to be mapped further apart.
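The difference between the two tails can be checked directly. The sketch below compares unnormalized kernel values at a few distances, assuming a Gaussian bandwidth of 1 and a Student-t with one degree of freedom; the specific distances are arbitrary.

```python
import numpy as np

distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
gaussian = np.exp(-distances**2 / 2.0)     # unnormalized, sigma = 1 (assumption)
student_t = 1.0 / (1.0 + distances**2)     # unnormalized, 1 degree of freedom

for d, g, t in zip(distances, gaussian, student_t):
    print(f"d={d:4.1f}  gaussian={g:.2e}  student_t={t:.2e}")
```

At small distances the two kernels are comparable, but at larger distances the Gaussian value collapses toward zero while the Student-t value decays only polynomially. To reproduce a small similarity in 2D, a pair of points therefore has to sit much farther apart under the Student-t kernel, which is what frees up room in the embedding.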

What t-SNE preserves: local structure - nearby points in high dimensions tend to remain nearby in the projection. Clusters that exist in high-dimensional space appear as visible clusters in the 2D projection.

What t-SNE does NOT preserve:

Global distances: the distance between two clusters in a t-SNE plot is not meaningful
Density: t-SNE tends to expand dense regions and contract sparse ones, so a tight, dense cluster and a diffuse, sparse cluster may appear the same size
Stability: t-SNE embeddings are stochastic - different random seeds produce different layouts

Hyperparameters:

Perplexity (5 to 50): roughly the number of effective neighbors per point. Low perplexity = tight local structure; high perplexity = more global. Run with multiple values.
Learning rate (10 to 1000): controls convergence speed. The default usually works.
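n_iter (250 or more): the number of optimization iterations. In scikit-learn, 250 is the minimum and 1000 the default (releases from 1.5 onward name this parameter max_iter). Stopping too early leaves the layout unsettled, so more iterations produce a more stable result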
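A minimal scikit-learn sketch of running several perplexity values side by side. The dataset, the 300-point subset, and the three perplexity values are choices made for this example; learning rate and iteration count are left at their defaults.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:300]               # small subset keeps the run short

embeddings = {}
for perplexity in (5, 30, 50):             # must be smaller than the number of points
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    # kl_divergence_ is the final value of the loss being minimized.
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Plot the three embeddings next to each other. Structure that shows up at every perplexity is more trustworthy than structure that appears at only one.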

Critical warning: never use t-SNE embeddings as features for downstream models. The embedding is not stable across runs, the distances are not meaningful, and the mapping is not invertible. t-SNE is a visualization tool only.
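Both problems are visible in scikit-learn's implementation. The sketch below is an illustration on a 200-point subset of the digits dataset; random initialization is requested explicitly so that the seed actually affects the layout.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:200]

# Same data and settings, different seeds: the coordinates do not match.
emb_a = TSNE(init="random", random_state=0).fit_transform(X)
emb_b = TSNE(init="random", random_state=1).fit_transform(X)
print(np.allclose(emb_a, emb_b))

# TSNE offers fit_transform only. There is no transform method for
# mapping unseen points into an embedding that was fitted earlier.
print(hasattr(TSNE, "transform"))
```

Both checks should print False. A feature that cannot be recomputed for new data, and that changes when the seed changes, cannot be served to a model in production.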

UMAP: Manifold-Based Reduction

Uniform Manifold Approximation and Projection constructs a weighted graph in high-dimensional space (based on k-nearest neighbors) and optimizes a low-dimensional embedding that preserves this graph structure using a cross-entropy loss.

What UMAP preserves: local structure (like t-SNE) and, to a greater degree than t-SNE, global structure - the relative positions of clusters tend to be more meaningful.

Practical advantages over t-SNE:

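Faster: scales better to large datasets (empirically near-linear, versus $O(n^2)$ for exact t-SNE and $O(n \log n)$ for the Barnes-Hut approximation)
More stable: layouts tend to vary less across random seeds, and fixing the seed makes a run reproducible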
More usable as preprocessing: the embedding is more stable and globally consistent, making UMAP features more defensible than t-SNE features for downstream models
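Handles larger $k$ naturally: the embedding dimension $k$ is not limited to 2 or 3, whereas Barnes-Hut t-SNE implementations are typically restricted to very low output dimensions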

When to use UMAP over t-SNE: when you care about the relationships between clusters (not just within-cluster structure), when the dataset is large (more than 50K points), or when you need a reproducible embedding across multiple runs.
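A usage sketch follows. It assumes the third-party `umap-learn` package, which this article does not otherwise name; the import is guarded so the snippet still runs where the package is missing. The parameter values are illustrative, not recommendations.

```python
from sklearn.datasets import load_digits

try:
    import umap                            # from the "umap-learn" package (assumption)
except ImportError:
    umap = None

X = load_digits().data[:300]
embedding = None

if umap is not None:
    # n_neighbors sets the size of the k-nearest-neighbor graph;
    # random_state fixes the seed so the run is reproducible.
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
    embedding = reducer.fit_transform(X[:250])
    # Unlike standard t-SNE, a fitted UMAP model can place unseen points.
    new_points = reducer.transform(X[250:])
    print(embedding.shape, new_points.shape)
```

The `transform` call is the practical reason UMAP is easier to defend as a preprocessing step: the same fitted mapping can be applied to training and test data.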

Decision Guide

Goal	Algorithm
Preprocessing, interpretable features	PCA
Supervised reduction with class labels	LDA
Exploratory visualization, cluster discovery	t-SNE or UMAP
Large dataset visualization (more than 50K points)	UMAP
Stable, reproducible embedding	UMAP over t-SNE
Understanding global cluster relationships	UMAP

Common Mistakes

Using t-SNE for preprocessing: t-SNE embeddings should not be fed into a classifier. The distances in the embedding are not meaningful, the layout is unstable across runs, and standard t-SNE provides no way to map new points into an existing embedding.

Interpreting cluster sizes in t-SNE: cluster sizes and inter-cluster distances in t-SNE are artifacts of the algorithm, not reflections of actual data density.

Fitting PCA on test data: fit PCA on the training set only, then apply the learned transformation to the test data. Fitting on the test set, or on training and test data combined, leaks information about the test distribution into the model.
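One way to make this hard to get wrong is to put PCA inside a scikit-learn pipeline, so that fitting happens on training rows only. The dataset, the classifier, and the 95% variance target below are stand-ins chosen for the sketch.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = make_pipeline(PCA(n_components=0.95), LogisticRegression(max_iter=2000))
model.fit(X_train, y_train)                # PCA is fitted on training rows only
accuracy = model.score(X_test, y_test)     # test rows are transformed, never fitted
print(accuracy)
```

The same pipeline object can be passed to cross-validation utilities, which then refit PCA inside each fold instead of once on the full dataset.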

Choosing perplexity carelessly: t-SNE results depend heavily on perplexity. Always try at least three values (5, 30, 50) and compare the results before drawing conclusions.