High-dimensional data is a trap. As dimensions grow, distances between points become increasingly similar - a point's nearest and farthest neighbors end up at nearly the same distance - making nearest-neighbor search far less informative and density estimation unreliable. Beyond the numerical problems, human understanding requires 2D or 3D projections. Dimensionality reduction is how you get there without losing what matters.
But “what matters” depends entirely on your goal. Visualization and preprocessing require different algorithms. Choosing the wrong one produces misleading results and wasted effort.
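The distance-concentration effect mentioned above is easy to check numerically. The following is a minimal sketch, assuming points drawn uniformly from the unit hypercube; the function name `relative_contrast` and the sample sizes are illustrative choices, not part of any library. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points, dim, rng):
    """Return (max - min) / min over distances from one query point to random points."""
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube (assumption)
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {dim: relative_contrast(1000, dim, rng) for dim in (2, 10, 100, 1000)}
for dim, contrast in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={contrast:.3f}")
```

The contrast shrinks as the dimension grows, which is the sense in which "nearest" stops carrying much information.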
Dimensionality Reduction Serves Two Distinct Purposes
- Visualization - project to 2D or 3D for exploratory analysis and cluster discovery
- Preprocessing - remove noise dimensions that dilute signal and hurt downstream model performance
The right choice depends on which structure you need to preserve.
PCA: Linear Variance Maximization
Principal Component Analysis finds the directions of maximum variance in the data. The principal components are the eigenvectors of the data covariance matrix $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$, where $\mathbf{X}$ is the mean-centered $n \times d$ data matrix; the corresponding eigenvalues measure how much variance each component captures.
Algorithm:
- Center the data: $\mathbf{X} \leftarrow \mathbf{X} - \bar{\mathbf{X}}$
- Compute the covariance matrix (or use SVD directly for numerical stability)
- Take the top $k$ eigenvectors as the projection basis
- Project: $\mathbf{Z} = \mathbf{X} \mathbf{W}_k$ where $\mathbf{W}_k$ is the $d \times k$ matrix of top eigenvectors
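The four steps above can be sketched in a few lines of NumPy. This is an illustrative implementation that uses the SVD route rather than forming the covariance matrix explicitly; the function name `pca_svd` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_svd(X, k):
    """Project X (n samples by d features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                  # step 1: center the data
    # step 2: SVD of the centered data; rows of Vt are the covariance eigenvectors
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_k = Vt[:k].T                                   # step 3: d x k projection basis
    Z = X_centered @ W_k                             # step 4: n x k projected data
    eigenvalues = S**2 / X.shape[0]                  # eigenvalues of C = (1/n) X^T X
    return Z, W_k, eigenvalues[:k]

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 5 features with very different scales
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
Z, W_k, variances = pca_svd(X, k=2)
print(Z.shape, variances)
```

The variance of each column of `Z` equals the corresponding eigenvalue, which is a quick sanity check on any PCA implementation.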
What PCA preserves: global structure - directions of maximum variance. Points that are far apart in high-dimensional space tend to remain far apart in the PCA projection.
When to use PCA:
- As a preprocessing step before training (remove noise dimensions, decorrelate features)
- When you need interpretable components (each component is a linear combination of original features)
- When computational speed matters (PCA is fast)
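- When the data is approximately Gaussian (uncorrelated Gaussian components are also independent, so the principal components capture all of the structure; for any data, PCA is the linear projection with the lowest mean squared reconstruction error)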
Choosing $k$: plot the cumulative explained variance ratio versus the number of components. Choose $k$ where the curve flattens, or take the smallest $k$ that reaches a target ratio - 90 to 95% explained variance is a common target for preprocessing.
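As a rough sketch of the threshold rule, the cumulative explained variance ratio can be computed from the singular values of the centered data. The helper name `choose_k` and the synthetic data are hypothetical.

```python
import numpy as np

def choose_k(X, target=0.95):
    """Return the smallest k whose cumulative explained variance ratio reaches target."""
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    ratios = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(ratios)
    # searchsorted gives the first index where cumulative >= target
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(cumulative)), cumulative

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples, 8 features with decaying scales
X = rng.normal(size=(500, 8)) * np.array([6.0, 4.0, 2.0, 1.0, 0.5, 0.3, 0.2, 0.1])
k, cumulative = choose_k(X, target=0.95)
print(k, np.round(cumulative, 3))
```

If you use scikit-learn, passing a fraction such as `n_components=0.95` to `PCA` with the full SVD solver should select the component count in the same way.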
LDA (Linear Discriminant Analysis): a supervised counterpart to PCA. Instead of maximizing total variance, LDA maximizes the ratio of between-class variance to within-class variance. It requires class labels and produces at most $C-1$ components (where $C$ is the number of classes), or fewer when the data has fewer than $C-1$ features.
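To make the between-class versus within-class criterion concrete, here is a minimal NumPy sketch of Fisher's formulation: build the two scatter matrices and take the leading eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$. The function name `lda_directions` and the toy data are assumptions for illustration; a production implementation would add regularization and shrinkage options.

```python
import numpy as np

def lda_directions(X, y, n_components):
    """Return discriminant directions (d x n_components) and all sorted eigenvalues."""
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((d, d))   # within-class scatter
    S_b = np.zeros((d, d))   # between-class scatter
    for label in np.unique(y):
        X_c = X[y == label]
        mean_c = X_c.mean(axis=0)
        S_w += (X_c - mean_c).T @ (X_c - mean_c)
        diff = (mean_c - overall_mean).reshape(-1, 1)
        S_b += len(X_c) * (diff @ diff.T)
    # Eigenvectors of pinv(S_w) @ S_b maximize between-class over within-class variance
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order[:n_components]], eigvals[order]

rng = np.random.default_rng(0)
# Hypothetical data: 3 classes, 50 samples each, 4 features
means = np.array([[0, 0, 0, 0], [3, 3, 0, 0], [0, 3, 3, 0]], dtype=float)
X = np.vstack([m + rng.normal(size=(50, 4)) for m in means])
y = np.repeat([0, 1, 2], 50)
W, eigvals = lda_directions(X, y, n_components=2)
Z = X @ W
print(Z.shape, np.round(eigvals, 4))
```

With three classes, only two eigenvalues are meaningfully nonzero, which is the $C-1$ limit in practice.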
t-SNE: Nonlinear Neighborhood Preservation
t-Distributed Stochastic Neighbor Embedding constructs a probability distribution over pairs of points in the high-dimensional space (using a Gaussian kernel) and a corresponding distribution in 2D (using a heavier-tailed Student-t kernel). It then minimizes the KL divergence between the two distributions via gradient descent.
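In symbols, the two distributions and the objective described above are usually written as follows. The notation is an assumption of this sketch: $\mathbf{x}_i$ are the high-dimensional points, $\mathbf{y}_i$ their low-dimensional images, and $\sigma_i$ a per-point bandwidth chosen so that the conditional distribution around $\mathbf{x}_i$ matches the requested perplexity.

$$p_{j|i} = \frac{\exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2\right)^{-1}}$$

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Because each term is weighted by $p_{ij}$, the loss penalizes placing true neighbors far apart much more than placing distant points close together, which is why local structure is what survives.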
The Student-t kernel in the low-dimensional space is the key insight: it prevents the “crowding problem” where moderately distant points in high dimensions get crowded into a small region in 2D. The heavy tail allows distant points to be mapped further apart.
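A quick numerical comparison shows how much more similarity the heavy tail assigns at moderate and large distances. This is only an illustration with unnormalized kernels and arbitrarily chosen distances, not part of the t-SNE algorithm itself.

```python
import numpy as np

distances = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
gaussian = np.exp(-distances**2)           # unnormalized Gaussian similarity
student_t = 1.0 / (1.0 + distances**2)     # unnormalized Student-t similarity (1 degree of freedom)
ratio = student_t / gaussian

for d, g, t, r in zip(distances, gaussian, student_t, ratio):
    print(f"distance={d:4.1f}  gaussian={g:.3e}  student_t={t:.3e}  ratio={r:.3e}")
```

The ratio grows rapidly with distance: to match a given small similarity, the Student-t kernel needs a much larger separation in 2D, so moderately distant points are pushed outward instead of being crowded together.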
What t-SNE preserves: local structure - nearby points in high dimensions tend to remain nearby in the projection. Clusters that exist in high-dimensional space appear as visible clusters in the 2D projection.
What t-SNE does NOT preserve:
- Global distances: the distance between two clusters in a t-SNE plot is not meaningful
- Density: t-SNE tends to even out densities, so a tight, dense cluster and a spread-out, sparse cluster may appear the same size
- Stability: t-SNE embeddings are stochastic - different random seeds produce different layouts
Hyperparameters:
- Perplexity (5 to 50): roughly the number of effective neighbors per point. Low perplexity = tight local structure; high perplexity = more global. Run with multiple values.
- Learning rate (10 to 1000): controls convergence speed. The default usually works.
- n_iter (250 or more; named max_iter in recent scikit-learn releases): more iterations give the optimization more time to converge
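A perplexity sweep might look like the sketch below, assuming scikit-learn's `TSNE` is the implementation in use. The synthetic blobs are hypothetical data, and the learning rate and iteration count are left at the library defaults.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: three Gaussian blobs in 20 dimensions, 60 points each
centers = rng.normal(scale=10.0, size=(3, 20))
X = np.vstack([c + rng.normal(size=(60, 20)) for c in centers])

embeddings = {}
for perplexity in (5, 30, 50):
    # Perplexity must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)
    print(perplexity, embeddings[perplexity].shape)
```

Plot the three embeddings side by side; structure that appears at only one perplexity value deserves suspicion.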
Critical warning: never use t-SNE embeddings as features for downstream models. The embedding is not stable across runs, the distances are not meaningful, and standard t-SNE learns no mapping that can place new points into an existing embedding or be inverted. t-SNE is a visualization tool only.
UMAP: Manifold-Based Reduction
Uniform Manifold Approximation and Projection constructs a weighted graph in high-dimensional space (based on k-nearest neighbors) and optimizes a low-dimensional embedding that preserves this graph structure using a cross-entropy loss.
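For readers who want the objective spelled out, the graph weights and the cross-entropy loss are commonly written as below. The notation is an assumption of this sketch: $\rho_i$ is the distance from $\mathbf{x}_i$ to its nearest neighbor, $\sigma_i$ is a per-point scale set from the chosen number of neighbors, $v_{j|i}$ is taken as zero when $\mathbf{x}_j$ is not among the nearest neighbors of $\mathbf{x}_i$, and $a$, $b$ are curve parameters derived from the minimum-distance setting.

$$v_{j|i} = \exp\left(-\frac{\max\left(0,\; d(\mathbf{x}_i, \mathbf{x}_j) - \rho_i\right)}{\sigma_i}\right), \qquad v_{ij} = v_{j|i} + v_{i|j} - v_{j|i}\, v_{i|j}$$

$$w_{ij} = \left(1 + a \|\mathbf{y}_i - \mathbf{y}_j\|^{2b}\right)^{-1}$$

$$\mathcal{L} = \sum_{i \neq j} \left[ v_{ij} \log \frac{v_{ij}}{w_{ij}} + (1 - v_{ij}) \log \frac{1 - v_{ij}}{1 - w_{ij}} \right]$$

The second term is the part t-SNE's KL objective lacks: it explicitly penalizes placing non-neighbors close together, which is one reason cluster positions in a UMAP plot carry somewhat more meaning.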
What UMAP preserves: both local structure (like t-SNE) and, better than t-SNE, global structure - the relative positions of clusters are more meaningful.
Practical advantages over t-SNE:
- Faster: scales better to large datasets (empirically near-linear, versus $O(n^2)$ for exact t-SNE and $O(n \log n)$ for the Barnes-Hut approximation)
- More stable: lower variance across runs with different random seeds
- More usable as preprocessing: the embedding is more stable and globally consistent, making UMAP features more defensible than t-SNE features for downstream models
- Handles larger $k$ naturally: the target dimension is not limited to 2 or 3, as it is in Barnes-Hut t-SNE implementations
When to use UMAP over t-SNE: when you care about the relationships between clusters (not just within-cluster structure), when the dataset is large (more than 50K points), or when you need a reproducible embedding across multiple runs.
Decision Guide
| Goal | Algorithm |
|---|---|
| Preprocessing, interpretable features | PCA |
| Supervised reduction with class labels | LDA |
| Exploratory visualization, cluster discovery | t-SNE or UMAP |
| Large dataset visualization (more than 50K points) | UMAP |
| Stable, reproducible embedding | UMAP over t-SNE |
| Understanding global cluster relationships | UMAP |
Common Mistakes
Using t-SNE for preprocessing: t-SNE embeddings should not be fed into a classifier. Distances in the embedding do not reflect distances in the original space, the embedding changes from run to run, and new points cannot be projected into it.
Interpreting cluster sizes in t-SNE: cluster sizes and inter-cluster distances in t-SNE are artifacts of the algorithm, not reflections of actual data density.
Fitting PCA on test data: compute PCA on the training set only and apply the learned transformation to test data. Fitting PCA on the test set, or on training and test data together, leaks information about the test distribution into the model.
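In scikit-learn terms, the leak-free pattern is to call `fit` (or `fit_transform`) on the training split and only `transform` on the test split. A minimal sketch, assuming scikit-learn is available and using hypothetical random data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))            # hypothetical feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=3)
Z_train = pca.fit_transform(X_train)      # learn the mean and components from training data only
Z_test = pca.transform(X_test)            # reuse the learned transformation, no refitting

print(Z_train.shape, Z_test.shape)
```

Wrapping the PCA step and the downstream model in a single `Pipeline` gives the same guarantee inside cross-validation, since the PCA is then refit on each training fold.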
Choosing perplexity carelessly: t-SNE results depend heavily on perplexity. Always try at least three values (5, 30, 50) and compare the results before drawing conclusions.