Why Dimensionality Reduction Matters
High-dimensional data suffers from the curse of dimensionality: as the number of dimensions grows, distances between points concentrate (all points become roughly equidistant), making nearest-neighbor search uninformative and density estimation unreliable. Beyond these numerical problems, people can only inspect data visually through 2D or 3D projections.
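A minimal sketch of this distance concentration, using uniform random points (the point count and dimensions are illustrative):

```python
# Distance concentration: as dimensionality grows, the gap between the nearest
# and farthest neighbor shrinks relative to the average distance.
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from the first point
    contrast = (dists.max() - dists.min()) / dists.mean()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
```

The printed contrast should drop steadily as $d$ grows, which is exactly what makes nearest-neighbor reasoning unreliable.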
Dimensionality reduction serves two distinct purposes:
- Visualization — project to 2D or 3D for exploratory analysis and cluster discovery
- Preprocessing — remove noise dimensions that dilute signal and hurt downstream model performance
These goals require different algorithms. The right choice depends on which structure you need to preserve.
PCA: Linear Variance Maximization
Principal Component Analysis finds the directions of maximum variance in the data. The principal components are the eigenvectors of the covariance matrix of the centered data, $\mathbf{C} = \frac{1}{n}\mathbf{X}^T\mathbf{X}$; the corresponding eigenvalues measure how much variance each component captures.
Algorithm:
- Center the data: $\mathbf{X} \leftarrow \mathbf{X} - \bar{\mathbf{X}}$
- Compute the covariance matrix (or use SVD directly for numerical stability)
- Take the top $k$ eigenvectors as the projection basis
- Project: $\mathbf{Z} = \mathbf{X} \mathbf{W}_k$ where $\mathbf{W}_k$ is the $d \times k$ matrix of top eigenvectors
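These steps can be sketched with plain NumPy; taking the SVD of the centered data avoids forming $\mathbf{X}^T\mathbf{X}$ explicitly (function and variable names here are illustrative):

```python
import numpy as np

def pca_project(X, k):
    """Project X (n x d) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)                    # 1. center
    # 2. SVD of the centered data: right singular vectors = eigenvectors of C
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    W_k = Vt[:k].T                                     # 3. top-k eigenvectors (d x k)
    Z = X_centered @ W_k                               # 4. project
    eigvals = S**2 / X.shape[0]                        # eigenvalues of C = (1/n) X^T X
    return Z, W_k, eigvals[:k]
```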
What PCA preserves: global structure — directions of maximum variance. Points that are far apart in high-dimensional space tend to remain far apart in the PCA projection.
When to use PCA:
- As a preprocessing step before training (remove noise dimensions, decorrelate features)
- When you need interpretable components (each component is a linear combination of original features)
- When computational speed matters (PCA is $O(nd^2)$ or $O(n^2 d)$ via SVD — fast)
- When the data is approximately Gaussian (a Gaussian is fully described by its mean and covariance, so a variance-based linear projection captures all of the structure there is)
Choosing $k$: plot the cumulative explained variance ratio vs. number of components. Choose $k$ where the curve flattens — typically 90–95% explained variance is a good target for preprocessing.
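A sketch of this selection with scikit-learn, assuming a training matrix `X_train` and a 95% variance target (both illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)                           # fit all components on training data
cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1     # smallest k reaching 95% variance
print(f"keep {k} components ({cumulative[k - 1]:.1%} of the variance)")
```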
LDA (Linear Discriminant Analysis): the supervised variant of PCA. Instead of maximizing total variance, LDA maximizes the ratio of between-class variance to within-class variance. Requires class labels; produces at most $C-1$ components (where $C$ is the number of classes).
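A minimal scikit-learn sketch of the difference in interface: LDA needs labels and caps the number of components at $C-1$ (`X_train`, `y_train`, and `X_test` are assumed to exist):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

n_classes = len(set(y_train))
lda = LinearDiscriminantAnalysis(n_components=min(2, n_classes - 1))
Z_train = lda.fit_transform(X_train, y_train)   # unlike PCA, fitting requires labels
Z_test = lda.transform(X_test)                  # new data reuses the learned projection
```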
t-SNE: Non-linear Neighborhood Preservation
t-Distributed Stochastic Neighbor Embedding constructs a probability distribution over pairs of points in the high-dimensional space (using a Gaussian kernel) and a corresponding distribution in 2D (using a heavier-tailed Student-t kernel). It then minimizes the KL divergence between the two distributions via gradient descent.
The Student-t kernel in the low-dimensional space is key: it prevents the “crowding problem” where moderately distant points in high dimensions get crowded into a small region in 2D. The heavy tail allows distant points to be mapped further apart.
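For reference, the standard definitions (with $\sigma_i$ chosen per point so that the effective neighborhood size matches the perplexity):

$$p_{j|i} = \frac{\exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\lVert \mathbf{x}_i - \mathbf{x}_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert \mathbf{y}_i - \mathbf{y}_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert \mathbf{y}_k - \mathbf{y}_l \rVert^2\right)^{-1}}, \qquad \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$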
What t-SNE preserves: local structure — nearby points in high dimensions tend to remain nearby in the projection. Clusters that exist in high-dimensional space appear as visible clusters in the 2D projection.
What t-SNE does NOT preserve:
- Global distances: the distance between two clusters in a t-SNE plot is not meaningful
- Density: a large, dense cluster and a small, sparse cluster may appear the same size
- Stability: t-SNE embeddings are stochastic (different random seeds → different layouts)
Hyperparameters:
- Perplexity (5–50): roughly the number of effective neighbors per point. Low perplexity = tight local structure; high perplexity = more global. Run with multiple values and compare (see the sketch after this list).
- Learning rate (10–1000): controls convergence speed. The default usually works.
- n_iter (≥250): more iterations = more stable result
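A sketch of a perplexity sweep with scikit-learn's `TSNE` (the perplexity values mirror the ones recommended later; `X` is assumed to be an `(n_samples, n_features)` array):

```python
from sklearn.manifold import TSNE

embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X)   # one 2D layout per perplexity
# Plot each embeddings[perplexity] side by side and compare before drawing conclusions.
```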
Critical warning: never use t-SNE embeddings as features for downstream models. The embedding is not stable across runs, the distances are not meaningful, and the mapping is not invertible. t-SNE is a visualization tool only.
UMAP: Manifold-Based Reduction
Uniform Manifold Approximation and Projection constructs a weighted graph in high-dimensional space (based on k-nearest neighbors) and optimizes a low-dimensional embedding that preserves this graph structure using a cross-entropy loss.
What UMAP preserves: local structure (like t-SNE) plus, to a greater degree than t-SNE, global structure, so the relative positions of clusters are more meaningful.
Practical advantages over t-SNE:
- Faster: scales better to large datasets (roughly linear in practice, vs. $O(n^2)$ for exact t-SNE and $O(n \log n)$ for Barnes–Hut t-SNE)
- More stable: lower variance across runs with the same random seed
- Usable as preprocessing: the embedding is more stable and globally consistent, making UMAP features (cautiously) more useful than t-SNE features for downstream models
- Handles larger output dimensionality naturally: UMAP can embed into more than 2 or 3 components for general-purpose reduction, whereas t-SNE is designed for 2D/3D visualization
When to use UMAP over t-SNE: when you care about the relationships between clusters (not just within-cluster structure), when the dataset is large (>50K points), or when you need a reproducible embedding across multiple runs.
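A minimal sketch with the `umap-learn` package (parameter values are the library's common defaults; `X` and `X_new` are assumed arrays):

```python
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
Z = reducer.fit_transform(X)       # 2D embedding for visualization
Z_new = reducer.transform(X_new)   # unlike t-SNE, a fitted UMAP can project new points
```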
Decision Guide
| Goal | Algorithm |
|---|---|
| Preprocessing, interpretable features | PCA |
| Supervised reduction with class labels | LDA |
| Exploratory visualization, cluster discovery | t-SNE or UMAP |
| Large dataset visualization (>50K points) | UMAP |
| Stable, reproducible embedding | UMAP > t-SNE |
| Understanding global cluster relationships | UMAP |
Common Mistakes
Using t-SNE for preprocessing: t-SNE embeddings should not be fed into a classifier. The distances are non-metric and the embedding is not stable.
Interpreting cluster sizes in t-SNE: cluster sizes and inter-cluster distances in t-SNE are artifacts of the algorithm, not reflections of the actual data density.
Fitting PCA on test data: compute PCA on the training set only and apply the learned transformation to test data. Fitting PCA on test data leaks information.
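A sketch of the leak-free pattern with scikit-learn (`X_train`/`X_test` and the component count are illustrative):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=50)
Z_train = pca.fit_transform(X_train)   # learn the components from training data only
Z_test = pca.transform(X_test)         # reuse them; never call fit on the test set
```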
Choosing perplexity carelessly: t-SNE results depend heavily on perplexity. Always try at least three values (5, 30, 50) and compare the results before drawing conclusions.