Anomaly detection is one of the more frustrating tasks in ML because it straddles two hard problems simultaneously: you’re typically working with imbalanced data (anomalies are rare by definition) and you often don’t have labels for what “anomalous” means.
Most real anomaly detection systems are built from a combination of techniques, evaluated against domain expertise rather than clean ground truth, and iterated based on false positive rates that operators find tolerable. This is normal — the goal is practical usefulness, not theoretical purity.
What Makes a Good Anomaly Detector?
Before choosing a method, be clear about what you’re trying to detect and under what constraints:
- Do you have labels? Even a small set of labeled anomalies enables supervised or semi-supervised approaches that dramatically outperform purely unsupervised ones.
- What’s the data structure? Tabular? Time-series? Graph? Each has different natural methods.
- What’s the false positive tolerance? A fraud alert that’s wrong 50% of the time is useless. An equipment health alert that’s wrong 20% of the time may be acceptable if catching failures is critical.
- How does time factor in? Point anomalies (a single unusual observation) vs. contextual anomalies (normal value at the wrong time) vs. collective anomalies (a subsequence that’s unusual together) are different problems.
Statistical Methods
Z-Score and IQR
The simplest baselines. For each feature, flag observations that are more than $k$ standard deviations from the mean (Z-score) or outside $[Q1 - 1.5 \times IQR, Q3 + 1.5 \times IQR]$ (Tukey’s fence).
Use when: univariate, roughly Gaussian data. Quick sanity check or initial data cleaning. Limitations: breaks down for skewed distributions, doesn’t capture multivariate structure, and treats each feature in isolation.
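Both rules fit in a few lines of NumPy; a minimal sketch on a toy array (the data below is illustrative):

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Flag points more than k standard deviations from the mean."""
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def tukey_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fence)."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

x = np.array([10.0, 11.0, 10.5, 9.8, 10.2, 10.1, 9.9, 55.0])
print(tukey_outliers(x))  # flags only the 55.0
```

Note that on this toy array the Z-score rule at $k = 3$ actually misses the 55.0: the outlier itself inflates the mean and standard deviation (the masking effect), which is one reason the IQR rule is the more robust default.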
Mahalanobis Distance
Generalization of Z-score to multivariate data. Accounts for correlations between features.
$$d_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$
A point is anomalous if its Mahalanobis distance from the distribution center exceeds a threshold.
Use when: multivariate, roughly Gaussian data with correlated features. Works well for fraud detection on financial feature vectors. Limitations: assumes Gaussian distribution. Covariance matrix inversion is numerically unstable in high dimensions. Breaks with non-linear anomaly patterns.
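A minimal sketch with NumPy and SciPy on synthetic correlated data. Under the Gaussian assumption, the squared Mahalanobis distance follows a chi-squared distribution with one degree of freedom per feature, which gives a principled threshold rather than an arbitrary cutoff:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
# Correlated 2-D Gaussian "normal" data (synthetic, for illustration)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=1000)

mu = X.mean(axis=0)
sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, sigma_inv, diff)  # squared Mahalanobis distances

# Under Gaussianity, d^2 ~ chi-squared with df = n_features,
# so a high quantile of that distribution is a principled threshold
threshold = chi2.ppf(0.999, df=X.shape[1])
anomalies = d2 > threshold
```

In higher dimensions, replacing `np.linalg.inv` with a shrinkage estimator (e.g. scikit-learn's Ledoit-Wolf) mitigates the numerical instability mentioned above.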
ML-Based Methods
Isolation Forest
The most practically useful general-purpose anomaly detector for tabular data. The intuition: anomalies are “few and different” — they’re isolated from the bulk of data and can be separated from it with fewer random splits.
The algorithm:
- Build an ensemble of random trees by recursively splitting the data on randomly chosen features at random split values
- For each point, record the average depth at which it gets isolated
- Points that get isolated early (short path length) are anomalies — they’re easy to separate from everything else
Anomaly score = normalized average path length (lower = more anomalous).
```python
from sklearn.ensemble import IsolationForest

clf = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expected fraction of anomalies
    random_state=42
)
labels = clf.fit_predict(X)            # -1 for anomalies, 1 for normal
anomaly_scores = clf.score_samples(X)  # continuous score (lower = more anomalous)
```
Advantages:
- No distributional assumptions
- Scales well: O(n log n) training, O(log n) scoring
- Works on high-dimensional data
- Only hyperparameter that matters: `contamination` (your prior on anomaly rate)
Limitations: detects global anomalies, not contextual ones. If anomalies are unusual only in a specific subspace, Isolation Forest may miss them.
Extended Isolation Forest addresses a known bias in the original (axis-aligned splits create artifacts near certain regions). Use this for high-dimensional data.
Local Outlier Factor (LOF)
Density-based. A point is anomalous if its local density is much lower than its neighbors’. Computes the ratio of average local density of k-nearest neighbors to the point’s own local density.
Use when: anomalies are defined by local context rather than global statistics. A transaction amount that’s typical during evening shopping hours may be anomalous at 3am; with time-of-day as a feature, LOF captures this kind of local density variation. Limitations: slow at inference time (scoring a new point requires a nearest-neighbor search over the training data). Not suitable for streaming or high-throughput applications.
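A sketch with scikit-learn’s `LocalOutlierFactor` on synthetic data: two dense clusters with a couple of isolated points in the sparse gap between them, which is exactly the local-density situation LOF is built for:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
# Two dense clusters, plus two isolated points in the sparse gap between them
X = np.vstack([
    rng.normal(0.0, 0.2, size=(100, 2)),
    rng.normal(5.0, 0.2, size=(100, 2)),
    [[2.5, 2.5], [2.0, 3.0]],
])

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)             # -1 = anomaly, 1 = normal
scores = lof.negative_outlier_factor_   # lower = more anomalous
```

By default scikit-learn’s LOF only scores its training data; pass `novelty=True` to score new points via `score_samples`, which is precisely where the nearest-neighbor inference cost noted above shows up.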
DBSCAN for Anomaly Detection
DBSCAN is a clustering algorithm, but its output includes a special class: noise points — observations that don’t belong to any cluster. These are your anomalies.
```python
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=10)
labels = db.fit_predict(X)
anomalies = X[labels == -1]  # noise points = anomalies
```
Use when: anomalies form no clusters (they’re truly isolated), and you expect dense “normal” regions with anomalies in sparse space between them. Works well for geographic anomalies, sensor data with clear cluster structure.
Limitations: eps (neighborhood radius) is sensitive and hard to tune. Struggles with clusters of varying density. Doesn’t produce anomaly scores — only binary labels.
Autoencoder-Based Detection
For complex, high-dimensional data (images, text, multi-sensor time-series), autoencoders provide a principled anomaly score.
The idea: train an autoencoder to reconstruct normal data. Anomalous data, never seen in training, will be reconstructed poorly. High reconstruction error = anomalous.
```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

    def anomaly_score(self, x):
        with torch.no_grad():
            return ((x - self(x)) ** 2).mean(dim=1)
```
Training: on normal data only (or on all data if labels are unavailable — the autoencoder will learn the dominant pattern, making rare anomalies high-error).
Threshold: set based on reconstruction error distribution. Flag points in the top 1–5% of reconstruction error (depending on your contamination prior).
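That thresholding step is mechanical; a sketch with synthetic reconstruction errors standing in for the per-sample errors that a trained autoencoder’s `anomaly_score` pass would produce:

```python
import numpy as np

# Synthetic per-sample reconstruction errors on mostly-normal held-out data
# (a stand-in for the output of a trained autoencoder's scoring pass)
errors = np.random.default_rng(0).lognormal(mean=-2.0, sigma=0.5, size=10_000)

# Contamination prior of 2%: flag the top 2% of reconstruction error
threshold = np.quantile(errors, 0.98)
flags = errors > threshold
```

In production the threshold would be fit on a held-out reference set, not on the batch being scored, so that a burst of anomalies can’t drag the threshold up with it.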
Variational Autoencoders (VAEs): add a probabilistic bottleneck that provides better-calibrated anomaly scores. More complex but useful when you need probability estimates rather than just rankings.
Time-Series Anomaly Detection
Time-series anomaly detection has additional structure: context matters. An outlier in a time series might be:
- Point anomaly: a single point that’s far from expected
- Contextual anomaly: a value that’s normal in absolute terms but anomalous given recent history (e.g., high sales on a normal Tuesday when Monday was a holiday)
- Collective anomaly: a subsequence that’s anomalous together (an unusual pattern that looks fine point-by-point)
Forecast-Based Methods
Train a forecasting model on historical data. Flag observations where the actual value deviates significantly from the forecast.
```python
from prophet import Prophet

model = Prophet()
model.fit(df_train)                      # df_train: columns 'ds', 'y'
forecast = model.predict(df_test[['ds']])

# Anomaly if actual falls outside the forecast uncertainty interval
df_test['anomaly'] = (
    (df_test['y'] > forecast['yhat_upper'].values) |
    (df_test['y'] < forecast['yhat_lower'].values)
)
```
This naturally handles seasonality and trend. The forecast model captures what’s expected; deviations from expectation are anomalies.
Models for different settings:
- Prophet: seasonal time series, non-stationary, handles holidays
- SARIMA: stationary, well-understood, interpretable
- LSTM autoencoders: complex multi-variate time series with non-linear patterns
Statistical Process Control
For manufacturing and operations: control charts (X-bar, CUSUM, EWMA) have decades of theory behind them and are easily interpretable by domain experts. Often preferable to ML methods when the process is well-understood and the cost of unexplained black-box alerts is high.
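As an illustration, an EWMA chart is only a few lines; this sketch uses the asymptotic control limits $\mu \pm L\sigma\sqrt{\lambda/(2-\lambda)}$, with the in-control mean and sigma assumed to come from a reference period:

```python
import numpy as np

def ewma_chart(x, mu, sigma, lam=0.2, L=3.0):
    """EWMA control chart: returns the EWMA series and out-of-control flags.

    Uses the asymptotic control limits mu +/- L*sigma*sqrt(lam/(2-lam));
    mu and sigma should be estimated from an in-control reference period.
    """
    z = np.empty(len(x), dtype=float)
    prev = mu
    for t, xt in enumerate(x):
        prev = lam * xt + (1 - lam) * prev  # exponentially weighted average
        z[t] = prev
    limit = L * sigma * np.sqrt(lam / (2 - lam))
    return z, np.abs(z - mu) > limit

# A process that drifts upward halfway through (synthetic)
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1.5, 1, 100)])
z, alarms = ewma_chart(x, mu=0.0, sigma=1.0)
```

The chart is easy to explain to operators: the smoothed statistic drifted past a limit derived from normal operating variation, which is a far more legible alert than an opaque model score.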
Setting Thresholds: The Real Challenge
Every method produces a score or a binary prediction. For scores, you need a threshold. Setting it is the most consequential — and least discussed — part of anomaly detection.
Contamination prior: if you have a reasonable estimate of what fraction of your data is anomalous, set the threshold to flag that fraction. Isolation Forest’s contamination parameter does this directly.
Operator feedback loop: start with a loose threshold, show flagged anomalies to domain experts, and tighten or loosen based on their feedback. Label confirmed anomalies and false alarms. After enough feedback, you have labeled data for a supervised approach.
Cost-based thresholding: calculate the cost of false positives (investigating a non-anomaly) vs. false negatives (missing a real anomaly). Set the threshold where expected cost is minimized. This requires explicit cost estimates, which forces the business to think clearly about what the system is for.
Precision-recall tradeoff: plot precision-recall curves across thresholds. Choose the operating point that matches your operational requirements.
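Of these, cost-based thresholding is the easiest to get wrong in words and the easiest to get right in code: it is just a search over candidate thresholds, with the cost values as explicit business assumptions (the numbers below are illustrative):

```python
import numpy as np

def cost_optimal_threshold(scores, labels, c_fp=1.0, c_fn=50.0):
    """Pick the score threshold that minimizes expected cost on labeled data.

    scores: anomaly scores (higher = more anomalous); labels: 1 = anomaly, 0 = normal.
    c_fp / c_fn are the assumed costs of a false alarm / a missed anomaly.
    """
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        fp = np.sum(pred & (labels == 0))   # flagged but actually normal
        fn = np.sum(~pred & (labels == 1))  # missed real anomaly
        cost = c_fp * fp + c_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t, c = cost_optimal_threshold(
    np.array([0.1, 0.2, 0.9, 0.95]), np.array([0, 0, 1, 1])
)
```

The same labeled set drives the precision-recall curve, so in practice both analyses come out of one evaluation pass.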
The Semi-Supervised Shortcut
When you have even a small number of labeled anomalies (10–50), use them:
- Train Isolation Forest or LOF for initial scoring
- Use labeled anomalies as a validation set to calibrate the threshold
- If you have enough, train a binary classifier on the high-scoring points (labeled anomalies + labeled normals from your data)
- The classifier will generalize the anomaly pattern better than unsupervised scoring alone
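A sketch of steps 1–2 with synthetic data; the cluster at mean 4 stands in for the small labeled-anomaly set, and the 90% target recall on known anomalies is an assumed operating point:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(2000, 5))     # unlabeled, mostly normal data
X_anom = rng.normal(4.0, 1.0, size=(20, 5))  # small set of labeled anomalies

# Step 1: unsupervised scoring
clf = IsolationForest(n_estimators=200, random_state=0).fit(X)

# Step 2: calibrate the threshold to catch ~90% of the known anomalies
anom_scores = clf.score_samples(X_anom)      # lower = more anomalous
threshold = np.quantile(anom_scores, 0.9)

flagged = clf.score_samples(X) < threshold
```

Even twenty labels turn threshold-setting from guesswork into a measurable recall target, which is most of the value of the semi-supervised shortcut.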
The paper “Self-Supervise, Refine, Repeat” (SRR) formalizes this: build an ensemble of one-class classifiers, use consensus among them to identify likely normal points, train on the refined data, repeat. This handles the case where anomaly labels are completely unavailable and the anomaly rate is unknown.
Practical Recommendations
| Data type | First try | If it fails |
|---|---|---|
| Tabular, low-dimensional | Z-score + IQR per feature | Isolation Forest |
| Tabular, multivariate | Isolation Forest | Autoencoder or Mahalanobis |
| Time-series, univariate | Prophet-based forecast | CUSUM / EWMA |
| Time-series, multivariate | LSTM autoencoder | Per-feature + clustering |
| High-dimensional (images) | Autoencoder | VAE with probability score |
| With labeled anomalies | Binary classifier | Ensemble with calibration |
Start simple. Isolation Forest handles 80% of tabular anomaly detection cases well. Only move to autoencoders or complex time-series methods when the simpler approaches clearly fail.