Anomaly Detection: A Practical Framework

Anomaly detection is one of the most frustrating problems in machine learning. You’re usually working with imbalanced data (anomalies are rare by definition) and you rarely have labels for what “anomalous” even means. Two hard problems. One system. No clean ground truth.

Most real anomaly detection systems are built from a mix of techniques, evaluated against domain expertise rather than textbook metrics, and refined based on false positive rates that operators can actually tolerate. That’s not a failure of rigor. That’s the job.

What Makes a Good Anomaly Detector?

Before picking a method, get brutally clear on what you’re detecting and under what constraints.

Do you have labels? Even a small set of confirmed anomalies enables supervised or semi-supervised approaches, which typically outperform purely unsupervised ones, often by a wide margin.

What’s the data structure? Tabular? Time series? Graph? Each has its natural methods. Applying the wrong family is a common and expensive mistake.

What’s the false positive tolerance? A fraud alert that’s wrong half the time is useless. An equipment health alert that’s wrong 20% of the time may be perfectly acceptable if catching the failure is worth it. Know the cost before you set the threshold.

How does time factor in? Point anomalies (a single unusual observation), contextual anomalies (a normal value at the wrong time), and collective anomalies (a subsequence that’s unusual as a group) are three distinct problems. Treating them the same is a trap.

Statistical Methods

Z-Score and IQR

The simplest baselines. For each feature, flag observations beyond k standard deviations from the mean (Z-score) or outside Tukey’s fence based on the interquartile range.

Use when: univariate, roughly Gaussian data. Excellent for quick sanity checks and initial data cleaning.

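Limitations: collapses under skewed distributions, and because each feature is analyzed on its own, it ignores multivariate structure. The mean and standard deviation are also distorted by the very outliers you’re looking for. Don’t stop here.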
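Both baselines fit in a few lines of NumPy. The sketch below is illustrative only: the function names, the k=3 and k=1.5 defaults, and the toy readings are assumptions, not values from any particular system.

```python
import numpy as np

def zscore_flags(x, k=3.0):
    """Flag values more than k standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.abs(z) > k

def iqr_flags(x, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

# Toy data (assumed): 20 ordinary readings plus one planted spike at the end.
readings = np.array([10.0, 10.2, 9.8, 10.1, 9.9] * 4 + [25.0])
z_flags = zscore_flags(readings)
tukey_flags = iqr_flags(readings)
```

On this toy series both rules flag only the final reading. Note that the spike inflates the standard deviation used by the Z-score, which is one reason the IQR fence tends to be the more robust of the two.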

Mahalanobis Distance

Generalizes the Z-score to multivariate data. Accounts for correlations between features.

$$d_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$$

A point is anomalous if its Mahalanobis distance from the distribution center exceeds a threshold.

Use when: multivariate, roughly Gaussian data with correlated features. Works well for fraud detection on financial feature vectors.

Limitations: assumes Gaussian distribution. Covariance matrix inversion is numerically unstable in high dimensions. Breaks with nonlinear anomaly patterns.
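A minimal NumPy sketch of the formula above, on synthetic correlated data. The helper name and the data are assumptions for illustration; the pseudo-inverse is one possible way to soften the instability of inverting a near-singular covariance matrix, not the only one.

```python
import numpy as np

def mahalanobis_distances(X):
    """Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # pinv instead of inv: more forgiving when cov is close to singular.
    cov_inv = np.linalg.pinv(cov)
    diff = X - mu
    # Row-wise quadratic form (x - mu)^T Sigma^{-1} (x - mu).
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
# Synthetic data (assumed): two strongly correlated features.
normal = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)
# Each coordinate is only 2 standard deviations out, but the pair
# violates the positive correlation between the features.
odd = np.array([[2.0, -2.0]])
X = np.vstack([normal, odd])
d = mahalanobis_distances(X)
```

The planted point would pass a per-feature Z-score check at k=3, yet it gets the largest Mahalanobis distance in the sample because it breaks the correlation structure.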

Machine Learning-Based Methods

Isolation Forest

The most practically useful general-purpose anomaly detector for tabular data. The core intuition: anomalies are few and different. They’re isolated from the bulk of the data and can be separated from it with fewer random splits.

The algorithm:

Build an ensemble of random trees by recursively splitting features at random split points
For each point, record the average depth at which it gets isolated
Points isolated early (short path length) are anomalies - they’re easy to separate from everything else

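Anomaly score: derived from the average path length, normalized by the path length expected for the sample size. In scikit-learn’s convention (score_samples), a lower score means more anomalous; the score defined in the original paper runs the other way.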
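To make the normalization concrete, here is a sketch of the score as defined in the original Isolation Forest paper: s = 2^(-E[h] / c(n)), where E[h] is the mean isolation depth and c(n) is the average path length of an unsuccessful binary search tree lookup over n points. The function names and the example depths are assumptions for illustration.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def paper_anomaly_score(mean_depth, n):
    """s = 2^(-E[h] / c(n)): near 1 is anomalous, well below 0.5 is normal."""
    return 2.0 ** (-mean_depth / average_path_length(n))

n = 256  # subsample size per tree (the paper's suggested default)
shallow = paper_anomaly_score(2.0, n)                     # isolated quickly
typical = paper_anomaly_score(average_path_length(n), n)  # average depth
deep = paper_anomaly_score(15.0, n)                       # hard to isolate
```

A point isolated at exactly the expected depth scores 0.5; shallower points move toward 1. scikit-learn's score_samples returns the negative of this quantity, which is why lower means more anomalous there.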

from sklearn.ensemble import IsolationForest

clf = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expected fraction of anomalies
    random_state=42
)
labels = clf.fit_predict(X)  # X: array of shape (n_samples, n_features); -1 = anomaly, 1 = normal
anomaly_scores = clf.score_samples(X)  # continuous score; lower = more anomalous

Advantages:

No distributional assumptions
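Scales well: each tree is built on a small subsample, so training costs roughly O(t s log s) for t trees and subsample size s, and scoring one point costs about O(t log s)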
Works on high-dimensional data
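Main hyperparameter to set: contamination (your prior on anomaly rate). In scikit-learn it only moves the cutoff for the -1/1 labels; the continuous scores from score_samples are unaffected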

Limitations: works on global anomalies, not contextual ones. If anomalies are only unusual in a specific subspace, Isolation Forest may miss them.

Extended Isolation Forest addresses a known bias in the original (axis-aligned splits create artifacts near certain regions). Use it for high-dimensional data.

Local Outlier Factor (LOF)

Density-based. A point is anomalous if its local density is much lower than its neighbors’. Computes the ratio of average local density of k nearest neighbors to the point’s own local density.

Use when: anomalies are defined by local context rather than global statistics. A transaction that’s normal in the evening is anomalous at 3am. LOF captures this kind of local density variation.

Limitations: slow at inference time (requires recomputing nearest neighbors for each new point). Not suitable for streaming or high-throughput applications.
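A short scikit-learn sketch of the local-density idea, using synthetic data with two clusters of very different density. The data, the planted point, and n_neighbors=20 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# Synthetic data (assumed): one tight cluster and one loose cluster.
tight = rng.normal(loc=0.0, scale=0.1, size=(200, 2))
loose = rng.normal(loc=5.0, scale=1.0, size=(200, 2))
# Close to the tight cluster in absolute terms, but far away relative
# to how densely that cluster is packed.
local_outlier = np.array([[0.8, 0.8]])
X = np.vstack([tight, loose, local_outlier])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
lof_scores = -lof.negative_outlier_factor_   # about 1 = inlier, larger = more anomalous
```

The planted point sits closer to its neighbors than many points in the loose cluster do, so a single global distance cutoff would struggle with it; LOF flags it because its density is low relative to the tight cluster next to it. To score points that were not in the fitting set, scikit-learn expects novelty=True at construction time.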

DBSCAN for Anomaly Detection

DBSCAN is a clustering algorithm, but its output includes a special class: noise points - observations that don’t belong to any cluster. Those are your anomalies.

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=10)
labels = db.fit_predict(X)
anomalies = X[labels == -1]  # noise points = anomalies

Use when: anomalies form no clusters (they’re truly isolated), and you expect dense “normal” regions with anomalies scattered in sparse space between them. Works well for geographic anomalies and sensor data with clear cluster structure.

Limitations: eps (neighborhood radius) is sensitive and hard to tune. Struggles with clusters of varying density. Produces binary labels, not anomaly scores.
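One common heuristic for the eps problem is to look at the sorted k-nearest-neighbor distances and pick a value near the point where the curve bends sharply upward. The sketch below computes that curve with plain NumPy; the helper name, the choice k=10, and the synthetic data are assumptions, and the pairwise distance matrix makes it suitable only for small samples.

```python
import numpy as np

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    # Full pairwise Euclidean distances: O(n^2) memory.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0).
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)

rng = np.random.default_rng(0)
# Synthetic data (assumed): one dense cluster plus five isolated points.
cluster = rng.normal(loc=0.0, scale=0.3, size=(300, 2))
isolated = np.array([[5.0, 5.0], [-5.0, 4.0], [6.0, -6.0], [-4.0, -5.0], [0.0, 7.0]])
X = np.vstack([cluster, isolated])

kd = k_distances(X, k=10)
typical_kd = float(np.median(kd))
largest_kd = float(kd[-1])
```

Plotting kd usually shows a long flat stretch (points inside clusters) followed by a steep rise (isolated points). An eps just above the flat stretch is a reasonable starting guess, to be checked against what ends up labeled as noise.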

Autoencoder-Based Detection

For complex, high-dimensional data - images, text, multi-sensor time series - autoencoders provide a principled anomaly score.

The logic: train an autoencoder to reconstruct normal data. Anomalous data, never seen during training, gets reconstructed poorly. High reconstruction error means anomalous.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )
    
    def forward(self, x):
        return self.decoder(self.encoder(x))
    
    def anomaly_score(self, x):
        # Per-sample mean squared reconstruction error; higher = more anomalous
        with torch.no_grad():
            return ((x - self(x))**2).mean(dim=1)

Training: on normal data only. If labels are unavailable, train on all data - the autoencoder will learn the dominant pattern, making rare anomalies stand out as high-error outliers.

Threshold: set based on reconstruction error distribution. Flag points in the top 1 to 5% depending on your contamination prior.
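The thresholding rule amounts to a quantile cutoff on the error distribution. A minimal sketch, assuming the per-sample errors are already computed; the gamma-distributed errors below are a synthetic stand-in, and the helper name is hypothetical.

```python
import numpy as np

def threshold_from_contamination(scores, contamination):
    """Cutoff that flags roughly the top `contamination` fraction of scores."""
    return np.quantile(scores, 1.0 - contamination)

rng = np.random.default_rng(0)
# Synthetic stand-in for reconstruction errors (real ones come from the model).
errors = np.concatenate([
    rng.gamma(shape=2.0, scale=0.05, size=980),        # bulk of small errors
    rng.gamma(shape=2.0, scale=0.05, size=20) + 1.0,   # a few large errors
])
cutoff = threshold_from_contamination(errors, contamination=0.02)
flags = errors > cutoff
```

With a 2% contamination prior on 1,000 samples, the cutoff lands between the bulk and the 20 large-error points. On real data the gap is rarely this clean, so it is worth plotting the error histogram before committing to a percentile.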

Variational Autoencoders (VAEs): add a probabilistic bottleneck, which lets you score points by an approximate likelihood instead of raw reconstruction error and can give better-calibrated anomaly scores. More complex, but useful when you need probability estimates rather than just rankings.

Time-Series Anomaly Detection

Time-series anomaly detection has additional structure: context matters. An outlier in a time series might be:

A point anomaly: a single point far from expected
A contextual anomaly: a value that’s normal in absolute terms but anomalous given recent history
A collective anomaly: a subsequence that’s anomalous as a group even if each point looks fine individually

Forecast-Based Methods

Train a forecasting model on historical data. Flag observations where the actual value deviates significantly from the forecast.

from prophet import Prophet

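model = Prophet()  # default uncertainty interval is 80%; raise interval_width to flag fewer points
model.fit(df_train)  # expects columns 'ds' (timestamp) and 'y' (value)
forecast = model.predict(df_test)

# Anomaly if actual > upper bound or < lower bound.
# Compare as arrays: forecast has its own 0..n-1 index, which may not match df_test's.
# Assumes df_test is sorted by 'ds', the order in which forecast rows are returned.
df_test['anomaly'] = (
    (df_test['y'].to_numpy() > forecast['yhat_upper'].to_numpy()) |
    (df_test['y'].to_numpy() < forecast['yhat_lower'].to_numpy())
)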

The forecast model captures what’s expected. Deviations from expectation are anomalies. Seasonality and trend are handled automatically.

Models for different settings:

Prophet: seasonal time series, nonstationary, handles holidays
SARIMA: series that are stationary after (seasonal) differencing, well understood, interpretable
LSTM autoencoders: complex multivariate time series with nonlinear patterns

Statistical Process Control

For manufacturing and operations: control charts (X-bar, CUSUM, EWMA) have decades of theory behind them and are easily interpretable by domain experts. Often preferable to machine learning methods when the process is well understood and the cost of unexplained black-box alerts is high.
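As one example of how little machinery a control chart needs, here is a sketch of a two-sided tabular CUSUM. The allowance k=0.5 and decision interval h=5 (both in units of sigma) are common textbook starting points rather than recommendations for any specific process, and the in-control target and sigma are assumed known.

```python
import numpy as np

def cusum_alarms(x, target, sigma, k=0.5, h=5.0):
    """Two-sided tabular CUSUM; returns the indices where an alarm fires."""
    slack, limit = k * sigma, h * sigma
    hi = lo = 0.0
    alarms = []
    for i, value in enumerate(x):
        # Accumulate deviations beyond the allowance, never dropping below zero.
        hi = max(0.0, hi + (value - target) - slack)
        lo = max(0.0, lo - (value - target) - slack)
        if hi > limit or lo > limit:
            alarms.append(i)
            hi = lo = 0.0  # restart the sums after signalling
    return alarms

rng = np.random.default_rng(0)
# Synthetic process (assumed): in control for 200 samples, then a 1.5-sigma shift.
in_control = rng.normal(10.0, 1.0, size=200)
shifted = rng.normal(11.5, 1.0, size=100)
series = np.concatenate([in_control, shifted])

alarms = cusum_alarms(series, target=10.0, sigma=1.0)
alarms_after_shift = [a for a in alarms if a >= 200]
```

Because CUSUM accumulates small deviations, it picks up a sustained shift that no single point would trigger on its own. Each alarm is also easy to explain: the running sum of deviations from target crossed a fixed limit.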

Setting Thresholds: The Real Challenge

Every method produces a score or a binary prediction. For scores, you need a threshold. Setting it is the most consequential and least discussed part of anomaly detection.

Contamination prior: if you have a reasonable estimate of what fraction of your data is anomalous, set the threshold to flag that fraction. Isolation Forest’s contamination parameter does this directly.

Operator feedback loop: start with a loose threshold, show flagged anomalies to domain experts, and tighten or loosen based on their feedback. Label confirmed anomalies and false alarms. After enough feedback, you have labeled data for a supervised approach.

Cost-based thresholding: calculate the cost of false positives (investigating a nonanomalous event) versus false negatives (missing a real anomaly). Set the threshold where expected cost is minimized. This requires explicit cost estimates, which forces the business to think clearly about what the system is actually for.
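A minimal sketch of cost-based thresholding on a small labeled validation set. The scores, labels, and cost figures are made up for illustration, and the helper name is hypothetical; scores are assumed to be oriented so that higher means more anomalous.

```python
import numpy as np

def min_cost_threshold(scores, is_anomaly, cost_fp, cost_fn):
    """Threshold (flag if score >= t) with the lowest total cost on labeled data."""
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    best_t, best_cost = None, np.inf
    # Try every distinct score, plus "flag nothing".
    for t in np.append(np.unique(scores), np.inf):
        flagged = scores >= t
        fp = np.sum(flagged & ~is_anomaly)   # false alarms
        fn = np.sum(~flagged & is_anomaly)   # missed anomalies
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost

# Made-up validation data: higher score = more anomalous.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1], dtype=bool)

# Misses are expensive: accept false alarms to catch everything.
t_miss_costly, cost_miss_costly = min_cost_threshold(scores, labels, cost_fp=1, cost_fn=10)
# False alarms are expensive: accept one miss to keep alerts clean.
t_alarm_costly, cost_alarm_costly = min_cost_threshold(scores, labels, cost_fp=10, cost_fn=1)
```

The same detector lands on a threshold of 0.4 when a miss costs ten times a false alarm, and 0.7 when the costs are reversed. The model did not change; only the stated costs did.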

Precision-recall tradeoff: plot precision-recall curves across thresholds. Choose the operating point that matches your operational requirements.
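The curve can be computed by hand to show what choosing an operating point means in practice. This is a sketch with made-up scores and labels; the helper names and the 0.9 precision floor are assumptions. scikit-learn's precision_recall_curve does the sweep for you on real data.

```python
import numpy as np

def precision_recall_by_threshold(scores, is_anomaly):
    """(threshold, precision, recall) when flagging score >= t, per distinct score."""
    scores = np.asarray(scores, dtype=float)
    is_anomaly = np.asarray(is_anomaly, dtype=bool)
    rows = []
    for t in np.unique(scores):
        flagged = scores >= t  # never empty: t is one of the scores
        tp = np.sum(flagged & is_anomaly)
        rows.append((float(t), float(tp / flagged.sum()), float(tp / is_anomaly.sum())))
    return rows

def best_recall_at_precision(rows, min_precision):
    """Among thresholds meeting the precision floor, pick the highest recall."""
    ok = [r for r in rows if r[1] >= min_precision]
    return max(ok, key=lambda r: r[2]) if ok else None

# Made-up validation data: higher score = more anomalous.
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
labels = np.array([0, 0, 0, 1, 0, 0, 1, 1, 1], dtype=bool)

rows = precision_recall_by_threshold(scores, labels)
operating_point = best_recall_at_precision(rows, min_precision=0.9)
```

Here the requirement "at least 90% of alerts must be real" selects a threshold of 0.7, which catches three of the four labeled anomalies. Stating the requirement as a precision floor (or a recall floor) makes the tradeoff something operators can sign off on.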

The Semi-Supervised Shortcut

When you have even a small number of labeled anomalies (10 to 50), use them.

Train Isolation Forest or LOF for initial scoring
Use labeled anomalies as a validation set to calibrate the threshold
If you have enough, train a binary classifier on high-scoring points (labeled anomalies plus labeled normals from your data)
The classifier will generalize the anomaly pattern better than unsupervised scoring alone
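The calibration step in the list above can be sketched as follows: pick the highest cutoff that still catches a target share of the confirmed anomalies, then check what alert volume that implies. The synthetic scores, the ten "confirmed" rows, and the 90% target are all assumptions; scores are oriented so that higher means more anomalous (negate scikit-learn's score_samples to get this orientation).

```python
import numpy as np

def calibrate_threshold(scores, known_anomaly_idx, target_recall=0.9):
    """Highest cutoff that still flags at least target_recall of known anomalies."""
    known = np.sort(np.asarray(scores, dtype=float)[known_anomaly_idx])[::-1]
    # Small epsilon guards against floating-point overshoot before ceil.
    n_needed = max(1, int(np.ceil(target_recall * len(known) - 1e-9)))
    return known[n_needed - 1]

rng = np.random.default_rng(0)
# Synthetic unsupervised scores (assumed): higher = more anomalous.
scores = rng.normal(0.0, 1.0, size=1000)
known_idx = np.arange(10)    # pretend the first 10 rows are confirmed anomalies
scores[known_idx] += 4.0     # and that the detector tends to score them higher

cutoff = calibrate_threshold(scores, known_idx, target_recall=0.9)
flagged = scores >= cutoff
recall_on_known = float(flagged[known_idx].mean())
alert_rate = float(flagged.mean())
```

With only 10 to 50 labels the recall estimate is noisy, so treat the resulting cutoff as a starting point and revisit it as operators confirm or reject alerts.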

The “Self-supervise, Refine, Repeat” (SRR) approach applies a similar refine-and-retrain loop without any labels: build an ensemble of one-class classifiers, use consensus among them to identify likely normal points, train on the refined data, repeat. This handles the case where anomaly labels are completely unavailable and the anomaly rate is unknown.

Practical Recommendations

Data type	First try	If it fails
Tabular, low dimensional	Z-score + IQR per feature	Isolation Forest
Tabular, multivariate	Isolation Forest	Autoencoder or Mahalanobis
Time series, univariate	Prophet-based forecast	CUSUM / EWMA
Time series, multivariate	LSTM autoencoder	Per-feature + clustering
High dimensional (images)	Autoencoder	VAE with probability score
With labeled anomalies	Binary classifier	Ensemble with calibration

Start simple. Isolation Forest handles 80% of tabular anomaly detection cases well. Only escalate to autoencoders or complex time-series methods when the simpler approaches clearly fail. The complexity isn’t impressive if the simple method already works.