Probability Calibration
A model that outputs class labels is often insufficient in practice. Risk systems, recommendation engines, and decision-support tools need calibrated probabilities — where “80% confidence” actually means the prediction is correct 80% of the time.
Most classifiers are not calibrated by default. Tree-based models tend to push probabilities toward 0 and 1. SVMs output margins, not probabilities. Neural networks can be overconfident. A post-hoc calibration step corrects this.
The Calibration Procedure
- Train the primary model on $D_\text{train}$
- Generate predictions $\hat{y}_i$ on $D_\text{cv}$ (a held-out calibration set, never the training set)
- Sort predictions $(\hat{y}_i, y_i)$ in increasing order of $\hat{y}_i$
- Bin into $k$ chunks of size $m$; compute $\hat{y}_\text{mean}^j$ (mean predicted probability) and $y_\text{mean}^j$ (mean actual label) for each bin
- Plot $y_\text{mean}^j$ vs. $\hat{y}_\text{mean}^j$ — this is the calibration plot
- Train a calibration function to map predicted probabilities to actual frequencies
A perfectly calibrated model traces the diagonal in the calibration plot. Deviations above the diagonal mean the model is underconfident (actual frequency exceeds predicted probability); deviations below mean it is overconfident.
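As a minimal sketch of steps 2 through 5, scikit-learn's calibration_curve performs the sort-bin-average computation directly (the random forest, split size, and bin count below are illustrative choices, not part of the procedure):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, random_state=0)
# Hold out a calibration set -- never calibrate on the training data
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_hat = model.predict_proba(X_cv)[:, 1]  # predicted P(y=1) on D_cv

# strategy="quantile" sorts and bins into equal-count chunks (steps 3-5)
y_mean, y_hat_mean = calibration_curve(y_cv, y_hat, n_bins=10, strategy="quantile")
for p_hat, p in zip(y_hat_mean, y_mean):
    print(f"predicted {p_hat:.2f} -> actual {p:.2f}")
```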
Calibration Methods
Platt Scaling: fits a logistic sigmoid to the calibration data. Works when the calibration plot looks sigmoid-shaped (typical for SVMs). Fast and requires little data.
Isotonic Regression: fits a non-parametric, non-decreasing (monotone) function. Works in nearly all cases but requires more calibration data than Platt scaling. The correct choice when the calibration curve is not sigmoid-shaped.
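Step 6 of the procedure, training the calibration function itself, is a short fit in scikit-learn; this sketch maps the raw probabilities y_hat from the snippet above to calibrated ones (out_of_bounds="clip" is an assumption for handling values outside the fitted range):

```python
from sklearn.isotonic import IsotonicRegression

# Step 6: learn a monotone map from predicted probability to actual frequency
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(y_hat, y_cv)             # y_hat, y_cv from the sketch above
calibrated = iso.predict(y_hat)  # calibrated probabilities
```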
In scikit-learn: CalibratedClassifierCV wraps any classifier and applies either method via cross-validation.
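A hedged usage sketch (the LinearSVC base model, cv=5, and the sigmoid method are illustrative; in older scikit-learn versions the first parameter is named base_estimator rather than estimator, so it is passed positionally here):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, random_state=0)

# method="sigmoid" is Platt scaling; method="isotonic" is isotonic regression
svm = LinearSVC()  # outputs margins, not probabilities
calibrated = CalibratedClassifierCV(svm, method="sigmoid", cv=5).fit(X_train, y_train)
proba = calibrated.predict_proba(X_cv)[:, 1]  # calibrated P(y=1)
```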
RANSAC: Robust Model Fitting
Standard regression minimizes loss over all points — which means a small number of outliers can badly corrupt the fitted model. RANSAC (Random Sample Consensus) is an iterative algorithm that explicitly identifies and excludes outliers before fitting.
The Procedure
- Sample a random subset $D_0$ from $D_\text{train}$; fit model $M_0$ on $D_0$
- Compute prediction errors for all points in $D_\text{train}$ using $M_0$
- Flag as outliers all points with error above a threshold; call this set $O_0$
- Create the filtered set $D_1 = D_\text{train} \setminus O_0$; fit model $M_1$ on $D_1$
- Repeat until consecutive models $M_i$ and $M_{i+1}$ are nearly identical
The result is a model fit on the inlier population — unaffected by the outlier minority regardless of how large their errors are.
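Here is a minimal sketch of that loop, assuming a linear model and a fixed absolute-error threshold (the subset size, threshold, and convergence test are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def iterative_outlier_refit(X, y, threshold, max_iters=20, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: fit M_0 on a random subset D_0
    idx = rng.choice(len(X), size=max(2, len(X) // 10), replace=False)
    model = LinearRegression().fit(X[idx], y[idx])
    for _ in range(max_iters):
        # Steps 2-3: errors on all of D_train; flag points above the threshold
        errors = np.abs(model.predict(X) - y)
        inliers = errors <= threshold
        # Step 4: refit on the filtered data (D_train minus the outlier set)
        new_model = LinearRegression().fit(X[inliers], y[inliers])
        # Step 5: stop when consecutive models are nearly identical
        if (np.allclose(new_model.coef_, model.coef_, atol=tol)
                and abs(new_model.intercept_ - model.intercept_) < tol):
            return new_model
        model = new_model
    return model

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.3, size=100)
y[:10] += 25.0  # inject systematic outliers
print(iterative_outlier_refit(X, y, threshold=1.0).coef_)  # approximately [3.]
```

scikit-learn's RANSACRegressor implements the classical variant, which repeats the random-subset step many times and keeps the model with the largest consensus set.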
RANSAC is standard practice in computer vision (fitting homographies and fundamental matrices) but applies to any regression problem where the training data contains systematic outliers — sensor failures, annotation errors, or distributional contamination.
The Loss Minimization Framework
Every supervised learning algorithm can be written in the same form:
$$\min_{w} \sum_i L(y_i, f(x_i; w)) + \lambda \cdot R(w)$$
where $L$ is the loss function and $R$ is a regularizer. The choice of $L$ determines the algorithm family:
| Loss Function | Algorithm | Property |
|---|---|---|
| Squared error: $(y_i - \hat{y}_i)^2$ | Linear regression | Sensitive to outliers |
| Logistic loss: $\log(1 + e^{-y_i \hat{y}_i})$ | Logistic regression | Smooth approximation to 0-1 loss |
| Hinge loss: $\max(0, 1 - y_i(w^T x_i + b))$ | SVM | Creates margin; sparse solutions |
| Absolute error: $\lvert y_i - \hat{y}_i \rvert$ | Robust (LAD) regression | Robust to outliers |
Hinge loss has a geometric interpretation: it is zero for correctly classified points outside the margin, and grows linearly for points that violate the margin. The SVM objective is just hinge loss plus an $L_2$ regularizer on $w$.
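A small worked example, assuming labels in $\{-1, +1\}$ and raw scores $w^T x + b$:

```python
import numpy as np

def hinge_loss(y, score):
    # Zero outside the margin; grows linearly with the margin violation
    return np.maximum(0.0, 1.0 - y * score)

y     = np.array([+1,  +1,   -1,   -1])
score = np.array([2.5, 0.3, -0.1, -3.0])
print(hinge_loss(y, score))  # [0.  0.7 0.9 0. ]
# 2.5:  correct and outside the margin -> zero loss
# 0.3:  correct but inside the margin  -> 0.7
# -0.1: correct (y=-1) but inside the margin -> 0.9
# -3.0: correct and outside the margin -> zero loss
```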
Why this framework matters: when you see a new ML paper, the algorithm is almost always a specific combination of loss function + regularizer + optimization procedure. Understanding the framework means you can immediately understand what the algorithm optimizes, what its failure modes are, and how it relates to existing methods.
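To make this concrete, here is a hedged sketch of the framework as code: a generic gradient-descent minimizer where swapping the loss gradient swaps the algorithm (squared error plus the $L_2$ regularizer recovers ridge regression; the step size, $\lambda$, and synthetic data are illustrative):

```python
import numpy as np

def fit(X, y, loss_grad, lam=0.01, lr=0.1, steps=2000):
    """Minimize mean loss plus lam * ||w||^2 by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * (loss_grad(w, X, y) + 2 * lam * w)  # gradient of lam * ||w||^2
    return w

def squared_error_grad(w, X, y):
    # Gradient of (1/n) * sum_i (y_i - x_i^T w)^2 -- the linear regression loss
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
print(fit(X, y, squared_error_grad))  # roughly [1, -2, 0.5], slightly shrunk
```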
Bias-Variance and the Overfit-Underfit Exception
The standard statement: overfitting → high variance, underfitting → high bias.
The exception worth knowing: if the data has a strong majority class (e.g., 95% zeros in a fraud detection dataset), a model that always predicts 0 has low variance (it is deterministic) and low bias on the majority class, yet it never predicts the minority class at all. In this case, overfitting to the majority manifests as high bias: the model learns the wrong pattern confidently.
This is why accuracy is a misleading metric for imbalanced classification problems. A model with 95% accuracy on 95% majority data might be doing nothing useful. Use balanced accuracy, F1, or AUC instead.
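A quick numeric check of this claim, using the 95/5 split from above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 95 + [1] * 5)  # 95% majority class
y_pred = np.zeros(100, dtype=int)      # a model that always predicts 0

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks strong
print(balanced_accuracy_score(y_true, y_pred))  # 0.5  -- chance level
```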