Loss functions and evaluation metrics are distinct things that get conflated constantly. The loss function is what the optimizer minimizes during training. The evaluation metric is what you actually care about. They should be related — but they often aren’t identical, because real-world objectives aren’t always differentiable.
Understanding when to use which matters more than memorizing the formulas.
Regression Losses
Mean Squared Error (MSE)
$$L = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$
The default for regression. Penalizes large errors quadratically: a prediction that's 10 units off contributes 100× as much to the loss as a prediction that's 1 unit off. This makes MSE sensitive to outliers.
Use when: outlier errors are genuinely bad and you want the model to try hard to avoid them. Sales forecasting, demand prediction, structural load calculations.
Don’t use when: your target has heavy-tailed noise that produces outliers you don’t care about correcting.
A property worth knowing: minimizing MSE is equivalent to predicting the conditional mean $E[y|x]$.
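This is easy to check numerically for constant predictions. A minimal NumPy sketch (the target values are made up for illustration): a grid search over constant predictions lands on the sample mean.

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 7.0, 100.0])  # illustrative targets, one large value

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Grid search over constant predictions: the minimizer lands at the sample mean.
candidates = np.linspace(y.min(), y.max(), 10_001)
best = candidates[np.argmin([mse(y, c) for c in candidates])]
print(best, y.mean())  # both ~23.4; the single outlier drags the optimum upward
```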
Mean Absolute Error (MAE)
$$L = \frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$$
Linear penalty: errors contribute in proportion to their size, so outliers have less influence than under MSE. Not differentiable at zero, and its gradient has constant magnitude, which can cause optimization to oscillate near the optimum.
Use when: outliers are mostly noise you want to ignore, the median is the right prediction target, or you want an error measure in the same units as the target (easy to interpret).
Mathematically: minimizing MAE is equivalent to predicting the conditional median.
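The same grid-search check as above, but with MAE: the minimizer now sits at the median and ignores the outlier (again a sketch with invented values).

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 7.0, 100.0])  # same illustrative targets

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

candidates = np.linspace(y.min(), y.max(), 10_001)
best = candidates[np.argmin([mae(y, c) for c in candidates])]
print(best, np.median(y), y.mean())  # ~5.0, 5.0, 23.4: the MAE optimum ignores the outlier
```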
Huber Loss
$$L_\delta = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & |y - \hat{y}| > \delta \end{cases}$$
Quadratic for small errors (like MSE), linear for large errors (like MAE). Differentiable everywhere. The $\delta$ hyperparameter controls the transition.
Use when: you want outlier robustness but also smooth gradients. The best default choice for regression when you’re uncertain about outlier behavior.
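A direct NumPy translation of the piecewise definition above (the function name and example values are just illustrative):

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it; matches the piecewise formula."""
    err = y_true - y_pred
    abs_err = np.abs(err)
    quadratic = 0.5 * err ** 2
    linear = delta * abs_err - 0.5 * delta ** 2
    return np.mean(np.where(abs_err <= delta, quadratic, linear))

# A 7-unit miss contributes 6.5 (linear regime), not 24.5 as it would under MSE.
print(huber(np.array([1.0, 2.0, 10.0]), np.array([1.2, 2.1, 3.0]), delta=1.0))
```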
Quantile Loss
$$L_q = \begin{cases} q(y - \hat{y}) & y \geq \hat{y} \\ (1-q)(\hat{y} - y) & y < \hat{y} \end{cases}$$
Predicts the $q$-th quantile of the distribution. Set $q = 0.9$ to predict the 90th percentile. Set $q = 0.5$ to predict the median (equivalent to MAE).
Use when: asymmetric costs — overpredicting is worse than underpredicting or vice versa. Safety stock calculations (you want the 95th percentile of demand, not the mean). Prediction intervals (train two models at $q = 0.1$ and $q = 0.9$).
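A minimal pinball-loss sketch. With $q = 0.9$, under-predicting costs nine times as much per unit as over-predicting, so the optimal constant prediction is pushed toward the 90th percentile (names and numbers here are illustrative):

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.9):
    """Pinball loss: under-prediction costs q per unit of error, over-prediction costs 1 - q."""
    err = y_true - y_pred
    return np.mean(np.where(err >= 0, q * err, (q - 1) * err))

demand = np.array([80.0, 95.0, 100.0, 120.0, 200.0])
print(quantile_loss(demand, np.full(5, 100.0), q=0.9))  # under-predicting the spike is expensive
print(quantile_loss(demand, np.full(5, 170.0), q=0.9))  # stocking high costs far less
```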
Classification Losses
Binary Cross-Entropy
$$L = -\frac{1}{n}\sum_{i=1}^n [y_i \log p_i + (1-y_i)\log(1-p_i)]$$
The standard loss for binary classification. Derived from maximizing the likelihood under a Bernoulli model. When the true label is 1, penalizes low predicted probability. When the true label is 0, penalizes high predicted probability.
Behaves well throughout training — large gradients when the model is very wrong, small gradients when it’s right. Logistic regression directly minimizes this.
Use when: standard binary classification. Almost always the right choice.
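A small NumPy sketch of the formula, with clipping to keep the logarithm finite (a standard numerical precaution; the eps value is an illustrative choice):

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative Bernoulli log-likelihood; clip probabilities to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
print(binary_cross_entropy(y, np.array([0.9, 0.8, 0.2, 0.1])))   # ~0.16: confident and right
print(binary_cross_entropy(y, np.array([0.1, 0.2, 0.8, 0.9])))   # ~1.96: confident and wrong
```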
Categorical Cross-Entropy
$$L = -\frac{1}{n}\sum_{i=1}^n \sum_c y_{ic} \log p_{ic}$$
Extends binary cross-entropy to multiple classes. The predicted output is a probability distribution over classes (softmax output). For one-hot targets it equals the KL divergence between the true distribution and the predicted distribution, since a one-hot distribution has zero entropy.
Use when: multi-class classification where each example belongs to exactly one class.
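A sketch with one-hot targets and softmax-style predicted rows (the example values are invented):

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    """Per the formula above: average over examples of -sum_c y_ic * log(p_ic)."""
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y = np.eye(3)[[0, 2]]                           # two examples, true classes 0 and 2
p = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])                 # rows sum to 1 (softmax outputs)
print(categorical_cross_entropy(y, p))          # only the probability of the true class matters
```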
Focal Loss
$$L_{FL} = -\alpha_t(1-p_t)^\gamma \log(p_t)$$
A modification of cross-entropy that down-weights easy examples (high $p_t$) and focuses training on hard ones. The $\gamma$ parameter controls how aggressively easy examples are discounted.
Use when: severe class imbalance (e.g., object detection where most anchors are background). Standard cross-entropy gets dominated by easy negatives; focal loss keeps the model learning from the minority class.
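A binary-classification sketch of the idea; $\gamma = 2$ and $\alpha = 0.25$ are common choices rather than requirements:

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-12):
    """Cross-entropy scaled by (1 - p_t)^gamma, so confidently-correct (easy) examples contribute little."""
    p = np.clip(p_pred, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)            # probability assigned to the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t))

# An easy negative (p_t = 0.99) is down-weighted by (1 - 0.99)^2 = 1e-4 relative to plain CE.
y = np.array([0, 0, 0, 1])
print(focal_loss(y, np.array([0.01, 0.02, 0.01, 0.3])))
```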
Hinge Loss
$$L = \max(0, 1 - y \cdot \hat{y})$$
The SVM loss, with labels $y \in \{-1, +1\}$ and $\hat{y}$ a raw decision score. Zero loss (and zero gradient) once the prediction is correct with margin at least 1, which encourages a maximum-margin decision boundary.
Use when: training SVMs or you explicitly want a margin-based classifier. Rarely used in deep learning.
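A minimal sketch, assuming labels in $\{-1, +1\}$ and raw (unsquashed) decision scores:

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Zero loss once y * score >= 1, i.e. correct with margin; linear penalty otherwise."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y = np.array([1, -1, 1])
print(hinge_loss(y, np.array([2.0, -1.5, 0.3])))  # only the third example (margin 0.3 < 1) is penalized
```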
Evaluation Metrics — Classification
Accuracy
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
The percentage correct. Simple, interpretable, and almost always misleading on imbalanced datasets. A model that always predicts the majority class achieves 99% accuracy on a 1% positive rate dataset while being useless.
Only use when classes are roughly balanced and all errors have equal cost.
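The majority-class failure mode is easy to reproduce on synthetic data (a sketch; the 1% positive rate mirrors the example above):

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(10_000) < 0.01).astype(int)   # ~1% positives
majority = np.zeros_like(y)                    # always predict the negative class

print((majority == y).mean())                  # ~0.99 accuracy
print((majority[y == 1] == 1).mean())          # 0.0 recall: every positive is missed
```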
Precision, Recall, F1
$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Precision: of all positive predictions, how many are actually positive? High precision means few false alarms.
Recall: of all actual positives, how many did we catch? High recall means few missed cases.
These trade off against each other, and you control the tradeoff by adjusting the classification threshold. F1 is their harmonic mean: a single number that penalizes imbalance between the two.
Use precision when: false positives are expensive (spam filter, ad targeting). Use recall when: false negatives are expensive (cancer screening, fraud detection). Use F1 when: you want a balanced single metric.
When recall matters more (or less) than precision, $F_\beta$ generalizes F1 by weighting recall $\beta$ times as heavily as precision:
$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
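scikit-learn covers all four directly (a sketch; the toy labels below contain 2 true positives, 1 false negative, and 2 false positives):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 1, 0, 0, 0])

print(precision_score(y_true, y_pred))       # 0.50  (2 TP / 4 predicted positives)
print(recall_score(y_true, y_pred))          # 0.67  (2 TP / 3 actual positives)
print(f1_score(y_true, y_pred))              # 0.57  (harmonic mean)
print(fbeta_score(y_true, y_pred, beta=2))   # 0.625 (pulled toward recall)
```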
AUC-ROC
The ROC curve plots True Positive Rate vs False Positive Rate across all classification thresholds. AUC is the area under this curve.
AUC = 0.5 is random. AUC = 1.0 is perfect. AUC = 0.8 means an 80% chance that a randomly chosen positive example gets a higher predicted score than a randomly chosen negative example.
Use when: ranking matters more than threshold-based classification. Useful for imbalanced problems (unlike accuracy). Limitation: misleading when false positives and false negatives have very different costs.
AUC-PR (area under precision-recall curve) is more informative on heavily imbalanced datasets where the positive class is rare. The ROC curve’s FPR denominator is dominated by the large negative class; PR focuses only on positive class performance.
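Both areas are one call each in scikit-learn; `average_precision_score` is the usual stand-in for AUC-PR (the toy scores below are invented):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

y_true = np.array([0, 0, 0, 0, 1, 1])
scores = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.9])

print(roc_auc_score(y_true, scores))            # 0.875: fraction of positive/negative pairs ranked correctly
print(average_precision_score(y_true, scores))  # AUC-PR proxy; more informative when positives are rare
```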
Calibration
A model that predicts 70% probability should be right 70% of the time. Calibration measures how well predicted probabilities match actual frequencies.
Reliability diagrams: bin predictions by predicted probability, plot actual frequency in each bin. A well-calibrated model follows the diagonal.
Well-calibrated models are essential when the probability itself (not just the ranking) matters — credit scoring, medical diagnosis, risk assessment.
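scikit-learn's `calibration_curve` produces the reliability-diagram points directly. A sketch on synthetic data where outcomes are drawn at exactly the predicted rate, so the model is well calibrated by construction:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
p = rng.random(50_000)                      # predicted probabilities
y = (rng.random(50_000) < p).astype(int)    # outcomes drawn at the predicted rate

frac_pos, mean_pred = calibration_curve(y, p, n_bins=10)
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f}  observed {f:.2f}")   # well-calibrated points hug the diagonal
```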
Evaluation Metrics — Regression
RMSE
$$RMSE = \sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$$
Square root of MSE. Same units as the target, easier to interpret than MSE. Still sensitive to outliers.
MAPE and sMAPE
$$MAPE = \frac{100\%}{n}\sum\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
Percentage error: scale-invariant, useful for comparing models across datasets with different magnitudes. Problems: undefined when $y_i = 0$, and asymmetric, because for a given actual value an overprediction can produce an arbitrarily large percentage error while an underprediction is capped at 100%, so optimizing MAPE biases models toward underforecasting.
sMAPE (symmetric MAPE) addresses the asymmetry:
$$sMAPE = \frac{200\%}{n}\sum\frac{|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}$$
Neither handles zero targets well. For demand forecasting, use WAPE (weighted absolute percentage error: the sum of absolute errors divided by the sum of actuals) instead.
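The four regression metrics in a few lines of NumPy (the function names are just illustrative):

```python
import numpy as np

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def mape(y, yhat):
    return 100 * np.mean(np.abs((y - yhat) / y))                # blows up if any y == 0

def smape(y, yhat):
    return 200 * np.mean(np.abs(y - yhat) / (np.abs(y) + np.abs(yhat)))

def wape(y, yhat):
    return 100 * np.sum(np.abs(y - yhat)) / np.sum(np.abs(y))   # tolerates individual zero targets
```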
Ranking Metrics
MRR (Mean Reciprocal Rank)
$$MRR = \frac{1}{|Q|}\sum_{i=1}^{|Q|}\frac{1}{rank_i}$$
Average of reciprocal ranks of the first relevant result across queries. Useful when you care about where the first relevant result appears (search, Q&A).
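A direct sketch, assuming each query is summarized by the 1-based rank of its first relevant result:

```python
import numpy as np

def mean_reciprocal_rank(first_relevant_ranks):
    """Average of 1/rank across queries; use np.inf for queries with no relevant result."""
    return np.mean(1.0 / np.asarray(first_relevant_ranks, dtype=float))

print(mean_reciprocal_rank([1, 3, 2]))   # (1 + 1/3 + 1/2) / 3 ≈ 0.61
```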
NDCG (Normalized Discounted Cumulative Gain)
$$DCG = \sum_{i=1}^{k}\frac{rel_i}{\log_2(i+1)} \quad NDCG = \frac{DCG}{IDCG}$$
Rewards relevant documents at the top of the ranking, with logarithmic discounting for lower positions. NDCG normalizes by the ideal ordering. Standard metric for search ranking and recommendation systems.
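A sketch of the formula as written; note that some implementations use the gain $2^{rel_i} - 1$ instead of raw relevance, while this version keeps the linear gain shown above:

```python
import numpy as np

def dcg(relevances, k=None):
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))     # log2(i + 1) for 1-based position i
    return np.sum(rel / discounts)

def ndcg(relevances, k=None):
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 0, 1]))   # ~0.99; would be exactly 1.0 if the list were sorted by relevance
```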
Choosing Loss vs Metric
The loss is what you optimize. The metric is what you report. They should align but need not be identical:
| Task | Loss | Metric |
|---|---|---|
| Binary classification | Binary cross-entropy | AUC, F1, or precision/recall |
| Multi-class | Categorical cross-entropy | Accuracy (if balanced), macro F1 (if not) |
| Regression | MSE or Huber | RMSE, MAE, or MAPE |
| Imbalanced classification | Focal loss | AUC-PR |
| Ranking | Pairwise or listwise loss | NDCG, MRR |
| Probabilistic output | Log-loss | Calibration + AUC |
One common mistake: reporting accuracy on an imbalanced problem, seeing 99%, and concluding the model works. Always match your evaluation metric to the actual cost structure of the problem.
Note: linear regression, logistic regression, and SVMs have convex loss functions, so every local minimum is a global minimum. Deep networks with hidden layers have non-convex losses with many local minima and saddle points, which changes the optimization landscape significantly.