Data science projects fail more often from process problems than from modeling problems. Wrong problem definition, poor data quality, evaluation on biased test sets, production deployment failures — these are all preventable before you train a single model. This is the end-to-end pipeline, with the failure modes at each stage.
Layer 1: Problem Framing
The most important layer. Everything downstream is constrained by how you frame the problem.
Define the decision, not the metric. The purpose of a model is to improve a decision. What decision will be made with this model’s output? Who makes it? How often?
Identify the baseline. What happens if you do nothing? What’s the current human decision-making process, and what’s its error rate? Your model needs to beat this — and you need to know by how much, not just “better.”
Define success in business terms, not model terms. “95% accuracy” means nothing without context. “Reducing stockouts by 20% in a category that drives $2M/month in revenue” is a success criterion.
What can go wrong: Solving a technically tractable problem that isn’t the actual business problem. Building a recommendation engine when the real problem is user retention. Building a churn predictor when the real bottleneck is operationally acting on the predictions.
Layer 2: Data Identification and Collection
Before exploring data, understand what data exists and what it actually represents.
Data inventory: What data is available? What’s the update frequency? Who owns it? What are the access constraints? Is there historical data, or only real-time data?
Data lineage: Where does the data come from? Has it been transformed? Were there any system changes that would cause a structural break in the historical data?
Labeling: Is the target variable available historically? If you need labels and they don’t exist, you need a labeling strategy before anything else.
What can go wrong: Assuming data exists that doesn’t. Building a model on data that was collected under different conditions than deployment. Ignoring that the “historical” data has been retroactively corrected.
Layer 3: Exploratory Data Analysis
EDA is systematic pattern discovery, not casual browsing.
Start with the data, not the model. Spend more time on EDA than you think you need. The patterns you find here will inform every subsequent decision.
Check distributions of everything: raw features, engineered features, the target variable. Look for the following (a code sketch of these checks follows the list):
- Outliers (values 10x the mean that aren’t real)
- Missing values (and patterns in missingness — is data missing at random, or does missingness correlate with the target?)
- Class imbalance in the target
- Multicollinearity (highly correlated features that cause redundancy)
- Time-based drift (distributions that change over time in the historical data)
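A minimal sketch of these checks in pandas, on a small synthetic table; the column names (event_date, feature_a, target) are placeholders for your own schema:

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real table; swap in your own DataFrame.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "event_date": pd.date_range("2022-01-01", periods=n, freq="D"),
    "feature_a": rng.normal(size=n),
    "feature_b": rng.normal(size=n),
    "target": rng.integers(0, 2, size=n),
})
df["feature_c"] = df["feature_a"] * 0.99 + rng.normal(scale=0.05, size=n)  # near-duplicate feature
df.loc[rng.random(n) < 0.1, "feature_b"] = np.nan                          # injected missingness

# 1. Missingness: overall rate, and whether missingness tracks the target.
missing_rate = df.isna().mean().sort_values(ascending=False)
print(missing_rate[missing_rate > 0])
for col in missing_rate[missing_rate > 0].index:
    print(col, df.groupby(df[col].isna())["target"].mean().to_dict())

# 2. Class imbalance in the target.
print(df["target"].value_counts(normalize=True))

# 3. Multicollinearity: numeric feature pairs with |correlation| above 0.9.
numeric = df.select_dtypes("number").drop(columns="target")
corr = numeric.corr().abs()
pairs = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1)).stack()
print(pairs[pairs > 0.9])

# 4. Time-based drift: does a feature's monthly mean move over the historical window?
monthly = df.set_index("event_date").resample("MS").mean(numeric_only=True)
print(monthly.head())
```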
Visualize before summarizing. Summary statistics (mean, std) can be identical across very different distributions (Anscombe’s quartet). Always plot.
What can go wrong: Skipping EDA and discovering data problems after model training. Building intuitions on the training set that don’t hold in the test set.
Layer 4: Feature Engineering
The layer that separates good data scientists from average ones. Covered in full in Feature Engineering: The Skill That Separates Good Models from Bad Ones.
Key principle: features should encode domain knowledge about what is predictive. The model is a lens — it can only see what you put in front of it.
What can go wrong: Feature leakage (using future information to predict the past). Training/serving skew (computing features differently at training vs. inference time). Redundant features that increase complexity without improving performance.
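Leakage is easiest to see in code. A sketch of one common pattern on a hypothetical transactions table: the leaky feature averages a customer’s entire history, including rows that happen after the one being predicted, while the leak-free version only sees strictly earlier rows.

```python
import numpy as np
import pandas as pd

# Hypothetical transactions table: one row per (customer, date) with a spend amount.
rng = np.random.default_rng(1)
tx = pd.DataFrame({
    "customer_id": rng.integers(0, 5, size=200),
    "date": pd.Timestamp("2023-01-01") + pd.to_timedelta(rng.integers(0, 90, size=200), unit="D"),
    "spend": rng.gamma(2.0, 30.0, size=200),
}).sort_values(["customer_id", "date"]).reset_index(drop=True)

# LEAKY: the mean is taken over the customer's entire history, so early rows
# "see" spend that only happens later.
tx["avg_spend_leaky"] = tx.groupby("customer_id")["spend"].transform("mean")

# LEAK-FREE: expanding mean shifted by one row, so each row only uses
# transactions that occurred strictly before it.
tx["avg_spend_past"] = (
    tx.groupby("customer_id")["spend"]
      .transform(lambda s: s.expanding().mean().shift(1))
)
print(tx.head(10))
```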
Layer 5: Modeling
The layer most data scientists spend too much time on relative to layers 1–4.
Start simple. A linear model or a well-tuned gradient boosted tree is the right baseline in almost all cases. Understanding why the simple model fails is more valuable than immediately reaching for deep learning.
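A sketch of that baseline discipline with scikit-learn, on synthetic data standing in for your own features; the point is the comparison, not the particular models:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for your feature matrix.
X, y = make_classification(n_samples=2_000, n_features=20, weights=[0.9, 0.1], random_state=0)

baselines = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "gradient boosted trees": HistGradientBoostingClassifier(random_state=0),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name:>22}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gradient boosted tree barely beats logistic regression, that usually says more about the features than about the model.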
Cross-validation discipline. Evaluate on held-out data that the model has never seen. Use stratified splits for class-imbalanced problems. Use time-based splits for temporal data (never shuffle time-series).
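A small illustration with scikit-learn’s built-in splitters:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)           # stand-in features, in time order
y = np.array([0] * 90 + [1] * 10)           # imbalanced target

# Stratified splits: every fold preserves the 90/10 class ratio.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    print("positives in test fold:", int(y[test_idx].sum()))

# Time-based splits: training rows always precede test rows; nothing is shuffled.
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print(f"train rows 0..{train_idx.max()} -> test rows {test_idx.min()}..{test_idx.max()}")
```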
Hyperparameter tuning. Tune on a validation set, evaluate final performance on a test set that was never touched during tuning. If you tune on the test set, your test performance is not an honest estimate of generalization.
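A sketch of that discipline: carve off the test set first, tune only inside the remaining data (here via cross-validated grid search), and score the test set exactly once. The data and parameter grid are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=3_000, n_features=20, random_state=0)

# Hold out the test set FIRST; it is never touched during tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# All tuning happens inside the train/validation data.
search = GridSearchCV(
    HistGradientBoostingClassifier(random_state=0),
    param_grid={"learning_rate": [0.05, 0.1, 0.3], "max_depth": [3, 6, None]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_trainval, y_trainval)

# One final evaluation on the untouched test set: an honest generalization estimate.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"best params: {search.best_params_}, test AUC: {test_auc:.3f}")
```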
What can go wrong: Tuning hyperparameters on the test set (data leakage). Comparing models with different amounts of tuning effort. Selecting the best model from 50 experiments without correcting for multiple comparisons.
Layer 6: Evaluation
Evaluation is not just “pick the model with the highest accuracy.”
Choose the right metric. Accuracy is meaningless with class imbalance. AUC is appropriate when you need to evaluate across thresholds. Precision@k matters for recommendation systems. Business-specific metrics (revenue impact, cost savings) should be your ultimate evaluation criterion.
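Precision@k is simple enough to compute directly; a minimal hand-rolled helper (not a library call):

```python
import numpy as np

def precision_at_k(y_true, y_score, k):
    """Fraction of the k highest-scored items that are actually relevant."""
    top_k = np.argsort(y_score)[::-1][:k]
    return float(np.asarray(y_true)[top_k].mean())

# Toy example: two of the three highest-scored items are relevant.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
print(precision_at_k(y_true, y_score, k=3))  # 0.667
```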
Error analysis. Analyze where the model is wrong. Look at the examples with the highest errors. Do they share a pattern? Is the model failing on a specific subgroup? Error analysis often reveals the next round of feature engineering opportunities.
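A sketch of subgroup error analysis on a hypothetical evaluation frame, where one simulated segment is deliberately made worse so the pattern shows up:

```python
import numpy as np
import pandas as pd

# Hypothetical evaluation frame: actuals, predictions, and a segment column.
rng = np.random.default_rng(2)
eval_df = pd.DataFrame({
    "segment": rng.choice(["new_user", "returning", "enterprise"], size=500),
    "y_true": rng.normal(100, 20, size=500),
})
eval_df["y_pred"] = eval_df["y_true"] + rng.normal(0, 10, size=500)
mask = eval_df["segment"] == "new_user"
eval_df.loc[mask, "y_pred"] += rng.normal(15, 20, size=mask.sum())   # simulated weak subgroup

eval_df["abs_error"] = (eval_df["y_true"] - eval_df["y_pred"]).abs()

# Worst individual predictions: do they share a pattern?
print(eval_df.sort_values("abs_error", ascending=False).head(10))

# Error by subgroup: is one segment carrying most of the error?
print(eval_df.groupby("segment")["abs_error"].agg(["mean", "median", "count"]))
```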
Calibration. Especially for probabilistic outputs: are the model’s predicted probabilities calibrated? A model that predicts 70% probability for events that happen 30% of the time is not useful for decision-making, even if its AUC is high.
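A sketch using scikit-learn’s calibration_curve, with isotonic recalibration as one possible fix; Gaussian Naive Bayes is used only because it is a convenient example of a poorly calibrated model:

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Raw model: compare predicted probabilities (prob_pred) to observed rates (prob_true).
raw = GaussianNB().fit(X_train, y_train)
prob_true, prob_pred = calibration_curve(y_test, raw.predict_proba(X_test)[:, 1], n_bins=10)
print("raw:       ", list(zip(prob_pred.round(2), prob_true.round(2))))

# Isotonic recalibration: learn a mapping from raw scores to honest probabilities.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X_train, y_train)
prob_true, prob_pred = calibration_curve(y_test, calibrated.predict_proba(X_test)[:, 1], n_bins=10)
print("calibrated:", list(zip(prob_pred.round(2), prob_true.round(2))))
```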
What can go wrong: Choosing the metric that makes your model look best rather than the metric that reflects business value. Reporting test performance without confidence intervals (is the improvement real or sampling noise?).
Layer 7: Deployment
Where most academic and research-oriented data scientists underinvest.
Serving infrastructure: How will the model be called? REST API, batch job, embedded in an application? What are the latency requirements?
Model packaging: Serialize the model in a format that can be loaded without the training environment. Version the model artifacts.
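A minimal packaging sketch with joblib; the directory layout and metadata fields are one possible convention (hypothetical paths), not a standard:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a throwaway model just to have an artifact to package.
X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Version the artifact: the pickle alone is not enough to reproduce or debug it later.
artifact_dir = Path("artifacts/churn_model/2024-06-01")   # hypothetical versioned path
artifact_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(model, artifact_dir / "model.joblib")
(artifact_dir / "metadata.json").write_text(json.dumps({
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "sklearn_version": sklearn.__version__,
    "feature_names": [f"f{i}" for i in range(X.shape[1])],
    "training_rows": int(len(X)),
}, indent=2))

# At serving time, the artifact loads without the training code.
model = joblib.load(artifact_dir / "model.joblib")
```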
Monitoring: Track prediction distributions, input feature distributions, and downstream business metrics. Set alerts for distribution drift.
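One simple drift check is a two-sample Kolmogorov-Smirnov test between a feature’s training-time distribution (stored at deploy time) and a recent production window; a sketch with scipy, where the alert threshold is illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
train_feature = rng.normal(0.0, 1.0, size=10_000)   # reference sample, stored at deploy time
live_feature = rng.normal(0.4, 1.0, size=2_000)     # this week's production values (shifted)

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:                                   # illustrative threshold
    print(f"ALERT: feature distribution drifted (KS={stat:.3f}, p={p_value:.1e})")
```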
Rollback plan: What do you do if the model fails in production? A/B test new models before full deployment. Always have a rollback path.
What can go wrong: Training/serving skew (features computed differently). Silent failures (the model runs but produces bad predictions that look plausible). No monitoring, so degradation isn’t detected for weeks.
Layer 8: Iteration and Maintenance
Models are not static artifacts. Data distributions shift, business contexts change, and edge cases emerge.
Concept drift: The relationship between features and target changes over time. A model trained pre-COVID to predict travel demand is useless post-COVID. Set up monitoring to detect when model performance degrades.
Data drift: The distribution of input features changes even if the underlying relationship is stable. A model trained on data from 2021 may have never seen the feature distributions of 2023.
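One common way to quantify data drift is the population stability index (PSI) between a training-time sample and a production sample of the same feature; a minimal sketch, where the 0.25 threshold is a widely used rule of thumb rather than a law:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample (expected) and a production sample (actual)."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    actual = np.clip(actual, edges[0], edges[-1])        # keep out-of-range values in the end bins
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    exp_pct = np.clip(exp_pct, 1e-6, None)               # avoid log(0) in empty bins
    act_pct = np.clip(act_pct, 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(4)
train_2021 = rng.normal(0.0, 1.0, size=20_000)           # feature as seen at training time
prod_2023 = rng.normal(0.5, 1.3, size=5_000)             # same feature, two years later
print(population_stability_index(train_2021, prod_2023)) # > 0.25 usually flags a large shift
```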
Retraining schedule: How often should the model be retrained? For some models (stable seasonal patterns), annual retraining is fine. For others (financial markets, live user behavior), weekly or daily retraining is necessary.
Technical debt: The pipeline that was “good enough for the prototype” accumulates shortcuts and workarounds. Regularly review whether the codebase is maintainable by someone who didn’t write it.
The Common Thread
Across all 8 layers, the discipline is the same: make assumptions explicit, test them against data, and document what you found and why you made each decision. The biggest failures come from assumptions that were never stated, never tested, and eventually turned out to be wrong.