When people outside data science imagine an ML project, they picture: load data, train model, get results. When they budget for one, they budget for maybe a month.
Here’s what the actual distribution of time looks like across real ML projects:
| Phase | Expected share of time (by non-practitioners) | Actual share |
|---|---|---|
| Problem framing | 5% | 10–20% |
| Data collection & cleaning | 10% | 30–50% |
| Feature engineering | 10% | 15–25% |
| Modeling | 40% | 10–20% |
| Evaluation & validation | 15% | 10–15% |
| Deployment & monitoring | 10% | 20–30% |
| Iteration & maintenance | 10% | Ongoing |
Modeling is the smallest part. Data work and deployment are the largest. This is the first thing every ML practitioner knows and every non-practitioner is surprised by.
Phase 1: Problem Framing (10–20%)
The most important phase — and the easiest to rush. Most failed ML projects fail here, not in the model.
The two framing failures:
- Solving the wrong problem. Optimizing for a metric that doesn’t represent the actual business objective. Maximizing recommendation click-through rate when the goal is to reduce customer churn. Minimizing forecast MAE when the business cares about stockout rate, not average accuracy.
- Underspecifying the problem. “Can we predict which customers will churn?” is not a problem specification. “Can we predict which customers will cancel their subscription in the next 30 days with enough lead time for retention outreach, at a precision high enough that the outreach cost is covered by the expected retention value?” is a problem specification.
The work in this phase: structured conversations with domain experts and business stakeholders, a written problem statement with success criteria, and a back-of-envelope feasibility check (does the data exist? is the signal strong enough to be learnable?).
Phase 2: Data Collection and Cleaning (30–50%)
Half your project, consistently. The reasons:
Data access friction: Data lives in different systems, owned by different teams, with different access controls. Getting a data dictionary, an API key, and a legal review for each data source takes time that never appears in the project plan.
Data quality problems: Missing values, encoding errors, duplicate records, structural breaks (the ERP system changed in 2019 and the data before and after isn’t directly comparable), outliers that are real vs. outliers that are errors. None of these are self-documenting.
Label creation: If your target variable isn’t something the business already measures, you need to create labels. For some problems (fraud, returns, churn), labels exist in historical data. For others, they require a labeling process (human annotation, proxy metrics, natural experiments).
Pipeline reliability: A data pipeline that works once on your laptop is not a data pipeline. Production-grade data collection requires error handling, monitoring, retry logic, incremental updates, and schema validation. This takes engineering time.
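A sketch of what “production-grade” means in practice, assuming a pandas-based extract step; the column names, dtypes, retry settings, and the `fetch_fn` callable are hypothetical placeholders:

```python
# Minimal sketch of defensive checks for a production data pipeline.
# Column names, dtypes, and retry settings are illustrative assumptions.
import time
import pandas as pd

EXPECTED_COLUMNS = {
    "customer_id": "int64",
    "order_date": "datetime64[ns]",
    "amount": "float64",
}

def fetch_with_retry(fetch_fn, retries=3, backoff_seconds=30):
    """Retry a flaky extract step with linear backoff before giving up."""
    for attempt in range(1, retries + 1):
        try:
            return fetch_fn()
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff_seconds * attempt)

def validate_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Fail loudly if incoming data doesn't match what downstream code assumes."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    for col, dtype in EXPECTED_COLUMNS.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"{col}: expected {dtype}, got {df[col].dtype}")
    if df["customer_id"].isna().any():
        raise ValueError("customer_id contains nulls")
    return df
```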
The key deliverable: a cleaned, documented dataset with a data card — summary statistics, known issues, data dictionary, and provenance. Every project should have one. Almost none do.
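A minimal sketch of generating the quantitative half of a data card from a cleaned DataFrame; the provenance and known-issues fields are assumptions you fill in by hand:

```python
# Sketch: build the skeleton of a data card (stats, dictionary, provenance, known issues).
import json
import pandas as pd

def build_data_card(df: pd.DataFrame, source: str, known_issues: list[str]) -> dict:
    return {
        "source": source,                 # provenance: where the data came from
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "columns": {
            col: {
                "dtype": str(df[col].dtype),
                "missing_pct": round(float(df[col].isna().mean()) * 100, 2),
                "n_unique": int(df[col].nunique()),
            }
            for col in df.columns
        },
        "known_issues": known_issues,     # e.g. "pre-2019 ERP records not comparable"
    }

# Usage sketch -- write the card next to the dataset it describes:
# card = build_data_card(df, source="warehouse.orders snapshot 2024-01-01",
#                        known_issues=["duplicate orders before dedup step"])
# with open("data_card.json", "w") as f:
#     json.dump(card, f, indent=2)
```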
Phase 3: Feature Engineering (15–25%)
Covered in depth separately. The key point from the lifecycle perspective: feature engineering is not a one-time step. You return to it after modeling, when error analysis reveals what the model is missing.
The architecture decision that matters here: whether to use a feature store. For one-off projects, computing features inline is fine. For production systems with multiple models sharing features, a feature store prevents the training/serving skew that causes silent production failures.
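When a full feature store is overkill, the lightweight alternative is a single feature function imported by both the training pipeline and the serving code, so the two definitions can’t drift apart. A sketch, with hypothetical column names and an illustrative 90-day window:

```python
# Sketch: one shared feature definition used at both training and serving time.
import pandas as pd

def customer_spend_features(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Per-customer spend features computed as of a given date."""
    recent = orders[orders["order_date"].between(as_of - pd.Timedelta(days=90), as_of)]
    return (
        recent.groupby("customer_id")["amount"]
        .agg(spend_90d="sum", orders_90d="count", avg_order_90d="mean")
        .reset_index()
    )

# Training: features = customer_spend_features(historical_orders, as_of=label_date)
# Serving:  features = customer_spend_features(latest_orders, as_of=pd.Timestamp.utcnow())
```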
Phase 4: Modeling (10–20%)
The shortest phase, but the one with the most cultural weight.
The baseline first rule: Always train the simplest model that could work before reaching for complexity. You need to know what you’re improving on. A logistic regression or gradient boosted tree with good features outperforms a deep neural network with poor features on most tabular problems.
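A minimal baseline sketch, assuming a tabular binary-classification problem with a numeric feature matrix `X` and labels `y` already prepared:

```python
# Sketch: fit the two obvious baselines and record the number everything else must beat.
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_baselines(X, y):
    """Fit a logistic regression and a gradient boosted tree; report test AUC for each."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
    baselines = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "gradient_boosting": HistGradientBoostingClassifier(),
    }
    for name, model in baselines.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: test AUC = {auc:.3f}")  # anything fancier has to beat this
```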
The hyperparameter tuning trap: Spending days tuning a model when the same time spent on feature engineering would produce more improvement. The rule of thumb: feature engineering gives linear improvements in model quality; hyperparameter tuning gives logarithmic improvements. Do features first.
The experiment tracking requirement: Use MLflow, Weights & Biases, or equivalent from the first experiment. Reconstructing “what exactly was that run that worked three weeks ago?” from memory is painful.
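A minimal tracking sketch using MLflow’s standard logging calls; the experiment name, parameters, and metric values are illustrative:

```python
# Sketch: log each run's parameters and metrics so "the run from 3 weeks ago" is searchable.
import mlflow

mlflow.set_experiment("churn-model")   # hypothetical experiment name

with mlflow.start_run(run_name="hgb_baseline_v1"):
    mlflow.log_param("model_type", "HistGradientBoostingClassifier")
    mlflow.log_param("feature_set", "spend_90d_v2")
    mlflow.log_metric("test_auc", 0.87)
    mlflow.log_metric("test_precision_at_10pct", 0.42)
    # mlflow.sklearn.log_model(model, "model")  # also log the model artifact itself
```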
Phase 5: Evaluation (10–15%)
Where most academic projects stop and most production projects begin.
The evaluation gap: Test set performance is an estimate of production performance under assumptions that production will violate. The assumptions:
- Data distribution is stable (it won’t be)
- The test set is representative (it may not be)
- The target variable in production is the same as in training (it sometimes isn’t)
Error analysis: The most important thing you can do after computing an overall metric. Find the failure cases. Understand why they fail. The patterns in failures are the roadmap to improvement — and also to the documentation that tells future maintainers where the model will break.
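A sketch of the mechanics, assuming a test DataFrame with `y_true`, `y_score`, and a hypothetical `segment` column to slice on:

```python
# Sketch: slice test-set errors by segment and surface the most confident failures.
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_by_segment(test_df: pd.DataFrame, segment_col: str = "segment") -> pd.Series:
    """Per-segment AUC; the worst segments are the roadmap for the next iteration."""
    def _safe_auc(g: pd.DataFrame) -> float:
        return roc_auc_score(g["y_true"], g["y_score"]) if g["y_true"].nunique() > 1 else float("nan")
    return test_df.groupby(segment_col).apply(_safe_auc).sort_values()

def worst_false_negatives(test_df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Positives the model was most confident were negative -- read these row by row."""
    return test_df[test_df["y_true"] == 1].nsmallest(n, "y_score")
```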
Business metric translation: What does “MAE of 15%” on the test set mean in business terms? Run the model through a business impact simulation. How many stockouts does this forecast accuracy level lead to? What’s the value of reducing MAE by 5%? This is the conversation that gets model improvements actually deployed.
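A rough sketch of that simulation for a demand-forecasting setting; the safety-stock policy and the way value is computed are made-up placeholders, not a standard formula:

```python
# Sketch: translate forecast error into an operational quantity (stockout rate).
import numpy as np

def simulate_stockouts(actual_demand: np.ndarray, forecast: np.ndarray,
                       safety_factor: float = 1.1) -> float:
    """Fraction of periods where stock ordered from the forecast fell short of demand."""
    stock = forecast * safety_factor      # hypothetical ordering policy
    return float((actual_demand > stock).mean())

# Compare the current model against a candidate with lower error:
# stockout_now = simulate_stockouts(actual, forecast_current)
# stockout_new = simulate_stockouts(actual, forecast_candidate)
# value = (stockout_now - stockout_new) * periods_per_year * avg_stockout_cost
```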
Phase 6: Deployment (20–30%)
Where machine learning systems fail most expensively. The most common source of failure: assumptions baked in during training that don’t hold in production.
Training/serving skew: The most common production failure. Features computed differently at training and serving time. A model trained on a feature that’s computed with “sum of all historical values” will fail in serving if serving computes “sum of last 90 days.” These differences are subtle and catastrophic.
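One way to catch this before launch: replay the serving-side feature code over a sample of training rows and diff the result against what the training pipeline produced. A sketch, where `training_features` and `serving_feature_fn` are assumptions standing in for your own pipeline, and features are assumed numeric:

```python
# Sketch: parity check between training-time and serving-time feature values.
import numpy as np
import pandas as pd

def check_feature_parity(training_features: pd.DataFrame,
                         serving_feature_fn,
                         raw_sample: pd.DataFrame,
                         tolerance: float = 1e-6) -> pd.Series:
    """Return the fraction of mismatching values per feature column."""
    served = serving_feature_fn(raw_sample).set_index("customer_id")
    trained = training_features.set_index("customer_id").loc[served.index]
    mismatch = (np.abs(trained - served) > tolerance).mean()
    return mismatch.sort_values(ascending=False)

# Any column with a non-zero mismatch rate ("sum of all history" vs "sum of last 90 days")
# is a skew bug waiting to surface in production.
```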
Latency requirements: A model that runs in 10 seconds is fine for batch processing. It’s unusable for real-time recommendations. Know the latency budget before building the model — architecture choices (batch vs. real-time, embedded vs. API) depend on it.
Monitoring: At minimum: track prediction distribution and key input feature distributions. Detect drift. Alert when the model’s outputs change in unexpected ways. Without monitoring, you will not know when the model is producing garbage.
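A minimal drift check, using a per-feature two-sample Kolmogorov–Smirnov test between the training data and a recent production window; the alert threshold is an illustrative assumption, not a universal constant:

```python
# Sketch: daily drift report over numeric input features.
import pandas as pd
from scipy.stats import ks_2samp

def drift_report(train_df: pd.DataFrame, recent_df: pd.DataFrame,
                 threshold: float = 0.1) -> pd.DataFrame:
    rows = []
    for col in train_df.select_dtypes("number").columns:
        stat, p_value = ks_2samp(train_df[col].dropna(), recent_df[col].dropna())
        rows.append({"feature": col, "ks_stat": stat, "p_value": p_value,
                     "drifted": stat > threshold})
    return pd.DataFrame(rows).sort_values("ks_stat", ascending=False)

# Run on a schedule; also track the prediction score distribution the same way,
# since a shift in outputs is often the first visible symptom.
```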
Model versioning and rollback: Every model deployment should have a clear rollback path. This requires keeping the previous model version running until the new one is validated.
Phase 7: Iteration and Maintenance (Ongoing)
Models are not done when deployed. The ongoing work:
- Retraining on new data (schedule depends on how fast the world changes)
- Responding to concept drift alerts (the relationship between features and target has changed)
- Adding features as new data sources become available
- Fixing edge cases discovered in production
- Managing technical debt in the feature pipeline
The hard organizational question: who owns the model in production? The team that built it is often off to the next project. The team that owns the system it’s embedded in often doesn’t have the ML expertise to maintain it. This ownership gap is the most common cause of model degradation.
The Bottom Line for Planning
For a production ML project, a realistic timeline is 3–6 months for the first deployment of a non-trivial model. Of that time:
- 2–4 weeks on problem framing and stakeholder alignment
- 4–8 weeks on data access, cleaning, and pipeline setup
- 2–4 weeks on feature engineering and initial modeling
- 2–4 weeks on evaluation, hardening, and deployment
- Ongoing: monitoring, iteration, maintenance
The organizations that consistently deliver ML value are the ones that plan for the real timeline, not the modeling-centric fantasy.