Overview
Hydroponics farming is a data-rich environment — environmental sensors track temperature, humidity, nutrient levels, and light exposure continuously. But most hydroponics operations weren’t using this data systematically. The project was to build an ML pipeline that could predict crop yields from environmental and operational data, and connect those predictions to demand forecasting for better crop planning.
The Problem
Traditional farming relies heavily on experience and intuition. In hydroponics, where environmental conditions can be precisely controlled, there’s an opportunity to be much more systematic. But without a data-driven model, the operation couldn’t answer basic questions: Which crops should we grow more of next cycle? How does nutrient mix affect final yield? When should we plant to meet forecasted demand?
Data & Inputs
- Environmental sensor data: temperature, humidity, CO₂ levels, light exposure, nutrient concentrations (N, P, K), pH, electrical conductivity; time series at hourly resolution
- Growth stage data: crop type, days since planting, pest/disease flags
- Historical sales data: demand and price per crop type
- Climate data for the growing facility
Approach
Three connected models:
Yield prediction: Decision trees, random forests, gradient-boosted trees (XGBoost), and a shallow neural network to predict final yield weight from environmental features and growth stage data. Feature importance analysis to identify the most impactful controllable variables.
Demand forecasting: Time-series model (exponential smoothing) on historical weekly sales by crop type, capturing seasonal patterns and trend at weekly resolution.
Planning integration: Combined yield predictions with demand forecasts to generate planting recommendations — how many plants of each crop type to start each week to meet forecasted demand.
Results & Impact
- Working end-to-end pipeline from raw sensor data to planting recommendations
- Feature importance analysis surfaced key yield drivers: nutrient concentration and light exposure were the highest-impact controllable variables
- Decision support tool enabling data-driven crop planning decisions
Technical Detail
Data characteristics: environmental sensors logged at hourly resolution across multiple growing cycles. Key challenge: sensor readings in a hydroponics system are highly correlated. Temperature, humidity, and CO₂ all co-vary with the HVAC system, and the resulting multicollinearity makes coefficient estimates in standard linear models unstable and hard to interpret.
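One common way to quantify this kind of collinearity is the variance inflation factor (VIF). The sketch below is illustrative only: the readings are synthetic, and the `vif` helper, the shared `hvac` driver, and all coefficients are assumptions rather than the project's code or data.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor per column: 1 / (1 - R^2), where R^2 comes
    from regressing that column on all other columns plus an intercept."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
hvac = rng.normal(size=500)  # hypothetical shared HVAC driver
temperature = 22 + 2.0 * hvac + rng.normal(scale=0.3, size=500)
humidity = 60 - 5.0 * hvac + rng.normal(scale=1.0, size=500)
co2 = 800 + 50.0 * hvac + rng.normal(scale=10.0, size=500)
ph = 6.0 + rng.normal(scale=0.2, size=500)  # independent of the HVAC driver

names = ["temperature", "humidity", "co2", "ph"]
vifs = dict(zip(names, vif(np.column_stack([temperature, humidity, co2, ph]))))
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A VIF near 1 means a variable is close to independent of the others. Values well above roughly 5 to 10 are a common rule-of-thumb signal that linear coefficients for that variable will be unstable.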
Feature engineering:
- Rolling aggregates per growing cycle: mean, min, max, and standard deviation of each environmental variable over the past 7, 14, and 21 days — capturing accumulated stress or favorable conditions rather than instantaneous readings
- Days-since-planting expanded into polynomial terms to approximate the sigmoidal growth curve
- Crop type as one-hot categorical
- Nutrient concentration ratios (N:P, P:K) rather than raw values — relative ratios were more predictive than absolute concentrations, consistent with plant nutrient uptake dynamics
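The feature steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration on synthetic data: the column names, the daily averaging step before the rolling windows, the cubic polynomial degree, and the crop categories are all assumptions, not details taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for one 28-day growing cycle (illustrative only).
rng = np.random.default_rng(1)
hours = pd.date_range("2024-01-01", periods=24 * 28, freq=pd.Timedelta(hours=1))
n = len(hours)
raw = pd.DataFrame(
    {
        "temperature": 22 + rng.normal(scale=1.5, size=n),
        "n_ppm": 150 + rng.normal(scale=10, size=n),
        "p_ppm": 50 + rng.normal(scale=4, size=n),
        "k_ppm": 200 + rng.normal(scale=12, size=n),
    },
    index=hours,
)

# Assumption: hourly readings are averaged per day before rolling windows.
daily = raw.resample("D").mean()
features = pd.DataFrame(index=daily.index)

# Rolling aggregates over the past 7, 14, and 21 days (NaN until a window fills).
for window in (7, 14, 21):
    rolled = daily["temperature"].rolling(window)
    for stat in ("mean", "min", "max", "std"):
        features[f"temperature_{stat}_{window}d"] = getattr(rolled, stat)()

# Nutrient ratios instead of raw concentrations.
features["n_to_p"] = daily["n_ppm"] / daily["p_ppm"]
features["p_to_k"] = daily["p_ppm"] / daily["k_ppm"]

# Polynomial terms for days since planting (cycle assumed to start at day 0).
days = np.arange(len(daily))
for power in (1, 2, 3):
    features[f"days_since_planting_pow{power}"] = days ** power

# One-hot crop type; a fixed category list keeps columns stable across cycles.
features["crop_type"] = pd.Categorical(
    ["lettuce"] * len(daily), categories=["basil", "lettuce"]
)
features = pd.get_dummies(features, columns=["crop_type"])
```

Only temperature is rolled here to keep the sketch short. The same loop would apply to each environmental variable.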
Model comparison:
- Decision tree: interpretable baseline with useful feature importance output
- Random forest: best overall validation performance — reduced variance over the single tree
- XGBoost: marginal improvement with regularization on the larger feature set
- Shallow neural network (2 hidden layers, ReLU): underperformed the tree ensembles. The dataset was too small for learned non-linear features to outweigh the inductive bias that tree ensembles bring to tabular data
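A comparison like the one above can be run with a shared cross-validation split. The sketch below uses synthetic data and only scikit-learn: `GradientBoostingRegressor` stands in for XGBoost and `MLPRegressor` stands in for the TensorFlow network, so it illustrates the comparison procedure, not the project's models, hyperparameters, or results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic tabular data with a linear term, a threshold, and an interaction.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
y = (
    3.0 * X[:, 0]
    + np.where(X[:, 1] > 0, 2.0, 0.0)
    + X[:, 2] * X[:, 3]
    + rng.normal(scale=0.5, size=300)
)

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Stand-in for XGBoost (assumption: same model family, different library).
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    # Stand-in for the 2-hidden-layer ReLU network; inputs are standardized.
    "shallow_mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                     max_iter=2000, random_state=0),
    ),
}

# The same folds for every model keep the comparison fair.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```

With real sensor data, rows from the same growing cycle are correlated, so splitting by cycle (for example with scikit-learn's `GroupKFold`) would likely give a more honest estimate than shuffled folds.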
Feature importance findings: nitrogen concentration and cumulative light exposure (daily light integral, DLI) were the two highest-impact controllable variables. Temperature contributed meaningfully but showed diminishing returns beyond a crop-specific comfort range. This finding directly shaped operational recommendations — the operation prioritized automated nutrient dosing and supplemental lighting as the highest-ROI improvements.
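One way to produce a ranking of this kind is permutation importance on held-out data. The example below is a sketch on synthetic data whose structure loosely mirrors the findings (strong nitrogen and DLI effects, a temperature effect that plateaus, irrelevant humidity). The feature names and coefficients are assumptions, and the source does not say which importance method the project used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

feature_names = ["nitrogen", "dli", "temperature", "humidity"]
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))

# Synthetic yield: temperature helps only up to a point, then flattens out.
y = (
    4.0 * X[:, 0]
    + 3.0 * X[:, 1]
    + np.minimum(X[:, 2], 0.0)
    + rng.normal(scale=0.3, size=400)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and measure the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranked = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

When features are strongly correlated, as the HVAC-driven sensors are, importance tends to be shared among the correlated group, so rankings within that group should be read with caution.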
Planning integration: the yield predictions were combined with a time-series demand forecast (exponential smoothing on historical weekly sales by crop type). The output was a weekly recommendation matrix (crop type × recommended planting quantity) sized to meet 4-week-ahead demand within the predicted yield range. This moved crop cycle planning from intuition-driven to data-driven.
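The planning arithmetic can be illustrated with a small pure-Python sketch. It uses simple exponential smoothing, which yields a flat forecast and so ignores trend and seasonality; this is a simplification, and the exact smoothing variant used in the project is not stated. The sales figures, the per-plant yield ranges, and `alpha=0.5` are hypothetical.

```python
import math

def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; returns the final smoothed level,
    which serves as a flat forecast for every future week."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

# Hypothetical weekly sales (kg) and predicted yield range (kg per plant).
weekly_sales_kg = {
    "lettuce": [120, 130, 125, 140, 150, 145],
    "basil": [30, 28, 35, 33, 36, 40],
}
predicted_yield_kg_per_plant = {
    "lettuce": (0.18, 0.22),  # (low, high)
    "basil": (0.04, 0.06),
}

recommendation = {}
for crop, sales in weekly_sales_kg.items():
    forecast_kg = exponential_smoothing(sales, alpha=0.5)
    low, high = predicted_yield_kg_per_plant[crop]
    recommendation[crop] = {
        "forecast_kg": forecast_kg,
        # Fewest plants if yield lands at the high end of the range.
        "plants_min": math.ceil(forecast_kg / high),
        # Most plants if yield lands at the low end of the range.
        "plants_max": math.ceil(forecast_kg / low),
    }

for crop, row in recommendation.items():
    print(crop, row)
```

Each entry corresponds to one row of the weekly recommendation matrix. Planting toward `plants_max` covers demand even if yield comes in at the low end of the predicted range.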
Stack
Python, Scikit-learn, XGBoost, TensorFlow, Pandas, NumPy, Matplotlib