Context
Hydroponics is a data-rich environment. Sensors continuously track temperature, humidity, nutrient levels, and light exposure, and growing conditions are far more controllable than in soil farming. But logging data and using it are different things, and this data was sitting unused.
This project was about closing that gap end to end: predict crop yield from environmental and operational data, find which of those variables actually move yield, then connect the predictions to demand so the output is an actual decision - what to plant, and how much - rather than a number on a dashboard.
The Problem
Where conditions can be precisely controlled, planning by intuition leaves value on the table. Without a model, basic questions had no evidence-backed answer: Which crops should we grow more of next cycle? How does nutrient mix affect final yield? When should we plant to meet forecasted demand?
Each answer was a good guess - experience-informed, but still a guess. The goal was to replace the guessing with something measurable, and to be honest about which variables are worth acting on.
Data & Inputs
- Environmental sensor data: temperature, humidity, CO₂, light exposure, nutrient concentrations (N, P, K), pH, and electrical conductivity - time series at hourly resolution
- Growth-stage data: crop type, days since planting, pest/disease flags
- Historical sales data: demand and price per crop type
- Climate data for the growing facility
Approach
Three connected models, deliberately sequenced so the output is a decision, not a prediction.
Yield prediction. Decision trees, random forests, gradient-boosted trees (XGBoost), and a shallow neural network, each trained to predict final yield weight from environmental and growth-stage features. Feature-importance analysis then identified the most impactful controllable variables - the ones worth acting on.
Demand forecasting. A time-series model on historical sales by crop type, capturing weekly seasonality and trend.
Planning integration. Yield predictions combined with the demand forecast to generate planting recommendations - how many plants of each crop type to start each week to meet forecasted demand within the predicted yield range.
Results
- A working end-to-end pipeline from raw hourly sensor data to a weekly planting recommendation matrix
- Feature-importance analysis identified nitrogen concentration and cumulative light exposure as the two highest-impact controllable variables
- A decision-support output that grounds crop planning in evidence rather than intuition
Technical Detail
Data characteristics. Environmental sensors logged at hourly resolution across multiple growing cycles. The core modeling challenge: sensors in a hydroponics system are highly correlated. Temperature, humidity, and CO₂ all co-vary with the HVAC system, creating multicollinearity that makes the coefficients of standard linear models unstable and hard to interpret. Tree ensembles are far less sensitive to it when the goal is prediction, which is a large part of why they won out.
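One way to make the collinearity visible is to compute pairwise correlations and variance inflation factors (VIF) across the sensor channels. The sketch below is illustrative only: the readings are synthetic, generated from a single hypothetical `hvac` signal, and the coefficients and noise levels are assumptions rather than values from the facility.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in (assumption): one hidden HVAC signal drives all three sensors.
hvac = rng.normal(size=n)
temperature = 22.0 + 2.0 * hvac + rng.normal(scale=0.3, size=n)
humidity = 65.0 - 5.0 * hvac + rng.normal(scale=1.0, size=n)
co2 = 800.0 + 60.0 * hvac + rng.normal(scale=15.0, size=n)

X = np.column_stack([temperature, humidity, co2])
names = ["temperature", "humidity", "co2"]

# Pairwise Pearson correlations between sensor channels.
corr = np.corrcoef(X, rowvar=False)


def vif(X: np.ndarray, j: int) -> float:
    """Variance inflation factor of column j: 1 / (1 - R^2), where R^2 comes
    from regressing column j on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)


vifs = {name: vif(X, j) for j, name in enumerate(names)}
for name, value in vifs.items():
    print(f"{name}: VIF = {value:.1f}")
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a linear model's coefficients for those features will be poorly determined.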
Feature engineering.
- Rolling aggregates per growing cycle - mean, min, max, and standard deviation of each environmental variable over the past 7, 14, and 21 days - capturing accumulated stress or favorable conditions rather than instantaneous readings
- Days-since-planting expanded into polynomial terms to approximate the sigmoidal growth curve
- Crop type as one-hot categorical
- Nutrient concentration ratios (N:P, P:K) rather than raw values - relative ratios were more predictive than absolute concentrations, consistent with plant nutrient-uptake dynamics
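The feature steps listed above could be sketched in pandas roughly as follows. This is a minimal illustration on synthetic data, not the project's code: the column names (`cycle_id`, `crop_type`, `n_ppm`, `p_ppm`, `k_ppm`), the two example crops, the cubic degree, and the helper names `synthetic_cycle` and `engineer` are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)


def synthetic_cycle(cycle_id, crop, start, days=30):
    """Hourly synthetic readings for one growing cycle (illustrative values only)."""
    idx = pd.date_range(start, periods=24 * days, freq=pd.Timedelta(hours=1))
    n = len(idx)
    return pd.DataFrame(
        {
            "cycle_id": cycle_id,
            "crop_type": crop,
            "temperature": 22 + rng.normal(scale=1.5, size=n),
            "n_ppm": 150 + rng.normal(scale=10, size=n),
            "p_ppm": 50 + rng.normal(scale=4, size=n),
            "k_ppm": 200 + rng.normal(scale=12, size=n),
        },
        index=idx,
    )


raw = pd.concat([
    synthetic_cycle(1, "lettuce", "2024-01-01"),
    synthetic_cycle(2, "basil", "2024-03-01"),
])


def engineer(frame, env_cols=("temperature",), windows_days=(7, 14, 21)):
    out = frame.copy()

    # Rolling aggregates, computed within each cycle so a window never
    # reaches back into a previous cycle's readings.
    for col in env_cols:
        grouped = out.groupby("cycle_id")[col]
        for days in windows_days:
            window = pd.Timedelta(days=days)
            for stat in ("mean", "min", "max", "std"):
                out[f"{col}_{stat}_{days}d"] = grouped.transform(
                    lambda s, w=window, st=stat: getattr(s.rolling(w), st)()
                )

    # Nutrient ratios in place of raw concentrations.
    out["n_to_p"] = out["n_ppm"] / out["p_ppm"]
    out["p_to_k"] = out["p_ppm"] / out["k_ppm"]

    # Days since planting (taken here as the first timestamp of each cycle),
    # plus polynomial terms; the cubic degree is an assumption.
    ts = out.index.to_series()
    start = ts.groupby(out["cycle_id"]).transform("min")
    out["days"] = (ts - start) / pd.Timedelta(days=1)
    out["days_sq"] = out["days"] ** 2
    out["days_cu"] = out["days"] ** 3

    # One-hot encode crop type.
    return pd.get_dummies(out, columns=["crop_type"])


features = engineer(raw)
print(features.filter(like="temperature_mean").tail(1))
```

Time-based windows are used so that gaps in the hourly log shorten a window's sample count instead of silently stretching its time span.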
Model comparison.
- Decision tree - interpretable baseline with useful feature-importance output
- Random forest - best overall validation performance; reduced variance over the single tree
- XGBoost - marginal improvement with regularization on the larger feature set
- Shallow neural network (2 hidden layers, ReLU) - underperformed the tree ensembles. There was too little data for learned non-linear features to overcome the advantage that the inductive bias of tree ensembles gives them on tabular data
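A comparison like the one above can be set up with cross-validation in scikit-learn. The sketch below is a hedged illustration, not the project's evaluation code. It uses `GradientBoostingRegressor` as a stand-in for XGBoost and `MLPRegressor` as a stand-in for the TensorFlow network, and it splits folds by a hypothetical growing-cycle id so readings from one cycle never appear in both train and validation sets. The data, the grouping scheme, the hyperparameters, and the MAE metric are all assumptions, and scores on synthetic data say nothing about how the models ranked on the real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n_samples, n_cycles = 400, 20

# Synthetic features and hypothetical growing-cycle ids (assumptions).
X = rng.normal(size=(n_samples, 6))
groups = rng.integers(0, n_cycles, size=n_samples)

# Synthetic yield with a threshold effect and an interaction term.
y = (
    3.0 * X[:, 0]
    + 2.0 * np.maximum(X[:, 1], 0.0)
    + 1.5 * X[:, 2] * X[:, 3]
    + rng.normal(scale=0.5, size=n_samples)
)

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # Stand-in for XGBoost, so the sketch needs only scikit-learn.
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
    # Stand-in for the 2-hidden-layer ReLU network; inputs are scaled first.
    "shallow_mlp": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 16), activation="relu",
                     max_iter=2000, random_state=0),
    ),
}

cv = GroupKFold(n_splits=5)
scores = {}
for name, model in models.items():
    # scikit-learn returns negated MAE so that higher is better; flip the sign.
    mae = -cross_val_score(model, X, y, groups=groups, cv=cv,
                           scoring="neg_mean_absolute_error")
    scores[name] = mae.mean()
    print(f"{name:>18}: MAE = {scores[name]:.3f}")
```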
Feature-importance findings. Nitrogen concentration and cumulative light exposure (daily light integral, DLI) were the two highest-impact controllable variables. Temperature contributed meaningfully but showed diminishing returns beyond a crop-specific comfort range. The point of this analysis was not just a yield number - it was to identify where intervention would pay off: the model pointed at nutrient dosing and supplemental lighting as the variables most worth controlling.
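The write-up does not say which importance measure was used. As one option, the sketch below computes permutation importance on held-out data with scikit-learn, which is often preferred over impurity-based importance when features are correlated. The synthetic target is deliberately built so that `nitrogen` and `dli` dominate and temperature saturates, so the output illustrates the mechanics only and is not evidence for the finding above. Feature names and coefficients are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 600
feature_names = ["nitrogen", "dli", "temperature", "humidity"]
X = rng.normal(size=(n, len(feature_names)))

# Synthetic yield (assumption): nitrogen and DLI dominate by construction,
# temperature saturates via tanh, humidity has no effect.
y = (
    2.0 * X[:, 0]
    + 1.5 * X[:, 1]
    + 0.5 * np.tanh(X[:, 2])
    + rng.normal(scale=0.3, size=n)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Shuffle one column at a time on the held-out set and measure the score drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name:>12}: {score:.3f}")
```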
Planning integration. Yield predictions were chained to a time-series demand forecast (exponential smoothing on historical weekly sales by crop type). The output was a weekly recommendation matrix - crop type × recommended planting quantity - sized to meet 4-week-ahead demand within the predicted yield range. That is the step that turns a yield model into a planning tool.
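The chaining step can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses simple exponential smoothing, the most basic variant, whose forecast is flat across horizons, so the 4-week-ahead value equals the last smoothed level. Capturing trend or seasonality would need a Holt or Holt-Winters variant, which is not shown. The sales figures, yield ranges, smoothing factor, and function names are hypothetical.

```python
from math import ceil


def exponential_smoothing_forecast(series, alpha=0.3):
    """Simple exponential smoothing. Returns the last smoothed level, which
    is the (flat) forecast for every future step."""
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def plants_needed(demand_kg, yield_per_plant_kg):
    """Smallest whole number of plants whose total yield covers the demand."""
    return ceil(demand_kg / yield_per_plant_kg)


# Hypothetical weekly sales history per crop, in kg.
weekly_sales_kg = {
    "lettuce": [120, 130, 125, 140, 135, 150],
    "basil": [40, 38, 45, 42, 47, 50],
}

# Hypothetical predicted yield range per plant, in kg: (low, high).
# In the pipeline described above this would come from the yield model.
yield_range_kg = {"lettuce": (0.20, 0.30), "basil": (0.05, 0.08)}

recommendation = {}
for crop, sales in weekly_sales_kg.items():
    demand = exponential_smoothing_forecast(sales)
    low, high = yield_range_kg[crop]
    recommendation[crop] = {
        "forecast_kg": round(demand, 1),
        "plants_min": plants_needed(demand, high),  # if yield hits the high end
        "plants_max": plants_needed(demand, low),   # if yield hits the low end
    }

for crop, row in recommendation.items():
    print(crop, row)
```

Reporting a minimum and maximum planting quantity keeps the yield uncertainty visible in the recommendation instead of collapsing it into a single number.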
Stack
Python, Scikit-learn, XGBoost, TensorFlow, Pandas, NumPy, Matplotlib