Feature engineering is where most ML projects spend most of their time. It’s also where subtle bugs live that don’t surface until production — when the features your model sees at serving time differ slightly from the features it saw at training time, and accuracy quietly degrades.
The core question when designing a feature pipeline: where should features be computed, and how do you ensure consistency between training and serving? For Google Cloud ML workloads, this usually comes down to BigQuery versus TensorFlow Transform (TFT).
What Each Tool Is
BigQuery
A fully managed, serverless analytics data warehouse. You write SQL; Google’s infrastructure figures out how to execute it at petabyte scale. No cluster management, no infrastructure sizing — just SQL and pricing based on data scanned.
Key characteristics:
- Standard SQL interface (familiar, accessible to non-engineers)
- Serverless: scales automatically, no capacity planning
- Very fast for batch queries over large datasets stored in BigQuery
- Not designed for low-latency online serving
- Stateless by default — computes features on demand from raw data
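As a minimal sketch of the interaction model, here is the Python client run against a public dataset (assumes application-default credentials are configured; the query itself is illustrative):

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # serverless: no cluster to size or provision

# Any standard SQL; BigQuery plans and executes it automatically.
job = client.query("""
    SELECT corpus, COUNT(*) AS n
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY n DESC
    LIMIT 5
""")

for row in job.result():
    print(row.corpus, row.n)

print(job.total_bytes_processed)  # you pay per byte scanned, not per node-hour
```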
TensorFlow Transform (TFT)
A preprocessing library that integrates Apache Beam (an open-source distributed data processing framework, run on Google Cloud as Dataflow) with TensorFlow. It lets you define transformations in Python that run:
- As a Beam pipeline during training (computing full-pass statistics over the dataset)
- As a TensorFlow graph during serving (applying those same statistics at inference time)
The critical feature: the transformation logic and any statistics computed during training (means, standard deviations, vocabularies) are embedded into the exported TensorFlow SavedModel. Serving applies exactly the same transformations the model was trained with.
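A minimal sketch of what this looks like in code. The `preprocessing_fn` below is the standard TFT entry point; the column names are hypothetical. Each `tft.*` analyzer call triggers a full pass over the training data in Beam, and the resulting constants are compiled into the serving graph:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    """Runs as a Beam pipeline at training time; exported as a TF graph."""
    return {
        # Full-pass analyzer: mean and std are computed over the whole
        # training set, then frozen into the SavedModel as constants.
        'revenue_scaled': tft.scale_to_z_score(inputs['revenue']),
        # Full-pass analyzer: the vocabulary is computed once and embedded.
        'country_id': tft.compute_and_apply_vocabulary(inputs['country']),
        # Stateless op: needs no statistics, plain TensorFlow.
        'log_revenue': tf.math.log1p(inputs['revenue']),
        'label': inputs['label'],
    }
```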
The Core Tradeoff: Skew vs Simplicity
Training-serving skew is what happens when features computed at training time differ from features computed at serving time. Common causes:
- Different code paths (Pandas in training, Java in production)
- Different statistical windows (training uses 30-day stats computed on batch data; serving uses a real-time approximation)
- Preprocessing bugs that affect one pipeline but not the other
Skew is insidious because it often causes only a small performance drop — enough to miss if you’re not looking for it, but enough to matter at scale.
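To make the second cause concrete, here is a toy example (made-up numbers) of how a normalized feature drifts when serving re-computes statistics from a fresh window instead of reusing the training-set statistics:

```python
revenue = 120.0

train_mean, train_std = 100.0, 25.0  # full-pass stats over the training set
serve_mean, serve_std = 90.0, 30.0   # approximation from a recent window

x_train = (revenue - train_mean) / train_std  # 0.8: what the model learned on
x_serve = (revenue - serve_mean) / serve_std  # 1.0: what it sees in production
```

Every serving-time feature value is shifted relative to what the model was trained on, yet nothing crashes and no error is logged.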
The BigQuery-vs-TFT decision is fundamentally: how much risk of skew are you willing to accept, and what do you gain in exchange?
When BigQuery Wins
Batch scoring at scale: if your use case is generating predictions on a large dataset (hourly batch scoring, overnight recommendation generation), BigQuery is excellent. Write the feature engineering SQL once; run the same SQL for both training data preparation and batch inference. No skew because it’s literally the same code.
Rapid iteration: SQL is fast to write and fast to debug. For early-stage projects where you’re still figuring out which features matter, BigQuery lets you experiment without building pipeline infrastructure.
Stateless transformations: features like log(revenue), is_weekend, days_since_signup, ratio of two columns — these don’t require statistics computed over the training set. They can be computed identically in SQL at any time.
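A sketch of these stateless features in BigQuery SQL, wrapped in the Python client (table and column names are hypothetical). Because nothing here depends on training-set statistics, the identical query can feed both training-data preparation and batch inference:

```python
from google.cloud import bigquery

# Hypothetical schema: pure row-level transforms, no full-pass statistics.
STATELESS_FEATURES = """
SELECT
  user_id,
  LN(revenue + 1) AS log_revenue,
  EXTRACT(DAYOFWEEK FROM event_ts) IN (1, 7) AS is_weekend,  -- 1=Sun, 7=Sat
  DATE_DIFF(CURRENT_DATE(), signup_date, DAY) AS days_since_signup,
  clicks / NULLIF(impressions, 0) AS click_ratio
FROM `project.dataset.events`
"""

client = bigquery.Client()
features = client.query(STATELESS_FEATURES).to_dataframe()
```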
Non-ML analytics teams: if your company runs on BigQuery and your analysts write SQL, keeping feature engineering in SQL makes your pipeline legible to people who don’t know Python or TensorFlow.
When TFT Wins
Full-pass transformations that need consistency: normalization (subtract mean, divide by std), vocabulary encoding (map categories to integer IDs), and similar transformations require statistics computed over the training set. During serving, you must apply the training-set statistics — not re-compute from fresh data, which would change the feature values.
TFT computes these statistics during training and stores them in the model artifact. Serving retrieves them from the artifact, so skew in this class of transformation is ruled out by construction.
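For illustration, this is roughly how the serving side picks up those frozen statistics (the output path is hypothetical; `transform_features_layer` returns a Keras layer that replays the training-time transform graph):

```python
import tensorflow as tf
import tensorflow_transform as tft

# Artifacts written by the Beam analysis pass at training time.
tft_output = tft.TFTransformOutput('gs://my-bucket/transform_output')

# A Keras layer that replays the frozen training-time statistics at
# inference; no statistics are recomputed from serving traffic.
transform_layer = tft_output.transform_features_layer()

raw = {
    'revenue': tf.constant([120.0]),
    'country': tf.constant(['DE']),
}
transformed = transform_layer(raw)  # uses training-set mean/std and vocabulary
```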
Online prediction (low latency): if you need to serve predictions in under 100ms in response to user actions, BigQuery is too slow — you can’t run a BigQuery query per request. TFT-embedded preprocessing runs inside the model serving container, adding microseconds, not seconds.
Embedded preprocessing as a correctness guarantee: with TFT, the serving code literally cannot use different preprocessing than the training code — they’re both part of the same SavedModel. This is a stronger guarantee than code review or documentation.
The Decision Matrix
| Scenario | BigQuery | Dataflow + TFT | TF input_fn |
|---|---|---|---|
| Stateless transforms, batch scoring | ✓ Best | ✓ OK | ✗ Not recommended |
| Stateful transforms (mean/std), batch scoring | ✗ Not recommended | ✓ Best | ✗ Not possible |
| Stateless transforms, online prediction | ✗ Too slow | ✓ OK | ✗ Not recommended |
| Stateful transforms, online prediction | ✗ Not recommended | ✓ Best | ✗ Not possible |
| Real-time window aggregations | ✗ N/A | ✓ OK (streaming) | ✗ Not possible |
Practical Approach: Start with BigQuery, Graduate to TFT
A pattern that works well in practice:
Phase 1 — exploration and baseline: compute all features in BigQuery SQL. Extract the training data to CSV files or a BigQuery table. Train your model. This phase is about finding the right features quickly.
Phase 2 — productionization: identify which features require full-pass statistics (normalization, vocabulary encoding). Move those specific transformations to TFT. Keep stateless transformations in BigQuery if batch scoring is acceptable.
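A sketch of what the Phase 2 split might look like (hypothetical columns): stateless features arrive precomputed from BigQuery and pass through untouched, while only the full-pass transforms live in TFT:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    return {
        # Already computed in BigQuery SQL upstream; passed through as-is.
        'is_weekend': inputs['is_weekend'],
        'days_since_signup': inputs['days_since_signup'],
        # Full-pass transforms that must reuse training-set statistics.
        'demand_scaled': tft.scale_to_z_score(inputs['demand']),
        'store_id': tft.compute_and_apply_vocabulary(inputs['store']),
    }
```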
Phase 3 — online serving: if you need online prediction, move the entire preprocessing chain into TFT so it runs inside the model server. Use BigQuery for training data preparation only.
This graduated approach avoids over-engineering early while reaching the right architecture for production requirements.
The Blue Yonder Experience
At Blue Yonder, we maintained two parallel feature pipelines for demand forecasting — a BigQuery-based one for batch training data and a Dataflow-based one for online scoring. The two pipelines diverged over 18 months of independent development to the point where the “same” features differed enough to cause measurable model degradation.
The fix — migrating to TFT where preprocessing was embedded in the model artifact — eliminated the skew and made the model’s behavior at serving time verifiably identical to its behavior at training time. The migration cost was significant. The operational stability improvement was worth it.
The lesson: for stateless transformations and pure batch workloads, BigQuery is simpler and often better. For anything involving full-pass statistics or online serving, TFT’s training-serving consistency guarantee pays for itself in avoided debugging time.