Product analytics sounds straightforward: measure user behavior, run experiments, ship what works, cut what doesn’t. In practice, every step in that chain has failure modes that are easy to miss and expensive to ignore.
This is a list of the landmines, explained as specifically as possible, along with how to avoid each one.
Landmine 1: Survivorship Bias in A/B Tests
You run an A/B test. Variant B shows a 15% improvement in your target metric. You ship variant B. Six months later, the metric is back where it started.
One possible explanation: survivorship bias in user segments. A/B tests measure the behavior of users who showed up during the test window. If variant B changes who shows up — attracting users who are inherently more engaged, or retaining users who would otherwise have churned — the metric improvement reflects a population change, not a genuine product improvement.
The fix: look beyond aggregate metrics. Segment results by user cohort (new users vs. returning), by acquisition channel, and by days-since-signup. If the effect is concentrated in one segment that has different baseline behavior, the test may be measuring selection rather than causation.
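As a rough sketch of what that breakdown looks like, assuming the experiment log lives in a pandas DataFrame with hypothetical `variant`, `cohort`, and `converted` columns:

```python
import pandas as pd

# Hypothetical experiment log: one row per user, with the segment
# attributes needed to break the result down.
events = pd.DataFrame({
    "variant":  ["A", "A", "B", "B", "A", "B", "A", "B"],
    "cohort":   ["new", "returning", "new", "returning",
                 "new", "new", "returning", "returning"],
    "converted": [0, 1, 1, 1, 0, 1, 1, 0],
})

# Conversion rate per variant within each cohort. If the lift only
# shows up in one cohort with a different baseline, suspect selection.
by_segment = (
    events.groupby(["cohort", "variant"])["converted"]
          .agg(["mean", "count"])
          .rename(columns={"mean": "conversion_rate", "count": "users"})
)
print(by_segment)
```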
Also: check whether variants differ in traffic volume, not just conversion rate (a sample ratio mismatch). If variant B gets fewer users than expected (maybe it was slower to load), the users who waited for it to load are already more engaged than average.
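A quick way to catch that imbalance is a chi-square goodness-of-fit test on the observed counts against the intended split; a minimal sketch with made-up numbers:

```python
from scipy.stats import chisquare

# Hypothetical observed user counts per variant under an intended 50/50 split.
observed = [50_400, 48_100]
expected = [sum(observed) / 2] * 2

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (e.g. < 0.001) means the split itself is broken,
# so conversion comparisons between the variants can't be trusted.
print(f"chi2={stat:.1f}, p={p_value:.2e}")
```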
Landmine 2: Novelty Effects
Users engage with new things. When you ship a new feature, engagement often spikes — not because the feature is good, but because it’s new. The novelty effect decays over 2–6 weeks as users habituate.
If your A/B test runs for 1 week, a novelty spike in variant B will look like a real improvement. You ship it. The effect disappears.
The fix: run tests long enough to outlast the novelty window. For major UI changes, this means at least 3–4 weeks. Look at the temporal distribution of the treatment effect — if the advantage concentrates in the first few days and shrinks later, it’s probably novelty.
The inverse also exists: learning effects. New interfaces take time for users to master. A complex but better feature may underperform in a short test because users haven’t learned it yet.
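One way to see both patterns is to look at the treatment effect per experiment day rather than as a single aggregate. A minimal sketch, assuming a hypothetical per-user, per-day log with `day`, `variant`, and `metric` columns:

```python
import pandas as pd

# Hypothetical per-user, per-day experiment log.
log = pd.DataFrame({
    "day":     [1, 1, 1, 1, 2, 2, 2, 2],
    "variant": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "metric":  [1.0, 2.0, 1.2, 1.8, 1.1, 1.2, 0.9, 1.1],
})

# Mean metric per variant per day, then the B-minus-A gap over time.
# A gap that is large early and shrinks later points to novelty;
# a gap that grows over time points to a learning effect.
daily = log.groupby(["day", "variant"])["metric"].mean().unstack("variant")
effect_over_time = daily["B"] - daily["A"]
print(effect_over_time)
```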
Landmine 3: Goodhart’s Law in Metric Selection
“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law.
You optimize for click-through rate; the product team figures out that clickbait thumbnails increase CTR. You optimize for daily active users; the team adds notification spam that brings people back but destroys their experience. You optimize for revenue per session; the team raises prices, which helps revenue short-term but increases churn.
Every metric you target will be gamed — not necessarily intentionally, but inevitably, because teams optimize toward what they’re measured on.
The fix: never optimize a single metric in isolation. Use a constellation of metrics: a primary metric you’re trying to move, plus guardrail metrics you’re required not to harm. A feature that increases CTR but harms time-on-site or satisfaction scores fails even if the primary metric improves.
Periodically rotate the primary metric so no single metric becomes entrenched enough to structurally distort product decisions.
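In code, the shipping decision might look something like the sketch below; the metric names and thresholds are made up, and the point is simply that the primary metric must improve while no guardrail falls past its floor:

```python
# Hypothetical experiment readout: relative change per metric (B vs. A).
results = {
    "ctr": +0.06,            # primary metric
    "time_on_site": -0.04,   # guardrail
    "satisfaction": -0.01,   # guardrail
}

PRIMARY = "ctr"
GUARDRAILS = {"time_on_site": -0.02, "satisfaction": -0.02}  # max allowed drop

primary_wins = results[PRIMARY] > 0
guardrails_hold = all(results[m] >= floor for m, floor in GUARDRAILS.items())

# Ships only if the primary metric improves AND every guardrail holds.
ship = primary_wins and guardrails_hold
print(f"ship={ship}")  # False here: time_on_site dropped past its floor
```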
Landmine 4: Simpson’s Paradox — Aggregate vs Segment-Level Analysis
A treatment improves conversion for desktop users (+5%) and for mobile users (+3%), but the aggregate appears to show a decrease (-1%). How?
Simpson’s paradox: when a confounding variable is unevenly distributed between treatment and control, the aggregate result can point in the opposite direction from the effect seen in every individual segment.
In this example: if variant B happened to get more mobile traffic (which has inherently lower conversion), the mobile traffic mix shift can drag the aggregate below variant A, even though B is better within each segment.
The fix: always disaggregate results by major segments before drawing conclusions. Check that segment composition is similar between variants. If it isn’t, use weighted aggregation that accounts for the composition difference.
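The arithmetic is easy to reproduce. Here is a worked example with made-up counts: variant B wins within each segment but loses in the naive aggregate, and a composition-weighted comparison recovers the within-segment result:

```python
# Made-up counts: (conversions, users) per (variant, device) cell.
cells = {
    ("A", "desktop"): (900, 9_000),   # 10.0%
    ("A", "mobile"):  (20,  1_000),   #  2.0%
    ("B", "desktop"): (105, 1_000),   # 10.5%  -> B wins within desktop
    ("B", "mobile"):  (189, 9_000),   #  2.1%  -> B wins within mobile
}

def rate(conversions, users):
    return conversions / users

# Naive aggregate: B looks far worse because it got mostly mobile traffic.
for v in ("A", "B"):
    conv  = sum(c for (var, _seg), (c, _u) in cells.items() if var == v)
    users = sum(u for (var, _seg), (_c, u) in cells.items() if var == v)
    print(v, "aggregate:", f"{rate(conv, users):.2%}")   # A 9.20%, B 2.94%

# Composition-adjusted comparison: weight each segment by its pooled share,
# so both variants are evaluated on the same traffic mix.
total_users = sum(u for _c, u in cells.values())
weights = {
    seg: sum(u for (_v, s), (_c, u) in cells.items() if s == seg) / total_users
    for seg in ("desktop", "mobile")
}
for v in ("A", "B"):
    adj = sum(weights[seg] * rate(*cells[(v, seg)]) for seg in ("desktop", "mobile"))
    print(v, "weighted:", f"{adj:.2%}")                  # A 6.00%, B 6.30%
```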
Landmine 5: SUTVA Violations — When A Affects B’s Users
The Stable Unit Treatment Value Assumption (SUTVA) requires that a user’s outcome depends only on their own treatment assignment, not others’. This assumption breaks in social products.
Example: you A/B test a feature that makes it easier to send messages. Users in variant B (with the feature) send more messages to friends in variant A (without the feature). Variant A users receive more messages and become more active — contaminating the control group.
Network effects, shared content, social proof, and inventory competition (two-sided marketplaces) all create SUTVA violations.
The fix: cluster randomization — assign treatment at the level of the social cluster, not the individual user. Randomize by city, network neighborhood, or friend group instead of by user. This is more expensive (requires larger samples for the same power) but avoids contamination.
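A minimal sketch of cluster-level assignment: hash the cluster identifier (a hypothetical `cluster_id` such as a city or friend group), not the user ID, so every user in the same cluster lands in the same arm:

```python
import hashlib

def assign_variant(cluster_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically map a whole cluster to one experiment arm.

    Every user in the same cluster gets the same variant, which keeps
    treated and untreated users from interacting across arms.
    """
    digest = hashlib.sha256(f"{experiment}:{cluster_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Example: all users in the same city share an assignment.
print(assign_variant("city:austin", "messaging_v2"))
print(assign_variant("city:austin", "messaging_v2"))  # same result, deterministic
```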
Alternatively, measure only isolatable outcomes that don’t travel across the network — but this often excludes the metrics you most care about.
Landmine 6: Not Running Experiments at All
The opposite failure mode: reserving experiments for the rare occasions when there is a clear hypothesis and plenty of traffic, and treating every other product decision as an exercise in informal causal reasoning.
In practice, many teams either:
- Ship without measuring (no experiment, no outcome tracking)
- Measure but don’t experiment (observe correlations, assume causation)
- Run underpowered experiments (too short or too small to detect a realistic effect)
All three lead to systematically wrong intuitions about what drives product outcomes.
The fix: instrument everything by default. Determine minimum detectable effect and required sample size before starting an experiment. Use a power calculator (most A/B testing platforms have one) and accept that small effects need large samples — don’t shortcut this.
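A sketch of that calculation using statsmodels, with a made-up baseline rate and minimum detectable effect; the output is the required sample size per variant:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10          # current conversion rate (assumed)
mde_relative = 0.05      # smallest lift worth detecting: +5% relative
target = baseline * (1 + mde_relative)

# Cohen's h effect size for the two proportions, then solve for n per arm
# at the usual alpha = 0.05 (two-sided) and 80% power.
effect = proportion_effectsize(target, baseline)
n_per_variant = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_variant:,.0f} users per variant")
```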
For decisions too small to justify a full experiment: document the decision, write down the predicted effect, and track outcomes for 4 weeks post-launch. Retrospective evaluation isn’t causal but builds institutional knowledge about what kinds of intuitions tend to be right.
Landmine 7: The Causal Inference Gap
Machine learning models predict. They do not explain causation. This distinction matters enormously in product analytics.
A model that predicts user churn based on behavioral features tells you who will churn. It doesn’t tell you why they churn. Worse, intervening based on model predictions can fail:
- The model predicts that users who read help documentation are less likely to churn. You add more help documentation. Churn doesn’t change — because help documentation access was a proxy for user engagement, not a cause of retention.
Prediction vs causation:
- Prediction: given what you can observe, estimate a future outcome
- Causal inference: given an intervention, estimate how the outcome changes relative to the counterfactual
Most ML models predict. Experiments estimate causation. If you want to know whether a product change improves retention, run an experiment. If you want to know which users are at risk, train a model. Don’t confuse the two.
When experiments aren’t feasible (you can’t A/B test a pricing change during a recession), causal inference methods — difference-in-differences, synthetic control, regression discontinuity — can extract causal estimates from observational data. These require more assumptions and expertise than a randomized experiment, but they’re tractable.
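As a taste of the observational route, here is a minimal difference-in-differences estimate on made-up group averages: the effect estimate is the treated group's before-to-after change minus the comparison group's change, which is only valid under the parallel-trends assumption:

```python
# Made-up mean outcome (e.g. weekly retention) per group and period.
means = {
    ("treated",    "before"): 0.42,
    ("treated",    "after"):  0.47,
    ("comparison", "before"): 0.40,
    ("comparison", "after"):  0.43,
}

# Difference-in-differences: change in treated minus change in comparison.
# Relies on parallel trends: absent the intervention, both groups would
# have moved by the same amount.
did = (means[("treated", "after")] - means[("treated", "before")]) \
    - (means[("comparison", "after")] - means[("comparison", "before")])
print(f"estimated effect: {did:+.2f}")  # +0.02 here
```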
Landmine 8: Goals and Motivation as Unmeasured Variables
Social products are open systems with many unmeasured variables. The most important — user goals and motivations — are almost never directly observed.
Two users with identical session patterns might have completely different motivations for using a product. A feature that serves one group’s motivation well might be irrelevant or harmful to another’s. Aggregate metrics flatten this heterogeneity.
Practically, this means:
- Segment by user intent when you can identify it (new user vs. power user, task-oriented vs. browsing)
- Run user research in parallel with quantitative analysis — qualitative data fills gaps that clickstream data can’t
- Be skeptical of any model that claims to predict behavior without proxying for motivation at all
Quantitative metrics tell you what happened. They rarely tell you why. The “why” requires understanding user goals, and that requires talking to users.
The Mental Model for Trustworthy Product Analytics
Before trusting any analytical result:
- Could the result be explained by a change in population composition? (Survivorship bias, Simpson’s paradox)
- Is the measured effect likely to persist? (Novelty effects, learning curves)
- Is the metric the right proxy for what we actually want to improve? (Goodhart’s Law)
- Does the intervention affect only the users it’s supposed to affect? (SUTVA, network effects)
- Are we mistaking prediction for causation? (ML model outputs vs. experiment results)
These questions don’t eliminate ambiguity — product analytics is irreducibly messy. But asking them consistently produces better decisions than taking metrics at face value.