ml · 11 min read · 28 January 2024

Count Data Models and Probabilistic Forecasting

When your target variable is a non-negative integer, standard regression breaks down. A practical guide to Poisson, negative binomial, and zero-inflated models — and when each one applies.

Most practitioners default to linear regression or a standard neural network for any prediction task. That works fine when your target variable is a continuous quantity like revenue or temperature. When your target variable is a count — number of purchases, doctor visits, insurance claims, failed machine cycles — the assumptions break down and so do the predictions.

Count data requires specialized treatment. Here’s why, and what to use instead.

When Is This Actually Count Data?

Count data has specific properties:

  1. Values are non-negative integers: 0, 1, 2, … — never negative, never fractional
  2. The distribution is discrete and often right-skewed, with mass concentrated on small values and zeros
  3. The variance typically grows with the mean

Real examples:

  1. Purchases per customer per month
  2. Doctor visits per patient per year
  3. Insurance claims per policy
  4. Failed machine cycles per production run

What makes these different from continuous prediction tasks: a linear regression can predict -3 doctor visits. A count model cannot. The functional form matters.

The Poisson Model

The foundational count data model. The Poisson distribution gives the probability of observing $y$ events as:

$$P(Y = y) = \frac{e^{-\mu}\mu^y}{y!}$$

where $\mu$ is the rate parameter — the expected count given the covariates.

In a Poisson regression, $\mu$ is modeled as:

$$\mu_i = \exp(x_i^\top \beta)$$

The log link ensures predictions are always non-negative, regardless of the values of $x$.

The Equidispersion Assumption

The Poisson distribution has one defining constraint: equidispersion — the mean equals the variance.

$$E(Y|x) = \text{Var}(Y|x) = \mu$$

This is mathematically elegant and empirically wrong for most real datasets. In practice, count data almost always exhibits overdispersion — the variance exceeds the mean. This happens when there’s unobserved heterogeneity: different units have different underlying rates that you can’t fully explain with your covariates.

Using Poisson when the data is overdispersed underestimates standard errors and produces overconfident predictions. Test for this before accepting Poisson as your model.

The Negative Binomial Model

The negative binomial model relaxes the equidispersion constraint. Its variance is:

$$\text{Var}(Y|x) = \mu + \alpha\mu^2$$

where $\alpha$ is the overdispersion parameter. When $\alpha = 0$, you get back to Poisson. When $\alpha > 0$ (the common case), the model accommodates extra variance.
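The variance formula above can be checked numerically. SciPy parameterizes the negative binomial as `nbinom(n, p)`; under the mapping $n = 1/\alpha$, $p = 1/(1 + \alpha\mu)$ (the values of $\mu$ and $\alpha$ below are arbitrary illustrations), the mean and variance come out as $\mu$ and $\mu + \alpha\mu^2$:

```python
from scipy import stats

mu, alpha = 4.0, 0.5  # illustrative mean and overdispersion

# Map the (mu, alpha) parameterization to scipy's (n, p)
n_param = 1.0 / alpha
p_param = 1.0 / (1.0 + alpha * mu)
dist = stats.nbinom(n_param, p_param)

print(dist.mean())  # 4.0  = mu
print(dist.var())   # 12.0 = mu + alpha * mu**2
```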

Testing for Overdispersion

Fit the negative binomial model, which estimates $\alpha$, and test whether $\alpha$ is significantly different from zero — for example with a likelihood-ratio test against the Poisson fit, or by checking the Pearson dispersion statistic of the Poisson model (values well above 1 indicate overdispersion).

A significant overdispersion parameter is grounds to prefer negative binomial over Poisson.

Incidence Rate Ratios

Negative binomial (and Poisson) coefficients are usually reported as incidence rate ratios (IRRs) rather than raw coefficients: IRR = $\exp(\beta)$.

Interpretation: if IRR for age is 1.4, then a one-unit increase in age is associated with a 40% increase in the expected count, holding other variables constant.
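Converting a fitted coefficient to an IRR is a one-liner (the coefficient value for "age" below is hypothetical):

```python
import numpy as np

beta_age = 0.336  # hypothetical fitted log-rate coefficient
irr = np.exp(beta_age)
print(irr)  # ~1.40: each one-unit increase multiplies the expected count by ~1.4
```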

Zero-Inflated Models

Some datasets have far more zeros than any Poisson or negative binomial model would predict. This is the excess zeros problem.

Example: survey of hiking trips in the past year. Many respondents answer zero — but for different reasons:

  1. They dislike hiking and will never go (structural zeros)
  2. They like hiking but just didn’t get around to it this year (sampling zeros)

A single Poisson model treats all zeros the same. That’s wrong — the two types of zeros come from fundamentally different processes.

Zero-Inflated Poisson (ZIP)

The ZIP model mixes two processes:

$$P(Y = 0) = f_1(0) + (1 - f_1(0)) \cdot f_2(0)$$

$$P(Y = y) = (1 - f_1(0)) \cdot f_2(y) \quad \text{for } y \geq 1$$

$f_1$ is a binary model (logistic regression) that generates structural zeros. $f_2$ is a Poisson model for the count process among non-structural-zero units.

Different covariates can drive the two processes: whether someone goes hiking at all might depend on whether they live near mountains; how often they go might depend on their fitness level.

Hurdle Models

Similar to zero-inflated but with a different generative story: a binary hurdle determines whether the count is zero or positive. Once the hurdle is crossed, a truncated count model predicts the positive count.

$$P(Y = 0) = f_1(0)$$

$$P(Y = y) = \frac{1 - f_1(0)}{1 - f_2(0)} \cdot f_2(y) \quad \text{for } y \geq 1$$

The distinction from zero-inflated models: in a hurdle model, the positive outcomes are generated by a separate process that has been truncated at zero. In a zero-inflated model, the count process itself can generate zeros.

For most practical purposes, both specifications give similar results. Choose based on which generative story fits your domain.
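The hurdle formulas can be verified directly: scaling a zero-truncated Poisson by the mass that clears the hurdle yields a proper distribution (the hurdle probability and rate below are arbitrary illustrations):

```python
import numpy as np
from scipy import stats

pi0 = 0.6  # P(Y = 0): probability of not clearing the hurdle (assumed)
mu = 3.0   # rate of the underlying Poisson count process (assumed)

f2 = stats.poisson(mu)
y = np.arange(1, 200)

# Positive counts: zero-truncated Poisson, rescaled to total mass (1 - pi0)
p_positive = (1 - pi0) / (1 - f2.pmf(0)) * f2.pmf(y)

total = pi0 + p_positive.sum()
print(total)  # ~1.0: the hurdle specification is a proper distribution
```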

Bayesian Forecasting and Credible Intervals

A key advantage of probabilistic count models over point-prediction approaches: they produce probability distributions over outcomes, not just expected values.

This enables:

Credible intervals (the Bayesian counterpart of confidence intervals): “We are 80% confident that Q3 claims will be between 120 and 190.” These are direct probability statements about the quantity of interest, unlike frequentist confidence intervals.

Predictive distributions: the full distribution over future counts. Useful for inventory planning, resource allocation, and risk assessment — you need to know the tail of the distribution, not just the mean.

Decision-making under uncertainty: with a predictive distribution, you can optimize decisions directly. “Set safety stock to the 95th percentile of demand distribution” is a tractable decision rule. “Set safety stock to mean demand + 2 standard deviations” is an approximation that ignores distributional shape.
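The contrast between the two decision rules is easy to see with a negative binomial predictive distribution (the forecast mean and overdispersion below are illustrative):

```python
from scipy import stats

mu, alpha = 20.0, 0.4  # illustrative forecast mean and overdispersion
n_param = 1.0 / alpha
p_param = 1.0 / (1.0 + alpha * mu)
demand = stats.nbinom(n_param, p_param)

# Percentile rule: read the quantile straight off the predictive distribution
safety_stock = demand.ppf(0.95)

# Normal-style approximation: ignores skew and discreteness
naive = demand.mean() + 2 * demand.std()

print(safety_stock, naive)  # the two rules disagree for skewed count data
```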

Credible vs Confidence Intervals

A frequentist 95% confidence interval means: if you repeated the data collection and estimation many times, 95% of the resulting intervals would contain the true parameter. The interval either contains the parameter or it doesn’t — probability doesn’t apply to the specific interval you computed.

A Bayesian 95% credible interval means: given the data you observed, there is a 95% probability that the true parameter lies in this interval. This is the intuitive interpretation most people wrongly ascribe to confidence intervals.

For decision support, credible intervals are more useful.

Practical Checklist

When you encounter count data:

  1. Start with Poisson. Fit it, check for overdispersion.
  2. If overdispersed: switch to negative binomial.
  3. If excess zeros: check whether they’re structural (zero-inflated) or threshold-based (hurdle). Domain knowledge matters here.
  4. For prediction uncertainty: use a Bayesian formulation to get predictive distributions. Stan and PyMC (formerly PyMC3) both support count likelihoods; scikit-learn’s BayesianRidge is the analogous tool for continuous targets.
  5. Interpret with IRRs: don’t report raw coefficients — report $\exp(\beta)$ as incidence rate ratios.
  6. Check calibration: does the predicted distribution match observed counts? Plot predicted vs observed count distributions.
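The calibration check in step 6 can be sketched in a few lines: compare the observed frequency of each count to the frequency a fitted model predicts. Here a deliberately misspecified Poisson (matched only on the mean of overdispersed data) badly underpredicts zeros — exactly the mismatch the check is meant to surface:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Overdispersed data: exponential-mixed Poisson with overall mean 3
y = rng.poisson(3.0 * rng.gamma(1.0, 1.0, size=5000))

# Poisson with the same mean, then observed vs predicted frequencies
mu_hat = y.mean()
for k in range(5):
    observed = (y == k).mean()
    predicted = stats.poisson(mu_hat).pmf(k)
    print(k, round(observed, 3), round(predicted, 3))
```

The observed zero fraction (about 0.25 here) far exceeds the Poisson prediction (about 0.05), a clear calibration failure.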

Supply Chain Application

At Blue Yonder, we used negative binomial models for demand forecasting at the SKU level. Raw demand data is count-valued (number of units sold per day per store), heavily overdispersed (demand is lumpy — often zero, occasionally a large spike), and often zero-inflated (some SKUs go through stockout periods that look like zero demand but aren’t).

A naive approach using ARIMA or linear regression produced forecasts that were systematically too smooth — they averaged away the lumpiness that actually drives inventory planning. Switching to negative binomial models with store-level random effects captured the overdispersion and produced better safety stock recommendations.

The model choice isn’t academic. Getting the distributional form right directly improves the downstream decision.

count-data poisson negative-binomial probabilistic-forecasting statistics regression

Let's collaborate!

Whether you need a quantitative researcher, a machine learning systems builder, or a technical advisor — I'm available for select consulting engagements.

Get in Touch →