I’m trying to fit a binned regression model, based on the excellent example here. In my case the generative process is producing count data (specifically a number of absence days from work), and the counts are heavily over-dispersed, so I’m using a zero-inflated negative binomial rather than a normal distribution.
This is the basic setup:
import pandas as pd
import pymc as pm
import pytensor.tensor as pt
# The observed counts in each bin
counts_df = pd.DataFrame.from_dict({"0": 18832417,
"1-7": 7967105,
"8-14": 1571958,
"15-30": 1230221,
"31-180": 894474,
">180": 112582}, orient="index", columns=["count"])
bin_edges = [1, 8, 15, 31, 181]
# Mean of the prior for p_g, based on the expected mean of outcome
N = 256 # Working days in year
MU = 4.2 # Average, given in another data source. This should equal the mean of the distribution
p_bar = N / (MU+N) # Calculated value of the prior for p, the nbinom event success probability, based on the expected mean
psi_bar = (obs_counts_df.loc["0"] / obs_counts_df.sum()).item() # Calculated value for psi, the inflated probability of a zero, based on the observed data
with pm.Model() as model:
# Loose priors, with initial values calculated from the data where I hope the distributions will converge
p_g = pm.Beta("p_g", alpha=1, beta=1, initval=p_bar)
psi_g = pm.Beta("psi_g", alpha=1, beta=1, initval=psi_bar)
# Generative process
probs = pm.math.exp(pm.logcdf(pm.ZeroInflatedNegativeBinomial.dist(p=p_g, psi=psi_g, alpha=N), bin_edges))
probs = pm.math.concatenate([[0], probs, [1]])
probs = pt.extra_ops.diff(probs)
# Likelihood
pm.Multinomial("counts", p=probs, n=obs_counts_df.sum(), observed=obs_counts_df)
with model:
trace = pm.sample()
prior = pm.sample_prior_predictive(var_names=["p_g", "psi_g"])
post = pm.sample_posterior_predictive(trace)
idata = pm.to_inference_data(trace=trace)
idata.extend(prior)
idata.extend(post)
Even with the specified initial values, this errors out with:
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'p_g_logodds__': array(4.79357683), 'psi_g_logodds__': array(0.34441228)}
Initial evaluation results:
{'p_g': -4.81, 'psi_g': -1.42, 'counts': -inf}
I assume that the model doesn’t plausibly describe the data, which is a problem. I tried generating some data using my expected model, and fitting to that, as follows:
import numpy as np
# Generated zero-inflated negative binomial data
size = 10000
bin_edges_inc = np.array([0, 1, 8, 15, 31, 181, N]) - 0.1
x_df = pd.DataFrame(np.random.negative_binomial(n=N, p=p_bar, size=size), columns=["x"])
x_df.loc[x_df.sample(int(size*psi_bar)).index.values] = 0
gen_counts = x_df["x"].value_counts(bins=bin_edges_inc).values
with pm.Model() as model_gen:
# Prior
p_g = pm.Beta("p_g", alpha=1, beta=1, initval=p_bar)
psi_g = pm.Beta("psi_g", alpha=1, beta=1, initval=psi_bar)
# Generative process
probs = pm.math.exp(pm.logcdf(pm.ZeroInflatedNegativeBinomial.dist(p=p_g, psi=psi_g, alpha=N), bin_edges))
probs = pm.math.concatenate([[0], probs, [1]])
probs = pt.extra_ops.diff(probs)
# Likelihood
pm.Multinomial("counts", p=probs, n=gen_counts.sum(), observed=gen_counts)
with model_gen:
gen_trace = pm.sample()
gen_prior = pm.sample_prior_predictive(var_names=["p_g", "psi_g"])
gen_post = pm.sample_posterior_predictive(gen_trace)
gen_idata = pm.to_inference_data(trace=gen_trace)
gen_idata.extend(gen_prior)
gen_idata.extend(gen_post)
And indeed, this seems to progress more smoothly, apart from warnings about reaching the maximum tree depth.
This brings me to my first question Q1: does this show that I was right about the model not being plausible generator for the observed data?
Moving on, when I try to plot the trace for the generated model with az.plot_trace(gen_idata)
I get an error: OverflowError: cannot convert float infinity to integer
. Q2: what’s causing this?
Finally, when I use the plotting method from the example, tweaked slightly for API updates:
fix, ax = plt.subplots()
pd.Series(gen_counts).plot(kind="bar", ax=ax)
gen_idata.posterior_predictive.plot.scatter(x="counts_dim_2", y="counts", alpha=0.5)
_ = ax.set_xticklabels(["0", "1-7", "8-14", "15-30", "31-180", ">180"])
_ = ax.set_title(f"Generated data\nPosterior predictive")
I get this beautiful plot of the posterior predictive samples, with dots supposedly coloured according to the draw they’re taken from. However, the colorbar seems strange - there seem to be almost all yellow dots. Q3: Is there a way to get arviz.plot_ppc()
to produce a similar plot?
Thanks for reading!