Hi,
I am trying to build a Censored Binomial GLM to model purchasing data of a product with limited capacity.
Let’s say, we have the following data-generating process:
- We have a product with limited capacity, i.e., there is only a limited number of items available
- We have a varying number of customers interested in the product. Each customers buys the product with a certain probability
- That probability depends on some explanatory variable (e.g. the price of the product) with a sigmoidal dependency, e.g. error function.
I would like to model this with a Binomial GLM that is censored with the upper limit being the number of available items.
Here is script simulating some data and the model definition:
import pymc as pm
import numpy as np
import pandas as pd
import altair as alt
from scipy.special import erfc
rng = np.random.RandomState(123)
# Simulate some data
capacity = 50
N = 1000 # number of samples
lam = 100
customers = rng.poisson(lam=lam, size=N)
x = np.random.uniform(low=0, high=1, size=N)
a = 10.0
b = -5.0
p = 0.5 * erfc(a*x + b)
y = np.clip(rng.binomial(n=customers, p=p), 0, capacity)
# Plotting with altair
data = pd.DataFrame({'x': x, 'p': p, 'customers': customers, 'y': y})
alt.Chart(data).mark_circle().encode(x='x', y='y')
The data looks like this:
Model definition
with pm.Model() as censored_binomial_glm:
# Model parameters
a = pm.Normal('a', 0.0, 10.0)
b = pm.Normal('b', 0.0, 1.0)
lam = pm.Exponential('lam', 0.1)
p = 0.5 * pm.math.erfc(a*x + b)
# We assume that the customer number is also observed
N = pm.Poisson('num_customers', mu=lam, observed=customers)
binomial_dist = pm.Binomial.dist(n=N, p=p)
pm.Censored("y", binomial_dist, lower=0, upper=capacity, observed=y)
Then I try to fit the model
with censored_binomial_glm:
trace = pm.sample(target_accept=0.999)
and up with a lot of divergences. I verified that the model works well for a very high capacity where we’re basically never hitting maximum capacity and the censoring doesn’t play any role.
Does anyone see an error in the model definition or modeling approach or do you have an advice on how to improve the parametrization, for instance?
Thanks in advance.