Censored Binomial GLM: Sampling fails

I am trying to build a Censored Binomial GLM to model purchasing data of a product with limited capacity.
Let’s say, we have the following data-generating process:

  • We have a product with limited capacity, i.e., there is only a limited number of items available
  • We have a varying number of customers interested in the product. Each customers buys the product with a certain probability
  • That probability depends on some explanatory variable (e.g. the price of the product) with a sigmoidal dependency, e.g. error function.

I would like to model this with a Binomial GLM that is censored with the upper limit being the number of available items.

Here is script simulating some data and the model definition:

import pymc as pm
import numpy as np
import pandas as pd
import altair as alt
from scipy.special import erfc

rng = np.random.RandomState(123)

# Simulate some data
capacity = 50
N = 1000 # number of samples
lam = 100
customers = rng.poisson(lam=lam, size=N)
x = np.random.uniform(low=0, high=1, size=N)
a = 10.0
b = -5.0

p = 0.5 * erfc(a*x + b)

y = np.clip(rng.binomial(n=customers, p=p), 0, capacity)

# Plotting with altair
data = pd.DataFrame({'x': x, 'p': p, 'customers': customers, 'y': y})
alt.Chart(data).mark_circle().encode(x='x', y='y')

The data looks like this:


Model definition

with pm.Model() as censored_binomial_glm:
    # Model parameters
    a = pm.Normal('a', 0.0, 10.0)
    b = pm.Normal('b', 0.0, 1.0)
    lam = pm.Exponential('lam', 0.1)

    p = 0.5 * pm.math.erfc(a*x + b)
    # We assume that the customer number is also observed
    N = pm.Poisson('num_customers', mu=lam, observed=customers)
    binomial_dist = pm.Binomial.dist(n=N, p=p)
    pm.Censored("y", binomial_dist, lower=0, upper=capacity, observed=y)

Then I try to fit the model

with censored_binomial_glm:
    trace = pm.sample(target_accept=0.999)

and up with a lot of divergences. I verified that the model works well for a very high capacity where we’re basically never hitting maximum capacity and the censoring doesn’t play any role.

Does anyone see an error in the model definition or modeling approach or do you have an advice on how to improve the parametrization, for instance?
Thanks in advance.

One issue I see os that N could be below the observations which would be impossible

Thank you for your reply. Actually, I don’t see how N could be below the observations in the model. The binomial distribution can also sample values <=N, can’t it?

I think @ricardoV94 meant that allowing N to be a random variable could lead to situations (during sampling) in which N was less than some observed value. So then your model might, roughly speaking, ask what binomial p would yield 12 successes out of 10 attempts.

Right, thank you, now I understood.
However, wouldn’t that be the case for any binomial GLM even without censoring? Because, without censoring and no clipping in the data, I can fit a Binomial GLM without problems.

In any case, I simplified the model to directly feed the observations into the binomial distribution, like this:

with pm.Model() as censored_binomial_glm:
    a = pm.Normal('a', 0.0, 10.0)
    b = pm.Normal('b', 0.0, 1.0)
    p = 0.5 * pm.math.erfc(a*x + b)
    binomial_dist = pm.Binomial.dist(n=customers, p=p)
    pm.Censored("y", binomial_dist, lower=0, upper=capacity, observed=y)

This doesn’t solve my problem with fitting the model, however. I still get a lot of divergences:

with censored_binomial_glm:
    trace = pm.sample(tune=2000, target_accept=0.999)

Leads to this:

I don’t think binomial glms with random N are very common. Have you considered a Poisson or Negative Binomial parametrized in terms of the mean and standard deviation instead?

Thank you for the idea. I tried to use a Censored Poisson GLM with the following code:

with pm.Model() as poisson_glm_censored:
    a = pm.HalfNormal('a', 10.0)
    b = pm.Normal('b', 0.0, 1.0)
    beta = lam * (0.5 * pm.math.erfc(a*x + b))
    poisson_dist = pm.Poisson.dist(beta)
    pm.Censored("y", poisson_dist, lower=0, upper=capacity, observed=y)

The prior simulations do look reasonable as far as I can see:

(blue dots = data points)

But the sampling completely fails, unfortunately:

with poisson_glm_censored:
    trace_censored = pm.sample(tune=4000, step=[pm.NUTS(target_accept=0.9,max_treedepth=30)])

(I started with default parameters for pm.sample() and tried different sets of parameters all leading to bad sampling results.)