How to implement Zero-Inflated Multinomial?

Hi y’all!
I’m working on multinomial count data, but there is zero-inflation for some categories. I wonder how to handle that with PyMC.

Imagine you have several subjects, several observations per subject, and covariates for each observation. At each observation, subjects have to choose between several categories, but sometimes, one or more of the categories are not available for people to choose - this is deliberate, not a mistake.
As a result, you have a mixture of processes:

  • one where some categories have 0 counts because people couldn’t choose them
  • one with a classical multinomial process between all categories (which can also produce zeros, although very rarely when there is lots of trials)

My goal is to infer in a regression the latent probability of each category, without it being biased by the excess zeros. I feel like a zero-inflated multinomial would be appropriate here, but maybe so could be a censored-data model with the pm.Mixture class?

I’m not very experienced in these types of models, so I’m really curious about your take! Also, if the ZIMultinomial is the way to go, I noticed it’s not built into PyMC. I would be interested in doing a PR to add it if you find it useful :blush:
Thanks in advance & PyMCheers :vulcan_salute:


Interesting question, here is some of my thoughts:

Are the none-available categories known? If so I would build a Categorical/Multinomial RV as the observed, with the p containing 0s for those none-available categories.
In general, if these information is available you dont need to model it with a mixture distribution. If these information is not available, I would also start with assuming all 0 response being not available categories.


Thx Junpeng!

Yes, they are known, but they vary from one observation to the other, so with your suggestion I guess I would need several p vectors, one for each configuration?

To be more precise, here is a sample of the data:

Categories 1, 3 and 4 are the more likely to be not available for choice (I don’t have all of the data yet but I know that). Category 7 is an aggregation of “all other choices”, so it’s always available - i.e the zeros are real multinomial zeros, while zeros in cat1…6 reflect non-availability.

As you can see, the non-available categories vary, so I don’t really know how to say that to the model… My goal is to model these data as a hierarchical regression, with subject_type as cluster (and probably date for a second-stage refinement).
Thank you very much in advance!

I got all my data and can confirm that the zero-inflated categories are mostly 3 and 4: 25% of zeros in the sample – category 1’s zero-inflation is near-negligible, with about 5% zeros.
I read lots of papers, and it seems to be an active area of research! From what I understood, there are two options - none of which I know how to implement in PyMC:

  1. Generalizing Diallo et al., for every observation i, we assume that given the total C_{1i} + C_{2i} + ... + C_{6i} + C_{7i} = N_i, the multinomial response C_i = (C_{1i}, C_{2i}, ..., C_{7i}) is generated from the model:
\begin{equation} C_i \sim \begin{cases} (C_{1i}, C_{2i}, 0, C_{4i}, ..., C_{7i}), & \text{with probability $\pi$}\\ (C_{1i}, C_{2i}, C_{3i}, 0, ..., C_{7i}), & \text{with probability $\phi$}\\ mult(N_i, p_i), & \text{with probability $1- \pi - \phi$} \end{cases} \end{equation}
\text{where}\ p_i = (p_{1i}, ..., p_{7i}) \ \text{and}\ \Sigma_j(p_{ji}) = 1
  1. Reverse the thinking and model the non-zero-inflated categories (2, 5 and 6) as baseline-inflated: because they were available all the time, some of their numbers are inflated compared to their true proportion, when all categories are available. Thus, following Cai et al., if p(k|\theta) denotes a multinomial PMF with 7 categories, the model’s likelihood is:
\begin{equation} p(Y_i = k) = \begin{cases} \pi_k +(1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{if $k = 2, 5, 6$}\\ (1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{otherwise} \end{cases} \end{equation}

where k is the kth category in the multinomial distribution, and \pi_k is the probability of inflation of category k.

I prefer option 1 (more intuitive, less convoluted), but which is easier to implement in PyMC? How would I even go about it?
A huge thanks in advance for your help :champagne:
(of course, please tell me if there is any mistake above)

I would go with something like 2, with the assumption that the missing choices are by design (or known a-prior). This means that you dont need to infer the probability of the categories being missing like option 1. I am still thinking about what is the best way to implemented it. I am considering 2 ways:

  1. build the p as usually, using a mask to set the unavailable ones to 0, and do a normalization for each row. This is straightforward but I think you will ran into problem during sampling because the normalization
  2. treat each unique missing as independent, use a for-loop to build a multinomial for each case. You can index into your weight matrix (betas) to exclude the missing categories. It should work in general but if you have a lot of unique combination of missing, then there will be problem building the model (the theano graph becomes too big).
1 Like

Thank you very much Junpeng, that’s very helpful!
What’s still blurry in my mind is whether this would require custom log-p and density dist functions, or if multiple likelihoods in the model would suffice – I guess this will be clearer when beginning writing the code.
I’m gonna try your suggestion ASAP this week and keep you posted :wink:

1 Like

Ok so I think I managed to write an intercept-only model that accounts for the missing categories, thanks to your advice n°1 @junpenglao!

However, before claiming victory, I’d like to add a slope and test the model against simulated data. To do that, I’m using an MvNormal, so that the model accounts for potential covariation between intercepts and slopes:

with pm.Model() as m_slope:
    # cov matrix:
    sd_dist = pm.Exponential.dist(1)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=4, n=2, sd_dist=sd_dist, shape=Ncategories - 1)
    chol = pm.expand_packed_triangular(2, packed_chol, lower=True, shape=Ncategories - 1)
    # Extract rho and sds:
    cov =, chol.T)
    sigma_ab = pm.Deterministic('sigma_district', tt.sqrt(tt.diag(cov)))
    corr = tt.diag(sigma_ab**-1).dot(**-1)))
    r = pm.Deterministic('Rho', corr[np.triu_indices(2, k=1)])
    # means:
    ab = pm.Normal('ab', mu=[-1.8, -0.5] , sd=[0.1, 0.2], shape=(Ncategories - 1, 2))   
    # varying effects:
    ab_cluster = pm.MvNormal('ab_cluster', mu=ab, chol=chol, shape=(Nclusters, Ncategories - 1, 2))

    mu = ab_cluster[cluster_id, :, 0] + ab_cluster[cluster_id, :, 1] * predictor

My goal is to get the mu matrix with all clusters and categories, that I can then pass to softmax (not shown here). But how do I get there, as pm.LKJCholeskyCov doesn’t accept a shape argument? :thinking:
Is there an elegant solution, or do I have to do a loop to construct the mu matrix progressively?
Thank you very much in advance for your help :tada:

Unfortunately you will need to use a loop, because the cholesky factorization is not batchable (if I remember correctly).

1 Like

Otherwise another possible solution is to construct a large blockwise covariance matrix.

Ok thanks Junpeng, I’ll try the loop then :wink:
I think a large blockwise covariance matrix is what I’m looking for, but I don’t know how to do that :sweat_smile:

Ok what do you think of this?

with pm.Model() as m_slope:
    mus_cats = []
    for p in range(Ncategories - 1):
        sd_dist = pm.Exponential.dist(1)
        packed_chol = pm.LKJCholeskyCov(f"chol_cov_{p}", eta=4, n=2, sd_dist=sd_dist)
        chol = pm.expand_packed_triangular(2, packed_chol, lower=True)
        cov =, chol.T)

        # extract rho and sds:
        cov =, chol.T)
        sigma_ab = pm.Deterministic(f"sigma_cluster_{p}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot( ** -1)))
        r = pm.Deterministic(f"Rho_{p}", corr[np.triu_indices(2, k=1)])

        ab = pm.Normal(
            f"ab_{p}", mu=np.array([-1.8, -0.5]), sd=np.array([0.1, 0.2]), shape=2

        # varying effects:
        ab_cluster = pm.MvNormal(
            f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2)

            ab_cluster[cluster_id, 0]
            + ab_cluster[cluster_id, 1] * predictor
    mus_cats = tt.as_tensor_variable(mus_cats).T

    # append last category:
    vary_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    mus_cats = tt.horizontal_stack(mus_cats, vary_pivot[cluster_id])

    # preferences in each cluster:
    lat_p = tt.nnet.softmax(mus_cats)

    # keep only preferences for available categories:
    cat_prob = available_cats * lat_p
    # normalize preferences:
    cat_prob = cat_prob / tt.sum(cat_prob, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=cat_prob, observed=sim_R)

There is no divergences yet, but sampling is reeeeeaaaaaally slow, so I’m probably doing something wrong :thinking: Note that the intercept-only model samples very fast, so I think it’s adding the covariance structure that slows things up.
Do you see a way to optimize the above implementation?
Thank you very much in advance :champagne:

In the end, this model took 3 hours and a half to sample :laughing:
There are no divergences, but I’m trying to implement a non-centered version of the MvNormal – I suspect it to make sampling harder, as the intercept-only model samples in 2 minutes. That would mean that the line:

# varying effects:
ab_cluster = pm.MvNormal(f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2))


z = pm.Normal(f"z_{p}", 0., 1., shape=(Nclusters, 2))
vals_raw = ab + z
ab_cluster = pm.Deterministic(f"ab_cluster{p}",, vals_raw.T).T)

Does this seem right to you? And are there other optimizations I could do to speed-up sampling in your opinion?
(I can share how I simulate the data if useful – didn’t do it here because it’s a big chunk of code not related to PyMC)

1 Like

A noncentered version makes more sense to me for sure :wink:

1 Like

I just ran this version and… it actually takes more time to sample :sweat_smile:
Did I make a mistake somewhere? Or it’s just that this model can’t be optimized anymore?

hmmm, is it takes more time for tuning or more time in general?

Mmmh, good question :thinking: Is there a way to check the tuning time in PyMC? I only see the total time, and it definitely takes more – 5h against 3h30 for the centered model.
Actually, the non-centered sampler did not even converge…
Does this information help diagnosing the problem?

There are a few ways to benchmark your model, for example you can try

1 Like

Thanks Junpeng! Just did that on the two models, but don’t really know how to interpret the output… Do you have any tips on where I should focus my attention?

Usually, I will look for the ops and check the ones that takes the longest time.
I have to say it is the easiest thing to do, as some times a small change in your model results in a completely different graph from theano, and I have yet to figure out why…

1 Like

Thanks Junpeng! So it seems that 80% of ops time come from (results of m_slope.profile(m_slope.logpt).summary()):

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.1%    32.1%       0.603s       1.00e-04s     Py    6000        6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
  20.2%    52.3%       0.380s       3.17e-05s     Py    12000       12   AdvancedSubtensor
  11.1%    63.4%       0.209s       3.49e-05s     Py    6000        6   AdvancedIncSubtensor{inplace=False,  set_instead_of_inc=True}
   3.9%    67.3%       0.073s       1.22e-05s     Py    6000        6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
   3.2%    70.5%       0.060s       1.20e-05s     C     5000        5   IncSubtensor{Inc;int64}
   3.0%    73.5%       0.056s       9.41e-06s     C     6000        6   IncSubtensor{InplaceInc;int64::}
   2.9%    76.4%       0.055s       9.13e-06s     C     6000        6   AdvancedIncSubtensor1{no_inplace,set}
   2.9%    79.3%       0.054s       9.01e-06s     C     6000        6   CumOp{None, add}

It looks like the classes LinearAlgebra, AdvancedSubtensor and AdvancedIncSubtensor are taking the most time, but I’m not sure how to improve that…
I see these two tips at the end of the output:

  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

Does it help in your experience – or would it in this case?
A huge thank you again for your time :champagne: