How to implement Zero-Inflated Multinomial?

Thank you very much Junpeng, that’s very helpful!
What’s still blurry in my mind is whether this would require custom log-p and density dist functions, or if multiple likelihoods in the model would suffice – I guess this will be clearer when beginning writing the code.
I’m gonna try your suggestion ASAP this week and keep you posted :wink:

1 Like

Ok so I think I managed to write an intercept-only model that accounts for the missing categories, thanks to your advice n°1 @junpenglao!

However, before claiming victory, I’d like to add a slope and test the model against simulated data. To do that, I’m using an MvNormal, so that the model accounts for potential covariation between intercepts and slopes:

with pm.Model() as m_slope:
    # cov matrix:
    sd_dist = pm.Exponential.dist(1)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=4, n=2, sd_dist=sd_dist, shape=Ncategories - 1)
    chol = pm.expand_packed_triangular(2, packed_chol, lower=True, shape=Ncategories - 1)
    # Extract rho and sds:
    cov =, chol.T)
    sigma_ab = pm.Deterministic('sigma_district', tt.sqrt(tt.diag(cov)))
    corr = tt.diag(sigma_ab**-1).dot(**-1)))
    r = pm.Deterministic('Rho', corr[np.triu_indices(2, k=1)])
    # means:
    ab = pm.Normal('ab', mu=[-1.8, -0.5] , sd=[0.1, 0.2], shape=(Ncategories - 1, 2))   
    # varying effects:
    ab_cluster = pm.MvNormal('ab_cluster', mu=ab, chol=chol, shape=(Nclusters, Ncategories - 1, 2))

    mu = ab_cluster[cluster_id, :, 0] + ab_cluster[cluster_id, :, 1] * predictor

My goal is to get the mu matrix with all clusters and categories, that I can then pass to softmax (not shown here). But how do I get there, as pm.LKJCholeskyCov doesn’t accept a shape argument? :thinking:
Is there an elegant solution, or do I have to do a loop to construct the mu matrix progressively?
Thank you very much in advance for your help :tada:

Unfortunately you will need to use a loop, because the cholesky factorization is not batchable (if I remember correctly).

1 Like

Otherwise another possible solution is to construct a large blockwise covariance matrix.

Ok thanks Junpeng, I’ll try the loop then :wink:
I think a large blockwise covariance matrix is what I’m looking for, but I don’t know how to do that :sweat_smile:

Ok what do you think of this?

with pm.Model() as m_slope:
    mus_cats = []
    for p in range(Ncategories - 1):
        sd_dist = pm.Exponential.dist(1)
        packed_chol = pm.LKJCholeskyCov(f"chol_cov_{p}", eta=4, n=2, sd_dist=sd_dist)
        chol = pm.expand_packed_triangular(2, packed_chol, lower=True)
        cov =, chol.T)

        # extract rho and sds:
        cov =, chol.T)
        sigma_ab = pm.Deterministic(f"sigma_cluster_{p}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot( ** -1)))
        r = pm.Deterministic(f"Rho_{p}", corr[np.triu_indices(2, k=1)])

        ab = pm.Normal(
            f"ab_{p}", mu=np.array([-1.8, -0.5]), sd=np.array([0.1, 0.2]), shape=2

        # varying effects:
        ab_cluster = pm.MvNormal(
            f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2)

            ab_cluster[cluster_id, 0]
            + ab_cluster[cluster_id, 1] * predictor
    mus_cats = tt.as_tensor_variable(mus_cats).T

    # append last category:
    vary_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    mus_cats = tt.horizontal_stack(mus_cats, vary_pivot[cluster_id])

    # preferences in each cluster:
    lat_p = tt.nnet.softmax(mus_cats)

    # keep only preferences for available categories:
    cat_prob = available_cats * lat_p
    # normalize preferences:
    cat_prob = cat_prob / tt.sum(cat_prob, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=cat_prob, observed=sim_R)

There is no divergences yet, but sampling is reeeeeaaaaaally slow, so I’m probably doing something wrong :thinking: Note that the intercept-only model samples very fast, so I think it’s adding the covariance structure that slows things up.
Do you see a way to optimize the above implementation?
Thank you very much in advance :champagne:

In the end, this model took 3 hours and a half to sample :laughing:
There are no divergences, but I’m trying to implement a non-centered version of the MvNormal – I suspect it to make sampling harder, as the intercept-only model samples in 2 minutes. That would mean that the line:

# varying effects:
ab_cluster = pm.MvNormal(f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2))


z = pm.Normal(f"z_{p}", 0., 1., shape=(Nclusters, 2))
vals_raw = ab + z
ab_cluster = pm.Deterministic(f"ab_cluster{p}",, vals_raw.T).T)

Does this seem right to you? And are there other optimizations I could do to speed-up sampling in your opinion?
(I can share how I simulate the data if useful – didn’t do it here because it’s a big chunk of code not related to PyMC)

1 Like

A noncentered version makes more sense to me for sure :wink:

1 Like

I just ran this version and… it actually takes more time to sample :sweat_smile:
Did I make a mistake somewhere? Or it’s just that this model can’t be optimized anymore?

hmmm, is it takes more time for tuning or more time in general?

Mmmh, good question :thinking: Is there a way to check the tuning time in PyMC? I only see the total time, and it definitely takes more – 5h against 3h30 for the centered model.
Actually, the non-centered sampler did not even converge…
Does this information help diagnosing the problem?

There are a few ways to benchmark your model, for example you can try

1 Like

Thanks Junpeng! Just did that on the two models, but don’t really know how to interpret the output… Do you have any tips on where I should focus my attention?

Usually, I will look for the ops and check the ones that takes the longest time.
I have to say it is the easiest thing to do, as some times a small change in your model results in a completely different graph from theano, and I have yet to figure out why…

1 Like

Thanks Junpeng! So it seems that 80% of ops time come from (results of m_slope.profile(m_slope.logpt).summary()):

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.1%    32.1%       0.603s       1.00e-04s     Py    6000        6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
  20.2%    52.3%       0.380s       3.17e-05s     Py    12000       12   AdvancedSubtensor
  11.1%    63.4%       0.209s       3.49e-05s     Py    6000        6   AdvancedIncSubtensor{inplace=False,  set_instead_of_inc=True}
   3.9%    67.3%       0.073s       1.22e-05s     Py    6000        6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
   3.2%    70.5%       0.060s       1.20e-05s     C     5000        5   IncSubtensor{Inc;int64}
   3.0%    73.5%       0.056s       9.41e-06s     C     6000        6   IncSubtensor{InplaceInc;int64::}
   2.9%    76.4%       0.055s       9.13e-06s     C     6000        6   AdvancedIncSubtensor1{no_inplace,set}
   2.9%    79.3%       0.054s       9.01e-06s     C     6000        6   CumOp{None, add}

It looks like the classes LinearAlgebra, AdvancedSubtensor and AdvancedIncSubtensor are taking the most time, but I’m not sure how to improve that…
I see these two tips at the end of the output:

  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

Does it help in your experience – or would it in this case?
A huge thank you again for your time :champagne:

Is your CPU AMD? I think amdlibm only apply for that.
What I found surprising is that some of the operation is not compile to c, but I have not look at any profiling result for a long time so I might be miss-informed :sweat_smile:
Pinging @aseyboldt to see if he has some tips re looking at profiling and speeding up ops

1 Like

Mmmh, not sure how to answer – sorry, I am truly clueless about this stuff :sweat_smile:
Here is my machine’s info (it’s a Mac), hoping that it answers your question:

compiler   : Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 15.6.0
machine    : x86_64
processor  : i386
CPU cores  : 2
interpreter: 64bit

Mac only use Intel CPU I think - and you set up your environment using conda? (if so Intel MKL should be set up correctly)

Yes it’s a conda env – don’t know if useful but my Mac is quite old (mid-2009).
FYI, I just ran the same model, but without modeling the covariation btw intercepts and slopes and it took 10 minutes to sample. So it seems that it really is the addition of the covariation structure (MvNormal) that considerably slows sampling.

Just realized I can mark this as solved thanks to @junpenglao’s suggestion of considering that non-available categories have probability 0 – this implies we know in advance which categories are absent. Of course, I’m curious about @aseyboldt thoughts on profiling and speed-ups if he’s got any!
Here is a simplified version of the model, for the sake of readability and understanding how to implement the idea with PyMC:

with pm.Model() as m_multi:
    a_cluster = pm.Normal("a_cluster", -1.8, 0.5, shape=(Nclusters, Ncategories - 1))
    a_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    a_cluster_f = tt.horizontal_stack(a_cluster a_pivot)

    lat_p = tt.nnet.softmax(a_cluster_f[cluster_id])

    # zero-inflation process:
    # keep only preferences for available categories:
    slot_p = cats_available * lat_p
    # normalize preferences:
    slot_p = slot_p / tt.sum(slot_p, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=slot_p, observed=sim_R)

    trace_multi = pm.sample(2000, tune=3000, init="adapt_diag")

Hope this will help future users!
For the curious out there, the real model is much more complicated (varying-effects and covariance) but I’ll open source it – as well as real and simulated data – once I’ve had time to clean it up!
And again, a big thank you to Junpeng :champagne: Couldn’t have done it without your help! Now I owe you a second beer when you come to Paris :wink: