Thank you very much Junpeng, that’s very helpful!

What’s still blurry in my mind is whether this would require custom `logp`/`DensityDist` functions, or if multiple likelihoods in the model would suffice – I guess this will become clearer once I start writing the code.

I’m gonna try your suggestion ASAP this week and keep you posted

# How to implement Zero-Inflated Multinomial?

Ok so I think I managed to write an intercept-only model that accounts for the missing categories, thanks to your advice n°1 @junpenglao!

However, before claiming victory, I’d like to add a slope and test the model against simulated data. To do that, I’m using an MvNormal, so that the model accounts for potential covariation between intercepts and slopes:

```
with pm.Model() as m_slope:
    # covariance matrix:
    sd_dist = pm.Exponential.dist(1)
    # NB: the `shape` arguments below are what I *wish* worked -- see my question:
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=4, n=2, sd_dist=sd_dist, shape=Ncategories - 1)
    chol = pm.expand_packed_triangular(2, packed_chol, lower=True, shape=Ncategories - 1)
    # extract rho and sds:
    cov = pm.math.dot(chol, chol.T)
    sigma_ab = pm.Deterministic('sigma_district', tt.sqrt(tt.diag(cov)))
    corr = tt.diag(sigma_ab**-1).dot(cov.dot(tt.diag(sigma_ab**-1)))
    r = pm.Deterministic('Rho', corr[np.triu_indices(2, k=1)])
    # means:
    ab = pm.Normal('ab', mu=[-1.8, -0.5], sd=[0.1, 0.2], shape=(Ncategories - 1, 2))
    # varying effects:
    ab_cluster = pm.MvNormal('ab_cluster', mu=ab, chol=chol, shape=(Nclusters, Ncategories - 1, 2))
    mu = ab_cluster[cluster_id, :, 0] + ab_cluster[cluster_id, :, 1] * predictor
```

My goal is to get the `mu` matrix with all clusters and categories, which I can then pass to softmax (not shown here). But how do I get there, given that `pm.LKJCholeskyCov` doesn’t accept a `shape` argument?

Is there an elegant solution, or do I have to use a loop to construct the `mu` matrix progressively?

Thank you very much in advance for your help

Unfortunately you will need to use a loop, because the Cholesky factorization is not batchable (if I remember correctly).

Otherwise another possible solution is to construct a large blockwise covariance matrix.
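Roughly, the idea is to stack the per-category 2×2 covariance blocks into one big block-diagonal matrix, so a single MvNormal covers all categories at once. A NumPy sketch of just the assembly step (not a full PyMC model; the sizes and names are placeholders):

```python
import numpy as np
from scipy.linalg import block_diag

Ncategories = 4  # placeholder value

# One 2x2 covariance block per non-pivot category (intercept & slope),
# assembled into a single block-diagonal covariance matrix.
blocks = []
for p in range(Ncategories - 1):
    L = np.tril(np.random.randn(2, 2))  # stand-in for a per-category Cholesky factor
    blocks.append(L @ L.T)              # 2x2 covariance block
big_cov = block_diag(*blocks)           # shape (2*(Ncategories-1), 2*(Ncategories-1))
print(big_cov.shape)                    # (6, 6)
```

The off-block entries are zero by construction, i.e. intercept/slope pairs are correlated within a category but independent across categories.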

Ok thanks Junpeng, I’ll try the loop then

I think a large blockwise covariance matrix is what I’m looking for, but I don’t know how to do that

Ok what do you think of this?

```
with pm.Model() as m_slope:
    mus_cats = []
    for p in range(Ncategories - 1):
        sd_dist = pm.Exponential.dist(1)
        packed_chol = pm.LKJCholeskyCov(f"chol_cov_{p}", eta=4, n=2, sd_dist=sd_dist)
        chol = pm.expand_packed_triangular(2, packed_chol, lower=True)
        # extract rho and sds:
        cov = pm.math.dot(chol, chol.T)
        sigma_ab = pm.Deterministic(f"sigma_cluster_{p}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot(cov.dot(tt.diag(sigma_ab ** -1)))
        r = pm.Deterministic(f"Rho_{p}", corr[np.triu_indices(2, k=1)])
        # means:
        ab = pm.Normal(
            f"ab_{p}", mu=np.array([-1.8, -0.5]), sd=np.array([0.1, 0.2]), shape=2
        )
        # varying effects:
        ab_cluster = pm.MvNormal(
            f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2)
        )
        mus_cats.append(
            ab_cluster[cluster_id, 0]
            + ab_cluster[cluster_id, 1] * predictor
        )
    mus_cats = tt.as_tensor_variable(mus_cats).T
    # append last (pivot) category:
    vary_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    mus_cats = tt.horizontal_stack(mus_cats, vary_pivot[cluster_id])
    # preferences in each cluster:
    lat_p = tt.nnet.softmax(mus_cats)
    # keep only preferences for available categories:
    cat_prob = available_cats * lat_p
    # normalize preferences:
    cat_prob = cat_prob / tt.sum(cat_prob, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=cat_prob, observed=sim_R)
```

There are no divergences yet, but sampling is reeeeeaaaaaally slow, so I’m probably doing something wrong. Note that the intercept-only model samples very fast, so I think it’s the added covariance structure that slows things down.

Do you see a way to optimize the above implementation?

Thank you very much in advance

In the end, this model took three and a half hours to sample.

There are no divergences, but I’m trying to implement a non-centered version of the MvNormal – I suspect it makes sampling harder, as the intercept-only model samples in 2 minutes. That would mean that the line:

```
# varying effects:
ab_cluster = pm.MvNormal(f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2))
```

becomes:

```
z = pm.Normal(f"z_{p}", 0., 1., shape=(Nclusters, 2))
vals_raw = ab + z
ab_cluster = pm.Deterministic(f"ab_cluster{p}", tt.dot(chol, vals_raw.T).T)
```

Does this seem right to you? And are there other optimizations I could make to speed up sampling, in your opinion?

*(I can share how I simulate the data if useful – didn’t do it here because it’s a big chunk of code not related to PyMC)*
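A quick NumPy sanity check of the non-centered idea, with placeholder values: if `z ~ N(0, I)`, then `mu + L·z` has mean `mu` and covariance `L·Lᵀ` (note that the mean is added *outside* the Cholesky product here):

```python
import numpy as np

rng = np.random.default_rng(42)
L = np.array([[0.1, 0.0], [0.05, 0.2]])  # placeholder Cholesky factor
mu = np.array([-1.8, -0.5])              # placeholder means
z = rng.standard_normal((100_000, 2))    # standard normal draws
samples = mu + z @ L.T                   # non-centered: mean added outside chol
print(samples.mean(axis=0))              # close to mu
print(np.cov(samples.T))                 # close to L @ L.T
```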

A non-centered version makes more sense to me, for sure.

I just ran this version and… it actually takes more time to sample

Did I make a mistake somewhere? Or is it just that this model can’t be optimized any further?

Hmmm, does it take more time for tuning, or more time in general?

Mmmh, good question. Is there a way to check the *tuning* time in PyMC? I only see the *total* time, and it definitely takes more – 5h against 3h30 for the centered model.

Actually, the non-centered sampler did not even converge…

Does this information help diagnosing the problem?

There are a few ways to benchmark your model, for example you can try https://docs.pymc.io/notebooks/profiling.html

Thanks Junpeng! Just did that on the two models, but I don’t really know how to interpret the output… Do you have any tips on where I should focus my attention?

Usually, I will look for the ops and check the ones that take the longest time.

I have to say it is not the easiest thing to do, as sometimes a small change in your model results in a completely different graph from Theano, and I have yet to figure out why…

Thanks Junpeng! So it seems that 80% of ops time comes from (results of `m_slope.profile(m_slope.logpt).summary()`):

```
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
   32.1%   32.1%       0.603s       1.00e-04s     Py    6000        6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
   20.2%   52.3%       0.380s       3.17e-05s     Py   12000       12   AdvancedSubtensor
   11.1%   63.4%       0.209s       3.49e-05s     Py    6000        6   AdvancedIncSubtensor{inplace=False, set_instead_of_inc=True}
    3.9%   67.3%       0.073s       1.22e-05s     Py    6000        6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
    3.2%   70.5%       0.060s       1.20e-05s      C    5000        5   IncSubtensor{Inc;int64}
    3.0%   73.5%       0.056s       9.41e-06s      C    6000        6   IncSubtensor{InplaceInc;int64::}
    2.9%   76.4%       0.055s       9.13e-06s      C    6000        6   AdvancedIncSubtensor1{no_inplace,set}
    2.9%   79.3%       0.054s       9.01e-06s      C    6000        6   CumOp{None, add}
```

It looks like the classes `LinearAlgebra`, `AdvancedSubtensor` and `AdvancedIncSubtensor` are taking the most time, but I’m not sure how to improve that…

I see these two tips at the end of the output:

```
- Try the Theano flag floatX=float32
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
```
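For reference, Theano reads these flags from the `THEANO_FLAGS` environment variable (or from `~/.theanorc`), so they can be tried per-run without touching any code:

```shell
# Set the flag for the current shell session:
export THEANO_FLAGS='floatX=float32'
echo "$THEANO_FLAGS"

# Or make it permanent in ~/.theanorc:
#   [global]
#   floatX = float32
```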

Does it help in your experience – or would it in this case?

A huge thank you again for your time

Is your CPU AMD? I think amdlibm only applies to AMD CPUs.

What I found surprising is that some of the operations are not compiled to C, but I have not looked at any profiling results in a long time, so I might be misinformed.

Pinging @aseyboldt to see if he has some tips re looking at profiling and speeding up ops

Mmmh, not sure how to answer – sorry, I am truly clueless about this stuff

Here is my machine’s info (it’s a Mac), hoping that it answers your question:

```
compiler : Clang 4.0.1 (tags/RELEASE_401/final)
system : Darwin
release : 15.6.0
machine : x86_64
processor : i386
CPU cores : 2
interpreter: 64bit
```

Macs only use Intel CPUs, I think – and you set up your environment using conda? (If so, Intel MKL should be set up correctly.)
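One quick way to check which BLAS NumPy was linked against (with a conda install, MKL should be mentioned in the output if it’s being used):

```python
import numpy as np

# Print NumPy's build/link configuration; look for "mkl" in the
# BLAS/LAPACK sections to confirm MKL is in use.
np.show_config()
```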

Yes, it’s a conda env – don’t know if it’s useful, but my Mac is quite old (mid-2009).

FYI, I just ran the same model but without modeling the covariation between intercepts and slopes, and it took 10 minutes to sample. So it seems that it really is the addition of the covariance structure (MvNormal) that considerably slows down sampling.

Just realized I can mark this as solved, thanks to @junpenglao’s suggestion of considering that non-available categories have probability 0 – this implies we know in advance which categories are absent. Of course, I’m still curious about @aseyboldt’s thoughts on profiling and speed-ups, if he’s got any!

Here is a simplified version of the model, for the sake of readability and understanding how to implement the idea with PyMC:

```
with pm.Model() as m_multi:
    a_cluster = pm.Normal("a_cluster", -1.8, 0.5, shape=(Nclusters, Ncategories - 1))
    a_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    a_cluster_f = tt.horizontal_stack(a_cluster, a_pivot)
    lat_p = tt.nnet.softmax(a_cluster_f[cluster_id])
    # zero-inflation process:
    # keep only preferences for available categories:
    slot_p = cats_available * lat_p
    # normalize preferences:
    slot_p = slot_p / tt.sum(slot_p, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=slot_p, observed=sim_R)
    trace_multi = pm.sample(2000, tune=3000, init="adapt_diag")
```
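As a sanity check, the masking-and-renormalization step is easy to verify in plain NumPy (a standalone sketch with made-up numbers):

```python
import numpy as np

lat_p = np.array([[0.2, 0.3, 0.5],
                  [0.1, 0.6, 0.3]])  # latent preferences (rows sum to 1)
available = np.array([[1, 0, 1],
                      [1, 1, 1]])    # 0 = category absent in that cluster
masked = available * lat_p           # absent categories get probability 0
cat_prob = masked / masked.sum(axis=1, keepdims=True)
print(cat_prob)
# row 0 is renormalized to [2/7, 0, 5/7]; row 1 is unchanged
```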

Hope this will help future users!

For the curious out there, the real model is much more complicated (varying-effects and covariance) but I’ll open source it – as well as real and simulated data – once I’ve had time to clean it up!

And again, a big thank you to Junpeng – couldn’t have done it without your help! Now I owe you a second beer when you come to Paris