How to implement Zero-Inflated Multinomial?

AlexAndorra · October 12, 2019, 6:00pm

Hi y’all!
I’m working on multinomial count data, but there is zero-inflation for some categories. I wonder how to handle that with PyMC.

Imagine you have several subjects, several observations per subject, and covariates for each observation. At each observation, subjects have to choose between several categories, but sometimes, one or more of the categories are not available for people to choose - this is deliberate, not a mistake.
As a result, you have a mixture of processes:

one where some categories have 0 counts because people couldn’t choose them
one with a classical multinomial process between all categories (which can also produce zeros, although very rarely when there is lots of trials)

My goal is to infer in a regression the latent probability of each category, without it being biased by the excess zeros. I feel like a zero-inflated multinomial would be appropriate here, but maybe so could be a censored-data model with the pm.Mixture class?

I’m not very experienced in these types of models, so I’m really curious about your take! Also, if the ZIMultinomial is the way to go, I noticed it’s not built into PyMC. I would be interested in doing a PR to add it if you find it useful
Thanks in advance & PyMCheers

junpenglao · October 13, 2019, 7:43am

Interesting question, here is some of my thoughts:

Are the none-available categories known? If so I would build a Categorical/Multinomial RV as the observed, with the p containing 0s for those none-available categories.
In general, if these information is available you dont need to model it with a mixture distribution. If these information is not available, I would also start with assuming all 0 response being not available categories.

AlexAndorra · October 13, 2019, 3:56pm

Thx Junpeng!

Yes, they are known, but they vary from one observation to the other, so with your suggestion I guess I would need several p vectors, one for each configuration?

To be more precise, here is a sample of the data:

Categories 1, 3 and 4 are the more likely to be not available for choice (I don’t have all of the data yet but I know that). Category 7 is an aggregation of “all other choices”, so it’s always available - i.e the zeros are real multinomial zeros, while zeros in cat1…6 reflect non-availability.

As you can see, the non-available categories vary, so I don’t really know how to say that to the model… My goal is to model these data as a hierarchical regression, with subject_type as cluster (and probably date for a second-stage refinement).
Thank you very much in advance!

AlexAndorra · October 16, 2019, 3:24pm

I got all my data and can confirm that the zero-inflated categories are mostly 3 and 4: 25% of zeros in the sample – category 1’s zero-inflation is near-negligible, with about 5% zeros.
I read lots of papers, and it seems to be an active area of research! From what I understood, there are two options - none of which I know how to implement in PyMC:

Generalizing Diallo et al., for every observation i, we assume that given the total C_{1i} + C_{2i} + ... + C_{6i} + C_{7i} = N_i, the multinomial response C_i = (C_{1i}, C_{2i}, ..., C_{7i}) is generated from the model:

\begin{equation} C_i \sim \begin{cases} (C_{1i}, C_{2i}, 0, C_{4i}, ..., C_{7i}), & \text{with probability $\pi$}\\ (C_{1i}, C_{2i}, C_{3i}, 0, ..., C_{7i}), & \text{with probability $\phi$}\\ mult(N_i, p_i), & \text{with probability $1- \pi - \phi$} \end{cases} \end{equation}

\text{where}\ p_i = (p_{1i}, ..., p_{7i}) \ \text{and}\ \Sigma_j(p_{ji}) = 1

Reverse the thinking and model the non-zero-inflated categories (2, 5 and 6) as baseline-inflated: because they were available all the time, some of their numbers are inflated compared to their true proportion, when all categories are available. Thus, following Cai et al., if p(k|\theta) denotes a multinomial PMF with 7 categories, the model’s likelihood is:

\begin{equation} p(Y_i = k) = \begin{cases} \pi_k +(1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{if $k = 2, 5, 6$}\\ (1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{otherwise} \end{cases} \end{equation}

where k is the kth category in the multinomial distribution, and \pi_k is the probability of inflation of category k.

I prefer option 1 (more intuitive, less convoluted), but which is easier to implement in PyMC? How would I even go about it?
A huge thanks in advance for your help
(of course, please tell me if there is any mistake above)

junpenglao · October 18, 2019, 8:34am

I would go with something like 2, with the assumption that the missing choices are by design (or known a-prior). This means that you dont need to infer the probability of the categories being missing like option 1. I am still thinking about what is the best way to implemented it. I am considering 2 ways:

build the p as usually, using a mask to set the unavailable ones to 0, and do a normalization for each row. This is straightforward but I think you will ran into problem during sampling because the normalization
treat each unique missing as independent, use a for-loop to build a multinomial for each case. You can index into your weight matrix (betas) to exclude the missing categories. It should work in general but if you have a lot of unique combination of missing, then there will be problem building the model (the theano graph becomes too big).

AlexAndorra · October 20, 2019, 4:02pm

Thank you very much Junpeng, that’s very helpful!
What’s still blurry in my mind is whether this would require custom log-p and density dist functions, or if multiple likelihoods in the model would suffice – I guess this will be clearer when beginning writing the code.
I’m gonna try your suggestion ASAP this week and keep you posted

AlexAndorra · October 25, 2019, 3:42pm

Ok so I think I managed to write an intercept-only model that accounts for the missing categories, thanks to your advice n°1 @junpenglao!

However, before claiming victory, I’d like to add a slope and test the model against simulated data. To do that, I’m using an MvNormal, so that the model accounts for potential covariation between intercepts and slopes:

with pm.Model() as m_slope:
    # cov matrix:
    sd_dist = pm.Exponential.dist(1)
    packed_chol = pm.LKJCholeskyCov('chol_cov', eta=4, n=2, sd_dist=sd_dist, shape=Ncategories - 1)
    chol = pm.expand_packed_triangular(2, packed_chol, lower=True, shape=Ncategories - 1)
    
    # Extract rho and sds:
    cov = pm.math.dot(chol, chol.T)
    sigma_ab = pm.Deterministic('sigma_district', tt.sqrt(tt.diag(cov)))
    corr = tt.diag(sigma_ab**-1).dot(cov.dot(tt.diag(sigma_ab**-1)))
    r = pm.Deterministic('Rho', corr[np.triu_indices(2, k=1)])
    
    # means:
    ab = pm.Normal('ab', mu=[-1.8, -0.5] , sd=[0.1, 0.2], shape=(Ncategories - 1, 2))   
    
    # varying effects:
    ab_cluster = pm.MvNormal('ab_cluster', mu=ab, chol=chol, shape=(Nclusters, Ncategories - 1, 2))

    mu = ab_cluster[cluster_id, :, 0] + ab_cluster[cluster_id, :, 1] * predictor

My goal is to get the mu matrix with all clusters and categories, that I can then pass to softmax (not shown here). But how do I get there, as pm.LKJCholeskyCov doesn’t accept a shape argument?
Is there an elegant solution, or do I have to do a loop to construct the mu matrix progressively?
Thank you very much in advance for your help

junpenglao · October 25, 2019, 5:24pm

Unfortunately you will need to use a loop, because the cholesky factorization is not batchable (if I remember correctly).

junpenglao · October 25, 2019, 5:26pm

Otherwise another possible solution is to construct a large blockwise covariance matrix.

AlexAndorra · October 25, 2019, 6:48pm

Ok thanks Junpeng, I’ll try the loop then
I think a large blockwise covariance matrix is what I’m looking for, but I don’t know how to do that

AlexAndorra · October 25, 2019, 8:46pm

Ok what do you think of this?

with pm.Model() as m_slope:
    mus_cats = []
    for p in range(Ncategories - 1):
        sd_dist = pm.Exponential.dist(1)
        packed_chol = pm.LKJCholeskyCov(f"chol_cov_{p}", eta=4, n=2, sd_dist=sd_dist)
        chol = pm.expand_packed_triangular(2, packed_chol, lower=True)
        cov = pm.math.dot(chol, chol.T)

        # extract rho and sds:
        cov = pm.math.dot(chol, chol.T)
        sigma_ab = pm.Deterministic(f"sigma_cluster_{p}", tt.sqrt(tt.diag(cov)))
        corr = tt.diag(sigma_ab ** -1).dot(cov.dot(tt.diag(sigma_ab ** -1)))
        r = pm.Deterministic(f"Rho_{p}", corr[np.triu_indices(2, k=1)])

        ab = pm.Normal(
            f"ab_{p}", mu=np.array([-1.8, -0.5]), sd=np.array([0.1, 0.2]), shape=2
        )

        # varying effects:
        ab_cluster = pm.MvNormal(
            f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2)
        )

        mus_cats.append(
            ab_cluster[cluster_id, 0]
            + ab_cluster[cluster_id, 1] * predictor
        )
    mus_cats = tt.as_tensor_variable(mus_cats).T

    # append last category:
    vary_pivot = tt.as_tensor_variable(np.full(shape=(Nclusters, 1), fill_value=-2.2))
    mus_cats = tt.horizontal_stack(mus_cats, vary_pivot[cluster_id])

    # preferences in each cluster:
    lat_p = tt.nnet.softmax(mus_cats)

    # keep only preferences for available categories:
    cat_prob = available_cats * lat_p
    # normalize preferences:
    cat_prob = cat_prob / tt.sum(cat_prob, axis=1, keepdims=True)
    R = pm.Multinomial("R", n=N, p=cat_prob, observed=sim_R)

There is no divergences yet, but sampling is reeeeeaaaaaally slow, so I’m probably doing something wrong Note that the intercept-only model samples very fast, so I think it’s adding the covariance structure that slows things up.
Do you see a way to optimize the above implementation?
Thank you very much in advance

AlexAndorra · November 1, 2019, 2:32pm

In the end, this model took 3 hours and a half to sample
There are no divergences, but I’m trying to implement a non-centered version of the MvNormal – I suspect it to make sampling harder, as the intercept-only model samples in 2 minutes. That would mean that the line:

# varying effects:
ab_cluster = pm.MvNormal(f"ab_cluster{p}", mu=ab, chol=chol, shape=(Nclusters, 2))

becomes:

z = pm.Normal(f"z_{p}", 0., 1., shape=(Nclusters, 2))
vals_raw = ab + z
ab_cluster = pm.Deterministic(f"ab_cluster{p}", tt.dot(chol, vals_raw.T).T)

Does this seem right to you? And are there other optimizations I could do to speed-up sampling in your opinion?
(I can share how I simulate the data if useful – didn’t do it here because it’s a big chunk of code not related to PyMC)

junpenglao · November 1, 2019, 4:16pm

A noncentered version makes more sense to me for sure

AlexAndorra · November 1, 2019, 5:06pm

I just ran this version and… it actually takes more time to sample
Did I make a mistake somewhere? Or it’s just that this model can’t be optimized anymore?

junpenglao · November 1, 2019, 6:46pm

hmmm, is it takes more time for tuning or more time in general?

AlexAndorra · November 1, 2019, 8:04pm

Mmmh, good question Is there a way to check the tuning time in PyMC? I only see the total time, and it definitely takes more – 5h against 3h30 for the centered model.
Actually, the non-centered sampler did not even converge…
Does this information help diagnosing the problem?

junpenglao · November 2, 2019, 7:41am

There are a few ways to benchmark your model, for example you can try https://docs.pymc.io/notebooks/profiling.html

AlexAndorra · November 2, 2019, 1:00pm

Thanks Junpeng! Just did that on the two models, but don’t really know how to interpret the output… Do you have any tips on where I should focus my attention?

junpenglao · November 3, 2019, 7:44am

Usually, I will look for the ops and check the ones that takes the longest time.
I have to say it is the easiest thing to do, as some times a small change in your model results in a completely different graph from theano, and I have yet to figure out why…

AlexAndorra · November 3, 2019, 11:16am

Thanks Junpeng! So it seems that 80% of ops time come from (results of m_slope.profile(m_slope.logpt).summary()):

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.1%    32.1%       0.603s       1.00e-04s     Py    6000        6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
  20.2%    52.3%       0.380s       3.17e-05s     Py    12000       12   AdvancedSubtensor
  11.1%    63.4%       0.209s       3.49e-05s     Py    6000        6   AdvancedIncSubtensor{inplace=False,  set_instead_of_inc=True}
   3.9%    67.3%       0.073s       1.22e-05s     Py    6000        6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
   3.2%    70.5%       0.060s       1.20e-05s     C     5000        5   IncSubtensor{Inc;int64}
   3.0%    73.5%       0.056s       9.41e-06s     C     6000        6   IncSubtensor{InplaceInc;int64::}
   2.9%    76.4%       0.055s       9.13e-06s     C     6000        6   AdvancedIncSubtensor1{no_inplace,set}
   2.9%    79.3%       0.054s       9.01e-06s     C     6000        6   CumOp{None, add}

It looks like the classes LinearAlgebra, AdvancedSubtensor and AdvancedIncSubtensor are taking the most time, but I’m not sure how to improve that…
I see these two tips at the end of the output:

  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

Does it help in your experience – or would it in this case?
A huge thank you again for your time

Topic		Replies	Views
Modelling Categorial Variable Questions	2	706	October 17, 2019
Multinomial hierarchical regression with multiple observations per group ("Bad energy issue") Questions	22	3385	March 14, 2019
Literature on implementing a Zero-Inflated Beta likelihood Questions	19	2329	January 18, 2019
Marginalizing out a categorical variable Questions	18	2237	April 1, 2021
Indexing using a free random variable Questions	26	6442	October 16, 2019

How to implement Zero-Inflated Multinomial?

Related topics