Slowness in multinomial softmax regression

Hi all! Are there any news on this front (either with ADVI or other methods)?
I think I’m confronted to a similar problem with a multinomial softmax regression: The model samples smoothly - no divergences and chains look good - although very slowly (2 hours and a half for 8000 samples on my machine). Priors have been carefully checked with prior predictive checks.

I first thought that sampling was hurt by some of the categories spending time near zero, but after some tests, I think the biggest burden on sampling speed is the amount of data (not that big, but high dimensions: 1889 rows for 7 categories).

So my question is two-fold:

  • Although the chains look good and there are no divergences, is there still a way to optimize the model for sampling?
  • Or is it just MCMC having a hard time sampling in high dimensions? In that case, which solution do you recommend? Running on a virtual machine?

Thank you very much for your help :blush:

I moved your thread into a standalone post.

In your model, the softmax actually render a and sigma_dpt useless, as softmax normalized the vector p. Essentially, the slowness of the model is from the unidentifiable introduced by the softmax and p being over-parameterized.

What I would do for model like this would be something like this:

import theano.tensor as tt

with pm.Model() as m_itcpt_pot:
    a_dpt = pm.Normal("a_dpt", 0., 1., shape=(Ndpts, Nparties-1))
    a = pm.Deterministic("a", tt.mean(a_dpt, axis=0)) 
    sigma_dpt = pm.Deterministic("sigma_dpt", tt.std(a_dpt))
    p = tt.nnet.softmax(tt.horizontal_stack(tt.zeros((len(dpt_id), 1)), a_dpt[dpt_id]))
    R = pm.Multinomial("R", n=N, p=p, observed=R_obs)
    trace_itcpt_pot = pm.sample(1000, tune=500, init='adapt_diag')
az.summary(trace_itcpt_pot, round_to=2, var_names=["a", "sigma_dpt"])
1 Like

Thanks Junpeng! Although I don’t really understand why the softmax makes a and sigma_dpt useless, I think I get what you’re doing: a_dept now has 6 categories, so that p is no longer over-parametrized. And then, with tt.horizontal_stack I guess you’re recovering the last category?
Just tried it, works like a charm in less than 2 minutes :tada:

My concern though is that I seem to lost the hierarchical structure now, as the clusters’ population (a and sigma_dpt) is no longer estimated simultaneously with each cluster (`a_dept). As a result, there would be no shrinkage of the estimates. Did I miss something?
(Also, I’m not sure how to extend this model to varying slopes, but that’s for later :stuck_out_tongue_winking_eye:)

1 Like

There are a few posts on multinomial regression in discourse that you can search and take a look.

Thank you Junpeng! So, after hours of trials, errors and introspection, I think I managed to get the model running with the hierarchical structure:

with pm.Model() as m_itc_var:
    a = pm.Normal("a", -1.8, 0.1, shape=Nparties - 1)
    a_f = pm.Deterministic("a_f", tt.concatenate([a, [0]]))
    sigma_dpt = pm.Exponential("sigma_dpt", 1.0)

    a_dpt = pm.Normal("a_dpt", a, sigma_dpt, shape=(Ndpts, Nparties - 1))
    a_dpt_f = pm.Deterministic(
        "a_dpt_f", tt.horizontal_stack(a_dpt, tt.zeros(shape=(Ndpts, 1)))
    p = tt.nnet.softmax(a_dpt_f[dpt_id])

    R = pm.Multinomial("R", n=N, p=p, observed=R_obs)
    trace_itc_var = pm.sample(
        1000, tune=2000, cores=2, random_seed=RANDOM_SEED, init="adapt_diag"

Sampling is smooth, without divergence - note that I stacked the column of zeros at the end of the matrix instead of the beginning, which seems to slightly slow down sampling.

However, I don’t see how to do prior/posterior predictive checks now: artificially setting one of the categories to 0 biases the p vector, as the softmax doesn’t know the left-most zero is merely a placeholder. I don’t see a way to tell that to scipy’s softmax function. How can I then recover the real p vector?
As a sidenote, is that done under the hood in tt.nnet.softmax? Otherwise it seems to me that p and R will be biased - at least for prior checks.

After some reading, I found this very useful explanation, that seems to mathematically justify fixing one category to 0:

Not all of the \beta_k vectors of coefficients are uniquely identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only K − 1 separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector to all of the coefficient vectors, the equations are identical.
Essentially, we set the constant so that one of the vectors becomes 0, and all of the other vectors get transformed into the difference between those vectors and the vector we chose.

As it arises naturally along this process, the softmax transformation therefore doesn’t seem to bias the probability vector in favor of (or away from) the pivot category.
I also read about the same process in chapter 10 of McElreath’s Rethinking.

All of that is clearer for me now, but it’s not clear, so if someone has other reading recommendations, I’m all ears :vulcan_salute:

1 Like

What I usually do is to manually add a column of zeros to the trace of p and then use scipy’s softmax.

1 Like

Yeah that’s what I did, but I was concerned that would bias the p vector in favor of (if all the other scores are below 0) or away from (if all the other scores are above 0) the category that was artificially fixed to 0 - as the softmax function is not aware that the left (right)-most 0 is just a placeholder for the pivot category.
Apparently it does not, as the softmax comes at the end of the process: the K-1 regressions have already been run and the pivot category’s probability is therefore entirely determined.

1 Like