Issues with specification of coefficients in Hierarchical Softmax Regression

Hello,

I am trying to fit a Hierarchical Softmax Regression Model, but I am struggling with specifying the beta coefficients.

If I create a list of betas for each participant (option #3) I can get the model to run, but if instead I try to specify the ‘equivalent’ shape inside a pm.Normal (option #1), or to stack a bunch of pm.Normal in a pm.Deterministic (option #2) I cannot get the model to run.

I have the feeling that either specification would correspond to the same model, but I must be wrong. The reason I would prefer options #1 and #2 is that it makes it easier to plot the posterior distributions as well as making the graphviz model tidier (for option #1).

import pymc3 as pm
import theano.tensor as tt

# Assumes `weights`, `participants`, the per-class design matrices `x1`, `x2`, `x3`,
# and the one-hot outcomes `y` are already defined from the data.
with pm.Model() as m_pp:    # partial pooling
    beta = pm.Normal('beta', 0, 10, shape=len(weights))
    spread = pm.Exponential('spread', .1, shape=len(weights))

    # These do not work
    # betas = pm.Normal('betas', beta, spread, shape=(participants, len(weights)))
    # betas = pm.Deterministic('betas', tt.stack([pm.Normal(f'beta[{i}]', beta, spread, shape=len(weights)) for i in range(participants)]))

    # This does
    betas = [pm.Normal(f'beta[{i}]', beta, spread, shape=len(weights)) for i in range(participants)]

    # Per-class linear predictors: the same participant-level coefficients are
    # combined with the class-specific features x1, x2, x3
    mu1 = pm.Deterministic('mu1', tt.concatenate([pm.math.dot(x1[i], betas[i]) for i in range(participants)]))
    mu2 = pm.Deterministic('mu2', tt.concatenate([pm.math.dot(x2[i], betas[i]) for i in range(participants)]))
    mu3 = pm.Deterministic('mu3', tt.concatenate([pm.math.dot(x3[i], betas[i]) for i in range(participants)]))
    mus = pm.Deterministic('mus', tt.transpose(tt.stack((mu1, mu2, mu3))))

    theta = pm.Deterministic('theta', tt.nnet.softmax(mus))
    yl = pm.Multinomial('y', n=1, p=theta, observed=y)

    trace_pp = pm.sample(500, target_accept=.85)
pm.model_to_graphviz(m_pp)
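For clarity, here is a stripped-down, self-contained sketch of what I mean by option #1, on hypothetical synthetic data (the dimensions, X and the pid index are placeholders, not my actual design): the (len(weights),) hyperpriors broadcast over a 2-D betas array, and each row of the design matrix picks up its participant's coefficients.

import numpy as np
import pymc3 as pm

# Hypothetical toy dimensions and data -- not the actual design matrices
participants, n_obs, n_weights = 5, 40, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(participants * n_obs, n_weights))   # stacked design matrix
pid = np.repeat(np.arange(participants), n_obs)          # participant index per row

with pm.Model() as sketch:
    beta = pm.Normal('beta', 0, 10, shape=n_weights)
    spread = pm.Exponential('spread', .1, shape=n_weights)
    # Option #1: one 2-D RV, broadcasting the (n_weights,) hyperpriors over participants
    betas = pm.Normal('betas', beta, spread, shape=(participants, n_weights))
    # Vectorized linear predictor: each row uses its participant's coefficients;
    # this would then feed into the per-class mus / softmax as in the full model
    mu = (betas[pid] * X).sum(axis=-1)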

I just wrote a notebook extending Osvaldo Martin’s multinomial regression example – I think this will help you.
Two potential issues that can cause sampling problems here:

  • I think your priors are not regularizing enough – the exponential in the softmax completely distorts the outcome space
  • You seem to have three categories in total. If that is the case, the model is over-parametrized (the last category’s probability can be derived from the other two)

I show how to deal with both types of issues in the NB :wink:

I am still playing with the priors, but I don’t see how that explains why one definition of the coefficients works and the others fail. I have tried multiple priors, and I always find that setting the betas with #3 works while the others do not.

I am not sure why the exponential hyperprior for the scale parameter would ‘distort’ anything. It is merely there to define the pooling strength between individual coefficients.

I have been thinking about the over-parametrization for a while. I am not sure it affects my model, but I am not sure it does not either. Notice that the model is unusual in that the coefficients are shared across the 3 classes, and what changes are the features of each class. I have not found previous examples using this approach, so I am not sure whether removing one of the classes is the correct decision, or whether it is a problem at all… Whenever I set one class to zero (which in my case is achieved by setting X1 to zero, not the betas) the model seems to run at the same speed and arrive at the same conclusions.

Thank you for your input so far. That notebook seems helpful in general, although it does not seem to address my specific problems.

Oh yeah, sorry if I was unclear: the values of the priors are not what makes the model fail here. I was just anticipating a potential issue: a softmax link forces regularizing priors on you, otherwise sampling can be very hard / impossible.

The linked NB should help you with that – there are multiple predictors and categories, so it’s a pretty general case. Doing the prior predictive checks with only a logit link and one predictor is usually helpful (as your priors for each category are the same anyway).
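If it helps, here is a minimal sketch of that kind of prior predictive check on a hypothetical single predictor with a logit link (the data and prior scales are placeholders): comparing something wide like Normal(0, 10) against Normal(0, 1) makes the floor/ceiling effect in the implied probabilities very visible.

import numpy as np
import pymc3 as pm
from scipy.special import expit

# Hypothetical single predictor, only for the prior predictive check
x = np.linspace(-2, 2, 50)

with pm.Model():
    a = pm.Normal('a', 0., 1.)   # try sd=10 to see what a wide prior implies
    b = pm.Normal('b', 0., 1.)
    prior = pm.sample_prior_predictive(500)

# Implied prior probability curves, shape (500, 50)
p = expit(prior['a'][:, None] + prior['b'][:, None] * x)
# With sd=10 most curves sit glued to 0 or 1; with sd=1 they spread over (0, 1)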

Sorry, I wasn’t clear: the exponential hyperprior is not the issue, it’s the softmax link (which contains exponentials). But you would have the same problem with any other non-linear link function (logit, log…): these create floor and ceiling effects that distort the mapping between the parameter space and the outcome space.
E.g. logistic(0) = 0.5 and logistic(1) = 0.73, but logistic(4) = 0.98 and logistic(5) = 0.99. In both cases your parameter increases by one, but your outcome (a probability here) increases much less in the second case – because an already very probable event can’t really get that much more probable. The same thing happens with a softmax link, only across all the categories. If you’re still uncomfortable with that, just do prior predictive checks – they are invaluable.
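To make those numbers concrete:

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# The same +1 step in parameter space...
print(logistic(1.0) - logistic(0.0))   # ~0.23
print(logistic(5.0) - logistic(4.0))   # ~0.01, much smaller near the ceiling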

If you put all the categories in the linear equations, the model will be over-parametrized. Note that this does not break any statistical rule – it can just make sampling (very) inefficient, so it’s something to keep in mind. In addition to the first NB, here is another one where both kinds of softmax regression are shown (it starts at Code 10.56).
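And just so the two parameterizations are concrete, here is a minimal sketch of the N-1 (pivot) version on hypothetical toy data (one predictor, 3 categories; the data are placeholders): the last category’s linear predictor is pinned to zero and the softmax recovers its probability from the other two.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Hypothetical toy data: n observations, one predictor, K categories
n, K = 200, 3
rng = np.random.default_rng(1)
x = rng.normal(size=n)
y_obs = rng.multinomial(1, [1/3, 1/3, 1/3], size=n)   # placeholder one-hot outcomes

with pm.Model() as pivot_model:
    # Coefficients for the first K-1 categories only; the last one is pinned to 0
    a = pm.Normal('a', 0., 1.5, shape=K - 1)
    b = pm.Normal('b', 0., 1.5, shape=K - 1)
    eta = a + b * x[:, None]                                    # (n, K-1) linear predictors
    eta_full = tt.concatenate([eta, tt.zeros((n, 1))], axis=1)  # pivot category -> 0
    p = tt.nnet.softmax(eta_full)
    pm.Multinomial('y', n=1, p=p, observed=y_obs)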

In my simulations I have found the sampling to be as efficient in this case as with the N-1 categories parameterization. In addition, I find the output more useful because it recovers the original coefficients.

The same seems to be the case with the model you shared in the Jupyter Notebook (that notebook is very useful btw). Here is my simulation of that same model. You can see that the second model runs as fast as the first one, and that it correctly retrieves the original beta coefficient (which the first does not, as expected, since it is parameterized differently). I am starting to think that when the coefficients are shared it makes less sense to pivot the categories.

I agree, the trick of N-1 categories always feels weird (I use these models very often and my brain always complains :laughing:) and makes the parameters even harder to interpret. So I’d say that when sampling the over-parametrized model is not an issue, you should use this one – your model will be easier to share and understand.

I don’t want to be too definitive here, but I think this doesn’t matter. The bottom line is that multinomial regression is mathematically defined by N-1 categories, just as a binomial model only needs 1 of the 2 probabilities (you could put both, but that wouldn’t be efficient).
Sometimes you can get away with over-parametrization: your data are small, or they happen to be arranged in a kind way for you, just by chance. But one of these days they won’t be, and then the N-1 trick is useful. In other words, don’t use it if you don’t have to – but sometimes you have to :tired_face:

Yes, let me narrow down my claim. In my model it seems to be particularly okay, but this is due to the type of input/data I have. For instance, most of my input variables are binary variables which can only be true for one of the 3 categories in each observation.
