Multinomial hierarchical regression with multiple observations per group ("Bad energy issue")


#1

Hi everyone! To try my hand at hierarchical models with PyMC, I’m trying to adapt @twiecki’s model to a multinomial case with 7 continuous regressors encoded in the variable X. The multinomial observations are in y, with 7 possible categories.
There are ~2000 observations in total, broken down by county (~85 counties), which constitute the groups in the model. Thus, there are multiple observations per group.
Here is my work in progress:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as multinom_multivariate:

    # hyperpriors
    mu_a = pm.Normal('mu_a', mu=0., sd=100.)
    sigma_a = pm.HalfCauchy('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sd=100.)
    sigma_b = pm.HalfCauchy('sigma_b', 5.)

    # varying intercepts and slopes per county and category
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_counties, n_categs))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_counties, n_categs, n_regressors))

    results_est = a[counties_idx, :] + pm.math.sum(b[counties_idx, :, :] * np.expand_dims(X, axis=1), axis=2)
    p = pm.Deterministic('p', tt.nnet.softmax(results_est))

    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=p, observed=y)

    trace = pm.sample(cores=4)

I can’t find what’s misspecified, although something obviously is, since I can’t even sample - the sampler fails with the “Bad initial energy” error:


I went back and checked: my X and y don’t have any NaNs or 0s, so I can’t see why there would be an infinite value in the log-probability…
Does anyone have any thoughts? That would be much appreciated! A huge thanks in advance :pray:


#2

Have you tried the solutions in the FAQ?

Also, sometimes there is a problem with the init method, so try trace = pm.sample(..., init='adapt_diag') as well.


#3

Thanks for your answer Junpeng Lao! Yeah, I tried them all (although I’m not sure I could pass obj_optimizer to pm.sample, it didn’t raise any error): none fixed the issue… Same for init='adapt_diag'.

But looking into likelihood.tag.test_value, I noticed there are 7 values (out of ~2000) that are < 0! Which is troubling for a Multinomial :thinking:
Will investigate and come back to you :wink:


#4

So, after much investigation: it turns out my raw data were so corrupt that they contained 7 negative values! That was so unimaginable to me that I didn’t even consider the possibility, although I took care of removing NaNs and 0s :joy:
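In retrospect, a quick sanity check on the observed counts catches this kind of corruption early; a minimal sketch with a hypothetical y:

```python
import numpy as np

# hypothetical observed counts with one corrupt (negative) entry
y = np.array([[3, 0, 2],
              [1, -1, 4],
              [5, 2, 0]])

# a valid Multinomial observation must have all counts >= 0
bad_rows = np.where((y < 0).any(axis=1))[0]
print(bad_rows)  # indices of rows containing negative counts
```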

Thus, the “Bad energy” issue is gone, but I’m now in quite a pickle, since the trace stops at ~15%, telling me that the “mass matrix contains zeros on the diagonal”:


I tried reparametrizing the model to a non-centered version (as advised here) and diminished the hyperpriors (as advised here), but the issue is still there.
Any ideas on what to do next? I’m quite lost here…


#5

This is usually an issue of the hyperprior being too flat, which results in the gradient being too small. Try:

    mu_a = pm.Normal('mu_a', mu=0., sd=1.)
    sigma_a = pm.HalfNormal('sigma_a', 2.5)
    mu_b = pm.Normal('mu_b', mu=0., sd=1.)
    sigma_b = pm.HalfNormal('sigma_b', 2.5)   

Also, because of the softmax, your model is unidentifiable, so you really need to place strong hyperpriors on your parameters.
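To see why: the softmax is invariant to adding a constant to every logit, so two different parameter vectors can produce exactly the same probabilities. A quick NumPy illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

v = np.array([1.0, 2.0, 3.0])
# shifting all logits by the same constant leaves the probabilities unchanged,
# so the logits themselves are not identified by the likelihood
print(np.allclose(softmax(v), softmax(v + 5.0)))  # True
```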


#6

Thanks a lot! Looks like it’s working: the trace is currently at 22% and running :vulcan_salute:
Although it’s taking forever to sample (3 hours left), so I suspect the model is misspecified in some way…

Also, I get the gist of it, but can you elaborate on what you mean by “unidentifiable”, please? And your switch from a HalfCauchy to a HalfNormal for the sigmas aims to get stronger hyperpriors, right?


#7

Unidentifiable refers to the normalization in the softmax - you might have two vectors that are different but become the same after the softmax.

The HalfCauchy to HalfNormal switch is because the HalfCauchy is too heavy-tailed and difficult to sample.
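A quick way to see the difference in tail mass, using SciPy with the two scales from the priors above (sketch):

```python
from scipy import stats

# probability mass beyond 50 for each prior on sigma
heavy = stats.halfcauchy(scale=5.0).sf(50.0)   # HalfCauchy(5) tail
light = stats.halfnorm(scale=2.5).sf(50.0)     # HalfNormal(2.5) tail
# the HalfCauchy still puts a few percent of mass out there;
# the HalfNormal puts essentially none
print(heavy, light)
```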


#8

Yeah ok, I think I got it :wink:
Thanks a lot @junpenglao, reactive and helpful, as always!
Will keep you posted!


#9

Well, that’s unexpected: the model runs with the HalfCauchy, but not with the HalfNormal (“mass matrix contains zeros on the diagonal”). Any clue why? (I’m running a non-centered version of the model above)

Also, it took 10 hours to draw 24,000 samples and did not converge. That’s a sign I should completely rethink the model, isn’t it?


#10

Do you see any divergences? What do the chains look like?


#11

Yeah, there are divergences + the sampler did not converge + the Gelman-Rubin statistic is > 1.4 + the effective number of samples is < 200 for some parameters, so basically nothing you’re hoping for :sweat_smile:
I can’t show you the chains because I made the mistake of not saving the trace before rerunning the model with an increased target_accept… which never sampled :expressionless:
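For reference, here is a simplified (non-split) sketch of the Gelman-Rubin statistic on synthetic chains, to show what a value > 1.4 indicates; PyMC3's pm.summary reports a more refined version:

```python
import numpy as np

def gelman_rubin(chains):
    """Simple (non-split) R-hat for chains of shape (n_chains, n_draws)."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))           # well-mixed chains: R-hat near 1
stuck = mixed + np.arange(4)[:, None]        # chains stuck at different offsets: R-hat large
print(gelman_rubin(mixed), gelman_rubin(stuck))
```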

But more broadly, I wondered whether having hyperpriors centered around 0 made sense in this case (in reality, the p in the Multinomial can’t have values close to 0 or 1), so I changed them somewhat arbitrarily:
mu_a = pm.Normal('mu_a', mu=2., sd=1.)
sigma_a = pm.HalfNormal('sigma_a', 2.5)
mu_b = pm.Normal('mu_b', mu=3.9, sd=1.)
sigma_b = pm.HalfNormal('sigma_b', 1.)

Although the softmax transformation makes it hard to interpret the “philosophical” meaning of a, b and their hyperpriors, this dramatically improved sampling speed! So I guess I’m on the right track, but I still get the mass matrix error, so I still can’t explore the chains:

I don’t know if there is something special about it, but this 2854th derivative of b is always among those valued at zero, whatever the prior for mu_b. But I don’t understand where this 0 comes from… Do you think it has to do with the regressors (e.g. my X matrix)?
If it’s of any use, here is the model’s test point:
test_point
Huge thanks in advance! :pray:


#13

Definitely some problem with that row of data (input).


#14

Yeah, I’m going to investigate that, but 2 things are bugging me a priori:

  1. My X matrix doesn’t have 2854 rows: its shape is (1889, 7)…

  2. When I switch mu_b's mean from 3.9 to e.g. 9.0, other 0 derivatives appear, which means the problem is moving around (I don’t know if it matters, but the input data are standardized):
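Note on point 1: the index in the error message refers to the flattened b tensor, not the rows of X, so np.unravel_index maps it back to (county, category, regressor) coordinates (a sketch assuming n_counties = 85, n_categs = 7 and n_regressors = 7 as in the original model):

```python
import numpy as np

# hypothetical shapes matching the model: b has shape (n_counties, n_categs, n_regressors)
n_counties, n_categs, n_regressors = 85, 7, 7

# map the flat index from the error message back to tensor coordinates
print(np.unravel_index(2854, (n_counties, n_categs, n_regressors)))  # (58, 1, 5)
```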


#15

So, the error being quite vague, I rewrote the model with just 1 observation per cluster & 1 regressor:

with pm.Model() as simple_model:
    mu_a = pm.Normal('mu_a', mu=3., sd=1.)
    sigma_a = pm.HalfNormal('sigma_a', 2.)
    mu_b = pm.Normal('mu_b', mu=4., sd=1.)
    sigma_b = pm.HalfNormal('sigma_b', 2.)

    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_counties, n_categs))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_counties, n_categs))

    results_est = a + b * np.expand_dims(X.reg1, axis=1)
    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=probs, observed=y)

This does sample, but very inefficiently, and the trace does not converge, as shown by the chains:


So I tried a non-centered version:

with pm.Model() as non_centered_model:
    a = pm.Normal('a', mu=0., sd=1., shape=n_categs)
    b = pm.Normal('b', mu=0., sd=1., shape=n_categs)
    sigma_counties_a = pm.HalfNormal('sigma_counties_a', 2., shape=n_categs)
    sigma_counties_b = pm.HalfNormal('sigma_counties_b', 2., shape=n_categs)

    a_counties = pm.Normal('a_counties', mu=0., sd=1., shape=(n_counties, n_categs))
    b_counties = pm.Normal('b_counties', mu=0., sd=1., shape=(n_counties, n_categs))

    # non-centered parametrization: scale standardized offsets by the group sigmas
    A = a + a_counties * sigma_counties_a
    B = b + b_counties * sigma_counties_b

    results_est = A + B * np.expand_dims(X.reg1, axis=1)
    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=probs, observed=y)

And… the mass matrix error is back! :partying_face: Any idea what I’m missing, my dear @junpenglao? (sorry for the long post, I wanted to test lots of things before answering)

Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b_counties, a_counties, sigma_counties_b, sigma_counties_a, b, a]
Sampling 4 chains:  23%|██▎       | 2810/12000 [02:06<44:56,  3.41draws/s]

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 115, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 201, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `b_counties`.ravel()[30] is zero.
The derivative of RV `b_counties`.ravel()[63] is zero.
The derivative of RV `b_counties`.ravel()[90] is zero.
[… 40 more `b_counties` indices and 12 `a_counties` indices with zero derivatives …]
The derivative of RV `b`.ravel()[2] is zero.
"""

#16

It is a bit difficult to say without the data - what does it look like when you use a non-hierarchical model?


#17

Your wish is my command @junpenglao: here is a gist :wink:
(let me know if something’s missing or messed up, but normally you can access everything)
Thank you! :pray:


#18

Overall, your model is over-specified.

My approach to this kind of problem is usually the following:

  1. Understand the structure of the input. I usually do a pairplot of the predictors. In this case, it is clear that you have predictors that are common among departments, and you have N departments (usually treated as a random effect).

  2. Start with a simple model that captures the structure of the data.
    In your case, since it is a multinomial regression, you need to be careful about the normalization in the softmax. I find that the most stable way is to model k-1 categories and stack a column of 0s at the end:

# make sure there is no missing data here
groupX = X_full.loc[idx['seinemaritime',:], :].iloc[:, :n_regressors-1].values

Ndpt = len(np.unique(dptmts_idx))
date_idx, date_names = X_full.index.get_level_values(1).factorize(sort=True)

with pm.Model() as hier_model:
    dpt_interp = pm.Normal('dpt_interp', 0., 1., shape=(Ndpt, n_parties-1))
    fixed_effect = pm.Normal('fixed', 0., 1., shape=(n_regressors-1, n_parties-1))
    random_effect = pm.Normal('random', 0., 1., shape=n_parties-1)
    
    results_est = dpt_interp[dptmts_idx] +\
                tt.dot(groupX, fixed_effect)[date_idx]+\
                X_full['chomage'].values[:, None]*random_effect
    results_est = tt.concatenate(tensor_list=[results_est, 
                                              tt.zeros((X_full.shape[0], 1))],
                                 axis=1)
    
    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood',
                                n=y_full.sum(axis=1),
                                p=probs,
                                observed=y_full.values)
    # might need to increase tuning
    trace = pm.sample(1000, tune=1000, cores=4, random_seed=RANDOM_SEED)
  3. Add hyperpriors. I don’t think it is too useful to put a hyperprior on the mus of the coefficients (again, given the softmax), but it might help to put priors on the sigmas (even hierarchical ones).
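The k-1 trick above removes the softmax’s shift redundancy: with the last category’s logit pinned at 0, the free logits can be recovered exactly from the probabilities. A quick NumPy check with hypothetical values:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # subtract max for numerical stability
    return e / e.sum()

free = np.array([0.3, -1.2])           # hypothetical k-1 = 2 free logits
full = np.concatenate([free, [0.0]])   # last category's logit pinned at 0
p = softmax(full)

# log-odds against the pinned category invert the softmax exactly,
# so the parametrization is identified
print(np.allclose(np.log(p[:-1] / p[-1]), free))  # True
```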

#19

That’s awesome, thank you Junpeng Lao!
I’ll try that ASAP and come back to you :wink:
(sorry for the late answer, I was on holiday)


#20

Hey! As promised, I tried the new specification: the sampling goes well (no divergences, good test point, good BFMI, good energy) and the model seems to capture the data a lot better! :partying_face:
However, the chains seem to explore a very small part of the space, so the posteriors are very narrow (they are basically point estimates):


As a result, the model is both wrong and overconfident, as shown by posterior predictive checks:

Could it be due to the softmax transformation again? How can I improve this type of model - or at least introduce more uncertainty? A big thank you :smiling_face_with_three_hearts:


#21

The estimation does seem biased, but I wouldn’t say it is overconfident. I remember that when I was using a multinomial hierarchical regression I also saw something similar (by the way, the trace is less of a problem than it looks - it looks like point estimation but it is not, it’s just that different dimensions are on different scales). Likely it is regularized too strongly - try using larger sds for the priors.