# Multinomial hierarchical regression with multiple observations per group ("Bad energy issue")

Hi everyone! To try my hand at hierarchical models with PyMC, I’m trying to adapt @twiecki’s model to a multinomial case with 7 continuous regressors encoded in the variable `X`. The multinomial observations are in `y`, with 7 possible categories.
There are ~2000 observations in total, broken down by county (~85 counties), which constitute the groups in the model. Thus, there are multiple observations per group.
Here is my work in progress:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as multinom_multivariate:

    mu_a = pm.Normal('mu_a', mu=0., sd=100.)
    sigma_a = pm.HalfCauchy('sigma_a', 5.)
    mu_b = pm.Normal('mu_b', mu=0., sd=100.)
    sigma_b = pm.HalfCauchy('sigma_b', 5.)

    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_counties, n_categs))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_counties, n_categs, n_regressors))

    results_est = a[counties_idx, :] + pm.math.sum(b[counties_idx, :, :] * np.expand_dims(X, axis=1), axis=2)
    p = pm.Deterministic('p', tt.nnet.softmax(results_est))

    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=p, observed=y)

    trace = pm.sample(cores=4)
```

I can’t find what’s misspecified, although it obviously is, since I can’t even sample:

I went back and checked: my X and y don’t have any NaNs or 0s, so I can’t see why there would be an infinite value in the log-probability…
Does anyone have any thoughts? That would be much appreciated! A huge thanks in advance

Have you tried the solution in the FAQ?

Also, sometimes there is a problem with the init method, so try `trace = pm.sample(..., init='adapt_diag')` as well.

1 Like

Thanks for your answer Junpeng Lao! Yeah, I tried them all (although I’m not sure I could pass `obj_optimizer` to `pm.sample`, but it didn’t raise any error): none fixed the issue… Same for `init='adapt_diag'`

But looking into `likelihood.tag.test_value`, I noticed there are 7 values (out of ~2000) that are `< 0`! Which is troubling for a Multinomial
Will investigate and come back to you

So, after much investigation: it turns out my raw data were so corrupt that they contained 7 negative values! This was so unimaginable to me that I didn’t even consider the possibility, although I took care of removing `NaN`s and `0`s
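For anyone hitting the same wall: a quick sanity check on the inputs would catch this kind of corruption early. A minimal numpy sketch, with hypothetical `X` and `y` arrays standing in for the real data (the helper `find_bad_rows` is mine, not part of the model):

```python
import numpy as np

# hypothetical stand-ins for the real data: X (regressors), y (counts)
X = np.zeros((2, 7))
y = np.array([[3, 5, 0],
              [2, -1, 4]])  # one corrupt negative count hiding in row 1

def find_bad_rows(X, y):
    """Return indices of rows with non-finite regressors or negative/non-finite counts."""
    bad = (~np.isfinite(X).all(axis=1)
           | ~np.isfinite(y).all(axis=1)
           | (y < 0).any(axis=1))
    return np.flatnonzero(bad)

print(find_bad_rows(X, y))  # -> [1]
```

Rows flagged here can then be inspected or dropped before handing the data to the model.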

Thus, the “Bad energy” issue is gone, but I’m now in quite a pickle, since the trace stops at ~15%, telling me that the “mass matrix contains zeros on the diagonal”:

I tried reparametrizing the model to a non-centered version (as advised here) and diminished the hyperpriors (as advised here), but the issue is still there.
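For context, the non-centered reparametrization mentioned here just rewrites `a ~ Normal(mu_a, sigma_a)` as a deterministic transform of a standard normal draw. A minimal numpy sketch of the equivalence (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
mu_a, sigma_a = 2.0, 0.5

# centered form:      a ~ Normal(mu_a, sigma_a)
# non-centered form:  a_offset ~ Normal(0, 1);  a = mu_a + sigma_a * a_offset
a_offset = rng.normal(0.0, 1.0, size=100_000)
a = mu_a + sigma_a * a_offset

# same distribution, but the sampler explores the well-behaved
# standard-normal geometry instead of the hierarchical "funnel"
print(a.mean(), a.std())
```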
Any ideas on what to do next? I’m quite lost here…

1 Like

This is usually an issue of the hyperprior being too flat, which results in the gradient being too small. Try:

```python
mu_a = pm.Normal('mu_a', mu=0., sd=1.)
sigma_a = pm.HalfNormal('sigma_a', 2.5)
mu_b = pm.Normal('mu_b', mu=0., sd=1.)
sigma_b = pm.HalfNormal('sigma_b', 2.5)
```

Also, because of the softmax your model is unidentifiable, so you really need to place strong hyperpriors on your parameters.

2 Likes

Thanks a lot! Looks like it’s working: the trace is currently at 22% and running.
Although it’s taking forever to sample (3 hours left), so I suspect the model is misspecified in some way…

Also, I get the gist of it, but can you elaborate on what you mean by “unidentifiable”, please? And your switch from a HalfCauchy to a HalfNormal for the sigmas aims to get stronger hyperpriors, right?

“Unidentifiable” refers to the normalization in the softmax - you might have two vectors that are different but become the same after softmax.
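To make that concrete: softmax is invariant to adding the same constant to every logit, so `v` and `v + c` map to identical probabilities. A plain numpy sketch (the `softmax` helper here is just an illustration, not a PyMC function):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

v = np.array([1.0, 2.0, 3.0])
# shifting every logit by 100 leaves the probabilities unchanged
print(np.allclose(softmax(v), softmax(v + 100.0)))  # -> True
```

This is why the likelihood alone cannot pin down the scale/location of the coefficients, and why strong priors (or modelling only `k-1` categories) are needed.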

The HalfCauchy to HalfNormal change is because the HalfCauchy has too heavy a tail and is difficult to sample.
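The difference in tail weight is easy to see by simulation: half-Cauchy draws (the absolute value of a standard Cauchy, rescaled) produce far more extreme values than half-normal draws with a comparable scale. A rough numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
half_cauchy = np.abs(rng.standard_cauchy(100_000)) * 5.0  # ~ HalfCauchy(beta=5)
half_normal = np.abs(rng.normal(0.0, 5.0, 100_000))       # ~ HalfNormal(sd=5)

# the extreme quantiles of the HalfCauchy are orders of magnitude larger,
# which makes it hard for NUTS to adapt a single step size to the posterior
print(np.percentile(half_cauchy, 99.9), np.percentile(half_normal, 99.9))
```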

1 Like

Yeah ok, I think I got it
Thanks a lot @junpenglao, reactive and helpful, as always!
Will keep you posted!

1 Like

Well that’s unexpected: the model runs with half Cauchy, but not with half Normal (“mass matrix contains zeros on the diagonal”). Any clue why? (I’m running a non-centered version of the model above)

Also, it took 10 hours to draw 24,000 samples and did not converge. That’s a sign I should completely rethink the model, isn’t it?

Do you see any divergences? What do the chains look like?

Yeah, there are divergences + the sampler did not converge + the GR statistic is > 1.4 + the effective number of samples is < 200 for some parameters, so basically nothing you’re hoping for…
I can’t show you the chains because I made the mistake of not saving the trace before rerunning the model with an increased `target_accept`… which never sampled.

But more broadly, I wondered if having hyperpriors centered around 0 made sense in this case (the `p` in the Multinomial can’t have values close to 0 or 1 in reality), so I changed them somewhat arbitrarily:
`mu_a = pm.Normal('mu_a', mu=2., sd=1.)`
`sigma_a = pm.HalfNormal('sigma_a', 2.5)`
`mu_b = pm.Normal('mu_b', mu=3.9, sd=1.)`
`sigma_b = pm.HalfNormal('sigma_b', 1.)`

Although the softmax transformation makes it hard to interpret the “philosophical” meaning of a, b and their hyperpriors, this dramatically improved sampling speed! So I guess I’m on the right track, but I still get the mass matrix error, so still can’t explore the chains:

I don’t know if there is anything special about it, but this 2854th derivative of `b` is always among those valued at zero, whatever the prior for `mu_b`. But I don’t understand where this 0 comes from… Do you think it has to do with the regressors (e.g. my `X` vector)?
If it’s of any use, here is the model’s test point:

Huge thanks in advance!

Definitely some problem with that row of data (input).

1 Like

Yeah I’m gonna investigate that, but 2 things are bugging me a priori:

1. My X vector doesn’t have 2854 rows: its shape is (1889, 7)…

2. When I switch `mu_b`'s mean from 3.9 to e.g. 9.0, other zero derivatives appear, which means the problem moves around (don’t know if it matters, but the input data are standardized):

So, the error being quite vague, I rewrote the model with just 1 observation per cluster & 1 regressor:

```python
with pm.Model() as simple_model:

    mu_a = pm.Normal('mu_a', mu=3., sd=1.)
    sigma_a = pm.HalfNormal('sigma_a', 2.)
    mu_b = pm.Normal('mu_b', mu=4., sd=1.)
    sigma_b = pm.HalfNormal('sigma_b', 2.)

    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_counties, n_categs))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_counties, n_categs))

    results_est = a + b * np.expand_dims(X.reg1, axis=1)
    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=probs, observed=y)
```

This does sample, but very inefficiently and the trace does not converge, as demonstrated by the chains:

So I tried a non-centered version:

```python
with pm.Model() as noncentered_model:

    a = pm.Normal('a', mu=0., sd=1., shape=n_categs)
    b = pm.Normal('b', mu=0., sd=1., shape=n_categs)
    sigma_counties_a = pm.HalfNormal('sigma_counties_a', 2., shape=n_categs)
    sigma_counties_b = pm.HalfNormal('sigma_counties_b', 2., shape=n_categs)

    a_counties = pm.Normal('a_counties', mu=0., sd=1., shape=(n_counties, n_categs))
    b_counties = pm.Normal('b_counties', mu=0., sd=1., shape=(n_counties, n_categs))

    # non-centered: county effects are scaled offsets around the common mean
    A = a + a_counties * sigma_counties_a
    B = b + b_counties * sigma_counties_b

    results_est = A + B * np.expand_dims(X.reg1, axis=1)
    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood', n=y.sum(axis=1), p=probs, observed=y)
```

And… the mass matrix error is back! Any idea of what I’m missing my dear @junpenglao? (sorry for the long post, I wanted to test lots of things before answering)

```
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b_counties, a_counties, sigma_counties_b, sigma_counties_a, b, a]
Sampling 4 chains:  23%|██▎       | 2810/12000 [02:06<44:56,  3.41draws/s]

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
    apoint, stats = self.astep(array)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 115, in astep
    self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
  File "/anaconda/envs/cdf/lib/python3.6/site-packages/pymc3/step_methods/hmc/quadpotential.py", line 201, in raise_ok
    raise ValueError('\n'.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV `b_counties`.ravel()[30] is zero.
The derivative of RV `b_counties`.ravel()[63] is zero.
The derivative of RV `b_counties`.ravel()[90] is zero.
The derivative of RV `b_counties`.ravel()[114] is zero.
The derivative of RV `b_counties`.ravel()[122] is zero.
The derivative of RV `b_counties`.ravel()[142] is zero.
The derivative of RV `b_counties`.ravel()[143] is zero.
The derivative of RV `b_counties`.ravel()[154] is zero.
The derivative of RV `b_counties`.ravel()[163] is zero.
The derivative of RV `b_counties`.ravel()[170] is zero.
The derivative of RV `b_counties`.ravel()[175] is zero.
The derivative of RV `b_counties`.ravel()[184] is zero.
The derivative of RV `b_counties`.ravel()[197] is zero.
The derivative of RV `b_counties`.ravel()[203] is zero.
The derivative of RV `b_counties`.ravel()[207] is zero.
The derivative of RV `b_counties`.ravel()[223] is zero.
The derivative of RV `b_counties`.ravel()[226] is zero.
The derivative of RV `b_counties`.ravel()[240] is zero.
The derivative of RV `b_counties`.ravel()[259] is zero.
The derivative of RV `b_counties`.ravel()[260] is zero.
The derivative of RV `b_counties`.ravel()[261] is zero.
The derivative of RV `b_counties`.ravel()[276] is zero.
The derivative of RV `b_counties`.ravel()[277] is zero.
The derivative of RV `b_counties`.ravel()[343] is zero.
The derivative of RV `b_counties`.ravel()[361] is zero.
The derivative of RV `b_counties`.ravel()[383] is zero.
The derivative of RV `b_counties`.ravel()[384] is zero.
The derivative of RV `b_counties`.ravel()[428] is zero.
The derivative of RV `b_counties`.ravel()[429] is zero.
The derivative of RV `b_counties`.ravel()[436] is zero.
The derivative of RV `b_counties`.ravel()[458] is zero.
The derivative of RV `b_counties`.ravel()[465] is zero.
The derivative of RV `b_counties`.ravel()[473] is zero.
The derivative of RV `b_counties`.ravel()[502] is zero.
The derivative of RV `b_counties`.ravel()[506] is zero.
The derivative of RV `b_counties`.ravel()[514] is zero.
The derivative of RV `b_counties`.ravel()[544] is zero.
The derivative of RV `b_counties`.ravel()[560] is zero.
The derivative of RV `b_counties`.ravel()[562] is zero.
The derivative of RV `b_counties`.ravel()[566] is zero.
The derivative of RV `b_counties`.ravel()[616] is zero.
The derivative of RV `b_counties`.ravel()[633] is zero.
The derivative of RV `b_counties`.ravel()[646] is zero.
The derivative of RV `a_counties`.ravel()[38] is zero.
The derivative of RV `a_counties`.ravel()[52] is zero.
The derivative of RV `a_counties`.ravel()[59] is zero.
The derivative of RV `a_counties`.ravel()[142] is zero.
The derivative of RV `a_counties`.ravel()[402] is zero.
The derivative of RV `a_counties`.ravel()[437] is zero.
The derivative of RV `a_counties`.ravel()[444] is zero.
The derivative of RV `a_counties`.ravel()[472] is zero.
The derivative of RV `a_counties`.ravel()[486] is zero.
The derivative of RV `a_counties`.ravel()[510] is zero.
The derivative of RV `a_counties`.ravel()[513] is zero.
The derivative of RV `a_counties`.ravel()[534] is zero.
The derivative of RV `b`.ravel()[2] is zero.
"""``````

It is a bit difficult to say without the data - what does it look like when you use a non-hierarchical model?

Your wish is my command @junpenglao: here is a gist
(let me know if something’s missing or messed up, but normally you can access everything)
Thank you!

Overall, your model is over-specified.

So my approach to this kind of problem is usually the following:

1. Understand the structure of the input. I usually do a pairplot of the predictors. In this case, it is clear that you have predictors that are common among departments, and you have N departments (usually treated as a random effect).

2. Start with a simple model that captures the structure of the data.
In your case, since it is a multinomial regression, you need to be careful about normalization in the softmax. I find that the most stable way is to model `k-1` categories and stack a column of 0s at the end:

```python
# make sure there is no missing data here
groupX = X_full.loc[idx['seinemaritime', :], :].iloc[:, :n_regressors-1].values

Ndpt = len(np.unique(dptmts_idx))
date_idx, date_names = X_full.index.get_level_values(1).factorize(sort=True)

with pm.Model() as hier_model:
    dpt_interp = pm.Normal('dpt_interp', 0., 1., shape=(Ndpt, n_parties-1))
    fixed_effect = pm.Normal('fixed', 0., 1., shape=(n_regressors-1, n_parties-1))
    random_effect = pm.Normal('random', 0., 1., shape=n_parties-1)

    results_est = dpt_interp[dptmts_idx] + \
        tt.dot(groupX, fixed_effect)[date_idx] + \
        X_full['chomage'].values[:, None] * random_effect
    results_est = tt.concatenate(tensor_list=[results_est,
                                              tt.zeros((X_full.shape[0], 1))],
                                 axis=1)

    probs = pm.Deterministic('probs', tt.nnet.softmax(results_est))
    likelihood = pm.Multinomial('likelihood',
                                n=y_full.sum(axis=1),
                                p=probs,
                                observed=y_full.values)
    # might need to increase tuning
    trace = pm.sample(1000, tune=1000, cores=4, random_seed=RANDOM_SEED)
```
3. Add hyperpriors. I don’t think it is too useful to put a hyperprior on the `mu`s of the coefficients (again, given the softmax), but it might help to put priors on the sigmas (even hierarchical ones).
1 Like

That’s awesome, thank you Junpeng Lao!
I’ll try that ASAP and come back to you
(sorry for the late answer, I was on holiday)

Hey! As promised, I tried the new specification: the sampling goes well (no divergences, good test point, good BFMI, good energy) and the model seems to capture the data a lot better!
However, the chains seem to explore a very small part of the space, so the posteriors are very thin (they are basically point estimates):

As a result, the model is both wrong and overconfident, as shown by pp checks:

Can it be due to the softmax transformation again? How can I improve this type of model - or at least introduce more uncertainty? Big thank you

The estimation does seem biased, but I wouldn’t say it is overconfident. I remember seeing something similar when I was using a multinomial hierarchical regression (the trace is less of a problem, by the way: it looks like point estimation but it is not, it’s just that different dimensions are on different scales). Likely it is regularized too strongly - try using a larger sd for the priors.

1 Like