Advice on a hierarchical model

I have a three level hierarchical model that has:

  1. hyper-hyper parameters (for populations)
  2. hyper parameters (for sub-populations)
  3. measurements

The measurements are N(mu, sigma), where mu and sigma are themselves sampled from the hyper-parameters.
The hyper-parameters for each sub-population are mu ~ N(hyper-mu, hyper-sigma) and sigma ~ Gamma(mu=hyper-sd-mean, sigma=hyper-sd-sd).
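Stripped down, and with placeholder names, priors, and data, the structure looks roughly like this:

import numpy as np
import pymc3 as pm

# Placeholder data: a few sub-populations with a handful of measurements each.
n_subpops, n_obs = 4, 20
data = np.random.normal(loc=5.0, scale=2.0, size=(n_subpops, n_obs))

with pm.Model() as model:
    # Hyper-hyper parameters (population level); the priors here are made up.
    hyper_mu = pm.Normal('hyper_mu', mu=0, sd=10)
    hyper_sigma = pm.HalfNormal('hyper_sigma', sd=5)
    hyper_sd_mean = pm.HalfNormal('hyper_sd_mean', sd=5)
    hyper_sd_sd = pm.HalfNormal('hyper_sd_sd', sd=5)

    # Hyper parameters (sub-population level).
    mu = pm.Normal('mu', mu=hyper_mu, sd=hyper_sigma, shape=n_subpops)
    sigma = pm.Gamma('sigma', mu=hyper_sd_mean, sd=hyper_sd_sd, shape=n_subpops)

    # Measurements.
    pm.Normal('measurement', mu=mu[:, None], sd=sigma[:, None], observed=data)
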
I have been having trouble with divergences, so I first reformulated the mus using the recommended non-centered parameterization:

offset = pm.Normal('offset', mu=0, sd=1)
mu = pm.Deterministic('mu', hyper_mu + offset * hyper_sigma)
pm.Normal('measurement', mu=mu, sd=sd)

That worked, but didn’t solve my divergence problem. So I moved to the next step, and reformulated the sd:

sigma_offset = pm.Normal('sigma offset', mu=0, sd=1)
# largest() clamps its first argument from below (here: keeps sigma non-negative)
sigma = pm.Deterministic('sigma', largest(hyper_sigma_mu + sigma_offset * hyper_sigma_sd, 0))

But when I added this, my chains started failing with this error:

ValueError: Bad initial energy: inf. The model might be misspecified.

It’s quite possible either (a) there’s a deep reason behind this which I don’t know or (b) I have done something stupid (like a typo).
Can anyone suggest either what is wrong with reformulating the standard-deviation hyper-parameters like this, or how to track down whatever bug there is? I suppose the latter would mean somehow inspecting the probability function so I can see where my code goes wrong.
I’m lost here and would be very grateful for any guidance.

Perhaps you could try sampling from the prior? pm.sample_prior_predictive will generate samples from your specified priors, and you can manually inspect them to see if they make sense. There could be something wrong with how your model specifies the priors.
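Roughly something like this (assuming your model lives in a model context variable; the variable names are just whatever you used):

with model:
    prior = pm.sample_prior_predictive(samples=500)

# `prior` is a dict mapping variable names to arrays of shape (n_samples, ...),
# so you can eyeball the ranges directly, e.g. whether a sigma can reach zero.
for name, draws in prior.items():
    print(name, draws.shape, draws.min(), draws.max())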

In fact, this is one of our FAQs; perhaps you could look here for more info: Frequently Asked Questions

Thanks. I have been doing that. I saw that my variance parameters could hit zero, which could cause some of my divergences and chain failures. I have eliminated those, but my divergences remain. In traceplot() output, in the frequency charts, I see that some of the variables, for some of the chains, have ‘spikes’: very tight points with high frequencies. Am I right in thinking that these are suggestive of divergences?

Thanks. I looked at the FAQ (it might be a good idea to link to it from the PyMC3 docs), but I’m not sure how to proceed further: I have removed the chain failures (by ensuring that the variance parameters never go to zero), but my divergences remain.

Is it correct to think that this could be because my variance parameters are still too low, and that’s why the model gets to a point where the gradient is enormous?

Thanks

+1 to adding the FAQ to the docs! I’ll keep that in mind for a future PR :slightly_smiling_face:

Here are some tips on divergences from my blog post that seem applicable to your situation. I might’ve already linked you to the post, but I can’t exactly recall, so better safe than sorry.

  • Inspect the pair plots of your variables, a pair at a time (if you have a plate of variables, it’s fine to pick a couple at random). They’ll tell you whether two random variables are correlated, and help identify any troublesome neighborhoods in the parameter space (divergent samples will be colored differently, and will cluster near such neighborhoods). A sketch of this, together with the sampler settings from the next two bullets, follows this list.
  • Maybe try increasing target_accept? Usually 0.9 is a good number (currently the default in PyMC3 is 0.8). This will help get rid of false positives from the test for divergences. However, divergences that don’t go away are cause for alarm.
  • Increasing tune can sometimes help as well: this gives the sampler more time to 1) find the typical set and 2) find good values for step sizes, scaling factors, etc. If you’re running into divergences, it’s always possible that the sampler just hasn’t started the mixing phase and is still trying to find the typical set.
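A rough sketch of the above, using the variable names from your snippets (on older PyMC3 versions target_accept has to be passed via nuts_kwargs, and the pair plot lives in pm.pairplot rather than ArviZ):

import arviz as az
import pymc3 as pm

with model:
    # Longer tuning and a higher acceptance target.
    trace = pm.sample(draws=2000, tune=2000, target_accept=0.9)

# Pair plot with divergent samples highlighted, a couple of variables at a time.
az.plot_pair(trace, var_names=['mu', 'offset'], divergences=True)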

Thanks for the advice. It seems like nudging my variance parameters (and hyper-parameters) up was necessary to fix the problem. Once I ruled out a sigma of 0, the divergences went away.
One remaining problem that I haven’t read about: I am getting complaints that there are not many effective samples for the variables I added in the reparameterization (i.e., the normals used as offsets). Any intuitions about whether this is a real problem or not?

By the way, sampling from the priors was very helpful. But it raised an issue with PyMC3 (at least I think it’s an issue). I don’t believe there’s a plotting function for the output of sample_prior_predictive (if there is one, I didn’t find it in the docs). The sampler was neat and its output helpful, but I had to spend a few hours getting the plotting set up. In particular, auto-detecting and dealing with vector-valued random variables was a pain; good thing I didn’t have any matrix-valued ones!
This would be a great feature to add. If it would help to see mine, I’d be happy to share it, but I’m not a good enough pyplot user to contribute it as a PR.
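The core of it is roughly this (simplified, and making no claim to idiomatic matplotlib):

import matplotlib.pyplot as plt
import numpy as np

def plot_prior_samples(prior_samples):
    """Histogram each variable returned by pm.sample_prior_predictive.

    prior_samples: dict of {name: array of shape (n_samples, *var_shape)}.
    """
    for name, values in prior_samples.items():
        values = np.asarray(values)
        if values.ndim == 1:
            plt.hist(values, bins=50)
        else:
            # Vector-valued RV: flatten trailing dims, one histogram per component.
            flat = values.reshape(values.shape[0], -1)
            for j in range(flat.shape[1]):
                plt.hist(flat[:, j], bins=50, alpha=0.5, label='%s[%d]' % (name, j))
            plt.legend()
        plt.title(name)
        plt.show()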

Again quoting straight from the blog post (we really should port all these tips into an FAQ, shouldn’t we?):

The number of effective samples is smaller than XYZ for some parameters.

  • Quoting Junpeng Lao on discourse.pymc3.io: “A low number of effective samples is usually an indication of strong autocorrelation in the chain.”
  • Make sure you’re using an efficient sampler like NUTS (and not, for instance, Metropolis-Hastings; I mean seriously, it’s the 21st century, why would you ever want Metropolis-Hastings?).
  • Tweak the acceptance probability (target_accept): it should be large enough to ensure good exploration, but small enough to not reject all proposals and get stuck. (A quick way to see which variables are worst off is sketched just below.)
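For instance, to find the worst offenders (the effective-sample-size column is n_eff in older PyMC3 and ess_bulk once the summary comes from ArviZ):

import pymc3 as pm

summary = pm.summary(trace)
ess_col = 'n_eff' if 'n_eff' in summary.columns else 'ess_bulk'
# Sort ascending so the most autocorrelated parameters come first.
print(summary.sort_values(ess_col).head(10))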

Also, I think that prior visualization will be moved to ArviZ; see this issue. But we’d appreciate any code you think would be good to contribute!

Thanks again. Sorry to have taken so long to respond. Your advice about divergences was very helpful, and between your site and the two articles and the lecture video it pointed me to, I was able to solve my divergence problems.

A quick follow-up – do you know of resources like that for dealing with failure to get sufficient effective samples?

Here’s my case: I did the divergence trick of replacing

sigmaRVs = pm.Gamma(self.sigma[i],
                    mu=hyper_sigmaRVs_mu[i],
                    sd=hyper_sigmaRVs_sigma[i],
                    shape=self.num_replicates)

with this:

sigmaRVs = [pm.Deterministic('%s replicate %d' % (self.sigma[i], j),
                             largest(hyper_sigmaRVs_mu[i] +
                                     sigma_offset[j] * hyper_sigmaRVs_sigma[i],
                                     0.3))
            for j in range(self.num_replicates)]

The good news is that this solved my divergence problem (well, that and raising my variance priors, which were too low).

The bad news is that the sigmaRVs in the new models are getting hardly any effective samples. The sigma_offset RVs get between 75 and 150 effective samples in chains of 3000, and the sigmaRVs approximately 2!

These sigmas, and the corresponding mus, are the parameters of a normal RV that is observed.

The mu RVs show more effective samples, but the same relative relationship: their offset variables have about two orders of magnitude more effective samples. The offset RVs have about 2200 effective samples in 4 chains of 1000 samples each, while the mu RVs have only 20-25.

Can anyone suggest something I might do to this model to improve things, or do I just need more iterations?

Thanks