Question about PyMC3 inference in hierarchical model

Per my earlier emails, I have a hierarchical Bayesian model that has the following structure:

  1. The “meta model” with hyperparameters that are a normal (for mean) and a gamma (for stdev)
  2. The “noise model” also with hyperparameters that are a normal (for mean) and a gamma (for stdev)
  3. A normal whose mean and stdev are the sums of the normals and gammas, respectively.
  4. Approximately 2500 observed samples from the normal in item 3.

Here’s the term for 4:

output = pm.Normal('output',
                   mu=(muRV + noise_mu),
                   sd=(sigmaRV + noise_sigma),
                   observed=samples)

To my surprise, this gives very tight posterior bounds on the meta model and noise model parameters, even though their priors are fairly loose.
I am surprised because, if I understand this model generatively, we pick four values for the hyperparameters muRV, noise_mu, sigmaRV, and noise_sigma from their priors. Those four values constitute the parameters of the output variable, of which we have about 2500 samples.
Now, thinking about this generatively, we are getting a pretty good characterization of output, which seems to have a pretty tight distribution. But that should not translate into tight bounds on the four hyperparameters, because (a) we have very little information about the respective contributions from the meta model and the noise model, and (b) there are a lot of ways that a single normal random variable could be generated from the hyperparameters.

The hyperparameter priors seem reasonably weak to me:

  1. muRV ~ N(4.4, 2)
  2. sigmaRV (SD of the model normal) ~ Gamma(alpha=1.8, beta=2)
  3. noise_mu (noise mean) ~ N(0, 4)
  4. noise_sigma ~ Gamma(2, 0.5)

The values of output are logs, and vary between 0 and 6.

Maybe the prior variance on the noise term is too tight. But alternatively, it’s quite likely I’m doing something terribly wrong either in specifying the model or in doing the sampling (I just use default settings and 50K samples).
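
For reference, here is (roughly) the full model, assembled from the priors and the observed term above. This is a sketch: I'm reading Gamma(2, 0.5) as alpha=2, beta=0.5, and the pm.sample call just reflects "default settings and 50K samples".

import pymc3 as pm

# samples: array of ~2500 observed log-values, ranging from 0 to 6
with pm.Model() as model:
    # "meta model" hyperparameters
    muRV = pm.Normal('muRV', mu=4.4, sd=2)
    sigmaRV = pm.Gamma('sigmaRV', alpha=1.8, beta=2)

    # "noise model" hyperparameters
    noise_mu = pm.Normal('noise_mu', mu=0, sd=4)
    noise_sigma = pm.Gamma('noise_sigma', alpha=2, beta=0.5)  # reading Gamma(2, 0.5) as alpha, beta

    # observed variable: its parameters are the sums of the two components
    output = pm.Normal('output',
                       mu=muRV + noise_mu,
                       sd=sigmaRV + noise_sigma,
                       observed=samples)

    # default settings, 50K samples
    trace = pm.sample(50000)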

Thanks for any advice you could offer.

It is a bit difficult to say without the full model and data, but at least mu=(muRV + noise_mu), sd=(sigmaRV + noise_sigma) seems a bit unconventional to me. Noise is usually taken to have zero mean. Think about it this way: if the noise has a nonzero mean \mu_n, you can absorb it into the model mean by rewriting \mu_{RV} as \mu_{RV}^\prime = \mu_{RV} + \mu_n, which makes the remaining noise mean \mu_n^\prime = 0.

If, alternatively, you are giving the noise component the same shape as the observations (2500 samples), you have an overspecified model in which all of the uncertainty in your model and data gets explained away, resulting in a tight posterior.

The “noise” here isn’t really noise – it’s intended to be variance that comes from a specific instantiation of a process. That is: the “signal” in the model is Gaussian, and comes from a designed process. The noise is intended to model what happens when that design is implemented in a particular location. So one expects the noise mean to be zero (that’s the mean of its prior), but if there’s something systematic going on at that location, then the mean could be non-zero. So “noise” probably was not the right name for this.

Note that I don’t pass shape arguments for the hyperparameters, so I don’t think I have overspecified the model.

The intention would be to explore replication of this design by comparing how it plays out at a different location, where the “noise” would potentially have a different shape. This gets at a question about whether the design description captures the desired behavior, or if it just happened to work at one location.

I have data for other locations, but I was very surprised that this model converged on very tight distributions, which made me suspect that either (a) I was expressing something different from what I wanted in the model, or (b) I was using the sampler wrong (e.g., maybe the way I’m using it, the hyperparameters get “stuck” at one combination of values that looks good rather than noticing that there are a lot of combinations that look good).

Is there some other inference method I could try to see if NUTS is just getting stuck? Or is there something I should do to ensure that the hyperparameter value space gets explored widely? I’m afraid I don’t really understand the documentation about “Inference” in the PyMC3 docs very well.

If the variables were discrete (I suppose I could try discretizing them), I would try forward rejection sampling…
Thank you.

If you use multiple chains, NUTS is usually quite good at revealing multimodality. I would more strongly suspect that you are expressing something different from what you wanted in the model.
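
For example (a minimal sketch, assuming your model is wrapped in a pm.Model() context named model; the draws/tune numbers are arbitrary):

with model:
    trace = pm.sample(2000, tune=2000, chains=4)

# if the posterior is multimodal, the chains tend to disagree, which shows up
# as large Rhat values in the summary and as separated bands in the traceplot
print(pm.summary(trace))

If the chains agree and Rhat is close to 1, that is at least some evidence against the "stuck sampler" explanation.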

You can generate fake data from the model and compare it with your actual observations as a check. A more formal treatment of this can be found in https://arxiv.org/abs/1804.06788
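
Something like this (a sketch, assuming a reasonably recent PyMC3 and the model/trace from above):

with model:
    # data implied by the priors alone
    prior_pred = pm.sample_prior_predictive(samples=500)
    # data implied by the fitted posterior
    post_pred = pm.sample_posterior_predictive(trace)

# compare, e.g., the spread of the simulated 'output' draws with the observed data
print(prior_pred['output'].std(), post_pred['output'].std(), samples.std())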

Thanks: I’ll try the generation now, and have a look at the paper.
But I’m not sure how to evaluate this (maybe the paper will help): over-tightening the hyper-parameters could easily produce data that looks good. Putting it differently, my problem seems to be that I’m getting better prediction of the data than is justified, not worse.
Thanks,
r

I think I see the problem. Am I calculating not the posterior distribution of the hyperparameters, but instead the posterior over the particular draw from the hyperparameter distributions that generated this set of data?
If so, that would account for the problem. That is, I would not be finding the probability distribution over possible populations, given this population, but only the distribution over the single draw from the hyperparameter distributions that gave rise to this set of data.
In other words, I am trying to find information about a higher-order probability distribution over possible populations, but I’m instead getting only information about this population. Is that right?
Thanks!

Well, you can do a traceplot; that is the most intuitive way to see how much uncertainty there is.
If you are computing the standard deviation of the posterior, then you are quantifying the Monte Carlo error of the posterior expectation. That is not the same as the uncertainty of the posterior itself, which is usually captured by a density interval.
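
Concretely (a minimal sketch, assuming the trace from above; the exact column names depend on the PyMC3 version):

# marginal posteriors and mixing, per chain
pm.traceplot(trace)

# the 'sd' and the hpd / credible-interval columns describe the spread of the
# posterior itself; the 'mc_error' (or 'mcse') column is the Monte Carlo error
# of the estimated posterior mean
print(pm.summary(trace))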

Thanks. Maybe someone could give me a pointer to an example. In most of the examples I have found of Bayesian approaches with hierarchical models, the hyperparameters are used as a way to give priors to distributions of interest (e.g., a normal about which we have some subjective expectations, but no data yet). They are just used as a tool for reasoning about the random variable for which they provide parameters.

For example, in the Rugby example, there are hyperparameters sd_att and sd_def that are used to give parameters to att_star and def_star, which are attacker and defender strength random variables for the teams in the league. But the hyperparameters are not of interest in themselves (as far as I can tell), and nothing is done with them later: the focus is all on the team-specific strength RVs.

In my problem, I want to use the hyperparameters to reason about a case where there are multiple populations for the same underlying process, and I am interested in that underlying process. So I would like an example where the hyperparameters themselves are of interest and get updated and investigated.

I suppose such an example might be framed in terms of populations and sub-populations, instead of hyperparameters.

I do appreciate all the advice: I’ve previously worked only with discrete data, where the Bayesian stance seems much more comprehensible (e.g., the interpretation of Dirichlet priors has a nice intuition behind it), and I’m working hard to catch up here.