Sampler occasionally gets static in pymc3.1, but not in pymc3.0

Dear developers,

I’ve noticed a problem that is most likely sampler related and was introduced with the upgrade to pymc3.1. This happens on sampling a hierarchical (two level) GLM.


I observe that the sampler occasionally gets stuck in parameter space (not: temporal slowdown), exploring only a very restricted fraction of possible values, before returning to the “hairy caterpillar” behavior. This temporary local “attractor” can be within or outside the variance range of the rest of the trace. If it happens, all model parameters are affected together.
In addition, I get an “acceptance probability” warning, even if I choose extensive tuning and a high “target_accept”.

Note that in Version 3.0, there are also occasional “spikes” on the trace. However, the sampler returns to the usual range much more quickly.

Please find here some additional documents:

- traceplot from pymc3 version 3.0:
- traceplot from pymc3 version 3.1:
- console outputs on my machine

(sorry for text links and external storage, this is to circumvent the “two links” restriction for newbies)

Model configuration

My model follows introductory examples from Thomas Wiecki and others (“bayesian-glms-3”, “bayesian-hierchical-non-centered”).
When using an “offset” (non-centered) reparametrization suggested by Thomas Wiecki in the latter notebook, the problem seems to occur less often and only for shorter periods (unquantified). Hence I suspect this might be related to sampler coverage. Also, when I switch to Slice sampler (default is NUTS) this still happens, but less frequently. The problem persists through the very recent switch do a default NUTS initialization via “jitter+adapt_diag”.


The data is limited to four subjects and twenty repeats per condition for each of them; I’ve reduced it to only one test condition and one measured value. Regarding model specification, I’ve tried many variants, but I am happy if anyone can point me at a conceptional bug. My knowledge leaves me when it comes to sampler properties and step choice. The reference model definition in the attached script can also be found below.


If somebody with a deeper knowledge of sampling procedures could take an educated guess on what’s going wrong, that would already help.
I would hope to get help with the model definition to get this working with the little data we have.
Alternatively, maybe someone knows a configuration parameter in “sample” that can restore the old behavior?


I install the different pymc versions as follows:

  • pymc3.1 latest via “git clone”/“git pull” and “pip install .”
  • pymc3.0 via “pip install pymc3==3.0”
    Python is python 3.6; all required packages are up to date.
    My system is Manjaro Linux; Kernel 4.9.39.


Thomas Wiecki was very supportive when I sent him my problem previously, but suggested to bring this here. Any help is much appreciated.

Thanks to everyone in advance!


with PM.Model() as hierarchical_model:
hypermu_intercept = PM.Normal(‘hypermu intercept’, mu=0., sd=100)
hypersigma_intercept = PM.HalfNormal(‘hypersigma intercept’, sd=100)
hypermu_slope = PM.Normal(‘hypermu slope’, mu=0., sd=100)
hypersigma_slope = PM.HalfNormal(‘hypersigma slope’, sd=100)

    residual = PM.HalfCauchy('residual', 5)

    intercept = PM.Normal('intercept', mu=hypermu_intercept, sd=hypersigma_intercept, shape=n_subjects)
    slope = PM.Normal('slope', mu=hypermu_slope, sd=hypersigma_slope, shape=n_subjects)

    estimator =   intercept[subject_idx] \
                + slope[subject_idx] * data.is_altered.values

    likelihood = PM.Normal(  'likelihood' \
                            , mu = estimator \
                            , sd = residual \
                            , observed = data[predicted_variable]\

Can you post the trace plots as well?

This is a very interesting case. Thank you for taking the time to write about this.
I’m not entirely sure yet what is going on here, but this is what I’m currently thinking about.

First, to get people an idea what this looks like:

So far as I can tell the problem is that the posterior really is quite complicated. There isn’t enough information in the dataset to get tight estimates for some of the scale parameters, and not even for the global intercept. This leads to long tails, correlations and some funnel-like behaviour in the posterior. The long tails are what you can see as “spikes” in the trace, funnels and correlation are visible in scatter plots between different variables.
I experimented a bit with different parametrizations, and this seems to work (it doesn’t give me divergences on master). I also renamed your variables to my own convention, that made it easier to work with. I hope it’s not too confusing.

with pm.Model() as hierarchical_model:
    intercept = pm.Normal('intercept', mu=0., sd=5)

    subject_sd_raw = pm.Uniform('subject_sd_raw', upper=np.pi/2)
    subject_sd = 0.2 * tt.tan(subject_sd_raw)
    subject = pm.Normal('subject', mu=0, sd=1, shape=n_subjects)

    altered_mu = pm.Normal('altered_mu', mu=0, sd=1)
    altered_subject_sd_raw = pm.Uniform('altered_subject_sd_raw', upper=np.pi/2)
    altered_subject_sd = tt.tan(altered_subject_sd_raw)

    altered_subject = pm.Normal('altered_subject', mu=0, sd=1, shape=n_subjects)

    is_altered = data.is_altered.values
    estimator = (intercept
                 + subject_sd * subject[subject_idx]
                 + altered_mu * is_altered
                 + altered_subject_sd * altered_subject[subject_idx] * is_altered)

    residual = pm.HalfCauchy('residual', 5)
    nu = pm.Gamma('nu', 2, 0.1)
    likelihood = pm.StudentT('y', nu=nu, mu=estimator,
                             sd=residual, observed=data['measurement1'])

I used a little trick to reparameterize subject_sd, the tan(uniform(0, pi/2)) is half-cauchy distributed.

I sample with

with hierarchical_model:
    trace = pm.sample(2000, tune=1000, njobs=2, nuts_kwargs={'target_accept': 0.95})

This is still far from perfect, eg look at that scatter plot:

plt.scatter(trace['altered_mu'], trace['altered_subject_sd_raw_interval__'], marker='.')

Model fit

Part of the problem here might also be, that maybe the model doesn’t fit the data all too well. Does it really make sense to model the change in the measurement between altered and not altered as a normal? Just from looking at the data it looks like maybe it would make sense to say that the alteration leaves a lot of subjects mostly unchanged, but affects some by a larger amount? If we were to assume that only few subjects are really affected, than we could use a scale-mixture prior like the horseshoe for the effect sizes:

with pm.Model() as hierarchical_model:
    intercept = pm.Normal('intercept', mu=2.5, sd=1)

    #subject_sd = pm.HalfNormal('subject_sd', sd=5)
    subject_sd_raw = pm.Uniform('subject_sd_raw', upper=np.pi/2)
    subject_sd = tt.tan(subject_sd_raw)
    subject = pm.Normal('subject', mu=0, sd=1, shape=n_subjects)

    altered_subject_tau = pm.HalfNormal(
        'altered_subject_tau', sd=0.2)
    #altered_subject_theta = pm.HalfStudentT(
    #    'altered_subject_theta', nu=3, sd=1, shape=n_subjects)
    altered_subject_theta_raw = pm.Uniform(
        'altered_subject_theta_raw', upper=np.pi/2, shape=n_subjects)
    altered_subject_theta = tt.tan(altered_subject_theta_raw)
    altered_subject_eta = pm.Normal(
        'altered_subject_eta', mu=0, sd=1, shape=n_subjects)
    altered_subject = (altered_subject_tau
                       * altered_subject_theta
                       * altered_subject_eta)
    altered_subject = pm.Deterministic('altered_subject', altered_subject)

    is_altered = data.is_altered.values
    estimator = (intercept
                 + subject_sd * subject[subject_idx]
                 + altered_subject[subject_idx] * is_altered)

    residual = pm.HalfCauchy('residual', 5)
    likelihood = pm.StudentT('y', nu=7, mu=estimator, sd=residual, observed=data['measurement1'])

This leads to effect size estimates like this:

I guess it is pretty hard to say something meaningful about the distribution of subject_altered with just 4 samples. Some domain specific knowledge might help though. :slight_smile:

PyMC3 versions

About the changes between the pymc3 versions: At the moment I’m pretty much at a loss as to why it would seem to work with 3.0. My best guess so far is that it actually doesn’t, there just weren’t the diagnostics around back then to tell you. But really, I’ll have look into that.

Model checking

You should also compare the dataset with predictive posterior samples, and use that to check the likelihood. The pick nu=7 was pretty much blind, and might not make much sense. From the plot, maybe nu should be a bit smaller?

All together this looks like a perfect example for Riemannian Hamiltonian methods (small number of parameters, really complicated posterior). If we ever get that to work properly, is the dataset public? It might make a nice case study.

1 Like

Ah, something I forgot: It might also help if you can provide informative priors for some of the parameters (I think especially subject_sd, and intercept?). I guess that could only come from more knowledge about what this is actually about :slight_smile:

Dear Thomas,

I put the traceplots to my own webspace (linked in a code block in the original post), being restricted to two links or one image by the forum regulations.
Here you go:
PyMC 3.0:

PyMC 3.1:


One possible is that in v3.0, after the ADVI init the mass matrix is fixed and overestimated. So it just taking large step size and completely ignore the funnel. And on master currently we have adapt_diag, which sometimes (depending on the starting point) might adapt to a very narrow local geometry.

Dear Adrian,

thanks for taking the time. There’s a lot to wrap my head around, your insight to the data is amazing and very accurate.
We’re currently striving to publish this, and even decided to put the focus of the paper on the hierarchical modeling. I intend to submit supplementary code and full data. I pruned the data for now, but once we’re under review, I’m happy to provide a full notebook with all detail.

Some background: what we’re doing is locomotor analysis of animals, tracking points and doing kind of “motion capturing”. The example measure I sent was “speed” (measured at the nose) while the animal is locomoting in a fast gait. This is typically a type of study with limited data and especially few subjects, because although our experiments are non-invasive and conducted under very natural conditions, this is highly regulated by authorities (which is good). We cannot get more individuals at the moment. We could analyze more runs (data points per subject), but decided against it for another reason (exclude sample size bias in comparison with commonly applied frequentist’s statistics); the twenty runs were chosen at random. The “condition” is an alteration of the extent of a plexiglas enclosure.

Now because the animal we look at is a locomotor specialist (which is usually the case), it is hard to put in a prior. Of course we have expectations, but you never know if any effect will surpass intra- and interindividual variability. Also, it is quite typical that there are individual outliers, which I hoped to model appropriately: I guess it is reasonable to expect long tails, but I have no a priori knowledge that may exclude the Normal assumption. I would definitely want to avoid loading the model with any “belief”.

Do you have any suggestions on how to argue for reasonable, more specific prior assumptions in such cases?

Thanks again for now. I’ll read up on everything you wrote and get back with questions :wink:


Dear Adrian,

I got your model to work easily. However, the problem is not gone: it’s maybe just happening more rarely. See this trace (at ca. 4500):

The code is below, I only loosened prior constraints.

If what Junpeng suggests is true, and it sounds plausible, then the reason that your reparametrization did not show any flat traces in the first place was maybe the narrow prior range. I changed this moderately (sd=10), because I think it is good to load as few assumptions as possible.

further questions

I did not understand the factor of 0.2 in your first model.

Could you please explain where it comes from?

Did I get it right, that your suggestion did not account for a potentially different standard deviation of the subjects? With the global intercept and the “subject_sd * subject”, only an individual intercept is incorporated, but no individual variation. This might be reasonable, yet I’m trying to understand if it might be another involuntary change that made sampling “easier”.

Finally, your “scale-mixture prior” could be a good alternative. In fact, our hypothesis is that the “condition” does nothing and that the outlier is a “weirdo”. However, I do not get the meaning of tau/teta/eta. Do you have any reference or example where I could read this up?



Here’s how I changed the priors (TT = theano.tensor, NP = numpy, PM = pymc3.1 master):

    # Model
    with PM.Model() as hierarchical_model:
        intercept = PM.Normal('intercept', mu=0., sd=10)

        subject_sd_raw = PM.Uniform('subject_sd_raw', lower = 0, upper=NP.pi/2)
        subject_sd = 0.2*TT.tan(subject_sd_raw)
        subject = PM.Normal('subject', mu=0, sd=10, shape=n_subjects)

        altered_mu = PM.Normal('altered_mu', mu=0, sd=10)
        altered_subject_sd_raw = PM.Uniform('altered_subject_sd_raw', lower = 0, upper=NP.pi/2)
        altered_subject_sd = TT.tan(altered_subject_sd_raw)

        altered_subject = PM.Normal('altered_subject', mu=0, sd=10, shape=n_subjects)

        is_altered = data['is_altered'].values
        estimator = (intercept
                     + subject_sd * subject[subject_idx]
                     + altered_mu * is_altered
                     + altered_subject_sd * altered_subject[subject_idx] * is_altered)

        residual = PM.HalfCauchy('residual', 5)
        nu = PM.Gamma('nu', 2, 0.1)
        likelihood = PM.StudentT('y', nu=nu, mu=estimator,
                                 sd=residual, observed=data[predicted_variable])
        trace = PM.sample( \
                          draws = 10000 \
                        , tune = 1000 \
                        , njobs = 4 \
                        , nuts_kwargs={'target_accept': 0.95} \

Just a quick post, I’ll have another look tomorrow:

  • Paper about horseshoe:
    it also mentions a couple of similar priors with more or less shrinkage. The tau, theta and eta vars build a horseshoe prior together.
  • The 0.2 in subject_sd scales the HalfCauchy. This should be equivalent to using pm.HalfCauchy('subject_sd', beta=0.2).
  • subject_sd is the same as your hypersigma_intercept. intercept is the same as your hypermu_intercept, and subject is your intercept (in noncentered parametrization). I think that is the same as in your version, but maybe I’m missing something.

About the traces: Yes, that looks like trouble again. You should also get warnings about divergences. In a proper sampler run you don’t want any of those. Even a single divergence is a clear sign of trouble. You could try to increase target_accept to maybe 0.99 and see if that helps.

FYI, if you are doing hierarchical linear model or linear mixed model, bambi is a really good library build on top of PyMC3 (and PyStan). It’s similar to brm in R. In my experience they reparameterize the model really well.

Thanks, I’ve seen bambi before and might try it this time.
I personally do not like the terminology/notation of “fixed” and “random” effects and the way R treats this. I see that bambi mimics model creation in R, and that is good if you come from an R background. Yet I prefer python :wink:
What’s more, this way of initialization is like a black box to me. I prefer to specify priors explicitly, understanding what happens behind the curtains and learning something on the way.

1 Like

Dear Adrian,
I’ve also tested the “horseshoe”-Variant now. So far I do not see any signs of divergence (edit: still true after lot of sampling), so maybe it’s the way to go.
EDIT: I missed on first glance that there seems to be no model parameter measuring the change due to altered condition on population level. It only returns “altered_subject”, but no “altered”. Hence the model seems to be hierarchical with regard to intercept, but not slope. Did I miss something?

EDIT: the following only applies for some of the model parameters. distributions are not wide in general. However, the general question remains.

However, since I increased prior SD again, the posteriors are very wide. This might have to do with the limited size of my data set.

General question:
Is there any way to estimate if my data is actually sufficient for a certain model complexity? Something in line with the thinking of Boettiger et al. (2012)? I often have the feeling that, though Hierarchical models are an excellent tool, they might be more demanding than plain hypothesis testing in terms of sample size on each level. This is rarely discussed and it would be great to quantify.

Regarding Junpeng’s suggestion on the difference between v3.0 and v3.1: is there currently a way in master to ignore the funnel? I see the general advantages of accounting for the funnel, yet in my case with outlier components in the data probability density, I think it’s reasonable to base step size only on the width of the major mass component.



Boettiger, C., Coop, G., and Ralph, P. (2012). Is your phylogeny informative? measuring the power of comparative methods. Evolution, 66(7):2240–2251.

Dear all,

I managed to solve this problem (code below for reference). Yet some questions remain.
It turns out that the divergent periods disappear if I remove the hyperprior for standard deviation (hypersigma_slope in my original model). The working model now still uses non-centered reparametrization and has moderately wide prior distributions. I still get spikes, yet they are not persistent.

I guess the data sample size is just at the edge of being insufficient for the model. Feel free to close the thread, but see below.

Sorry for the lack of structure in my previous post. There are two general questions about which I would appreciate a brief discussion:
(1) Is it possible in pymc3.1 master, via a sampler parameter or general setting, to fix step size (mimic v3.0 behavior), ignoring the funnel at own risk? (assuming we currently lack a more sophisticated way to avoid that the sampler gets trapped in some weird local geometry)
(2) Is there a general method to evaluate the relation of data coverage and model complexity? Can we test if data is sufficient to distinguish between two models? (see Boettiger paper from my previous comment)

Thanks for everyone’s input. I learned a lot!



with PM.Model() as hierarchical_model:
    ### population intercept
    intercept_population = PM.Normal('intercept_population', mu = 0., sd = 10)
    sd_population = PM.HalfCauchy('sd_population', beta = 5)

    ### individual means
    intercept_subject = PM.Normal('intercept_subject', mu = intercept_population, sd = sd_population, shape = n_subjects)

    ### priors for "altered"
    # population level
    slope_altered_population = PM.Normal('slope_altered_population', mu = 0., sd = 10)

    # individual level
    offset_altered_subject = PM.Normal('offset_altered_subject', mu = slope_altered_population, sd = 10, shape = n_subjects)
    sd_altered_subject = PM.HalfCauchy('sd_altered_subject', beta = 5)

    slope_altered_subject = PM.Deterministic('slope_altered_subject', offset_altered_subject * sd_altered_subject)

    ## estimator
    is_altered = data['is_altered'].values
    estimator = intercept_subject[subject_idx] \
              + slope_altered_subject[subject_idx] * is_altered \

    residual = PM.HalfCauchy('residual', 5)
    dof = PM.Gamma('dof', 2, 0.1)
    likelihood = PM.StudentT( 'likelihood' \
                            , nu = dof \
                            , mu = estimator \
                            , sd = residual \
                            , observed = data[predicted_variable] \

    return hierarchical_model

You can set the init='advi' in pm.sample()

In theory, there should be no data coverage problem - you are sampling from the posterior distribution (conditioned on your model and data). If you have too few data your prior just kind of take over and there are lots of uncertainty in your estimation. What the problem is more if you have complex geometry and the sampler cannot effectively take samples in a finite time.

Just a couple more comments:

  • I kind of messed up with the horseshoe prior. Thinking about this a bit more (and looking at random draws from a horseshoe) I don’t think anymore that this is a good idea. The horseshoe really is very sparse. Instead, maybe a student-t or laplace makes more sense. But it doesn’t actually seem to make much difference in the estimates.
  • Reasonable people might disagree about this, but I don’t think that wide priors are in a sense a “safe option” that you can use to avoid making assumptions. They are assumptions too, and often very unrealistic assumptions. A wide prior on a scale parameter for example also says that you think that the scale is probably large, after all, most mass is far from 0. In my opinion it is usually better to fit models with different reasonable priors and compare how different choices affect the posterior. As an example of how you could do that I wrote a little notebook. I’m not sure how useful the interpretation of the prior as “belief” really is, I usually tend to think about priors more as part of the model – a way to add regularization. I guess you could also think about them in terms of sampling distributions of parameters across different experiments. But just from personal experience there is a cost to trying to be “objective” and use wide priors, this often has unexpected and nasty consequences.
  • You seem to assume that you need to pay for more complicated models by needing more data. I don’t see why this would be the case. Overfitting doesn’t occur the same way as it does in many classical approaches. You can still get something similar to overfitting if you use bad priors (often priors that are to wide and don’t provide regularization :slightly_smiling_face:) or if your choice of model depends on the observed dataset (which is usually the case to be fair). If you were to fit a simple ols model like y ~ 1 + subject + altered to the dataset I would bet it would tell you that altered > 0 with p << 0.05. (I haven’t actually tried…). My interpretation of the posterior of the hierarchical model would be: There is definitely an effect on some of the subjects, but it is not entirely clear how the population mean is affected – although it looks like the population mean might be increased slightly. The second seems to me a more accurate way to interpret that dataset, it doesn’t somehow “require more data”.
  • About the problem of finding good samples sizes and experimental designs: I don’t know of a good solution to that problem. I usually do a rough frequentist power analysis and then simulate a couple of datasets, and look at the posteriors to better understand what might or might not work. This is hardly satisfying, and so far it never worked out the way I expected. Some part of the model I wrote down beforehand always turned out to be nonsense, or my guesses for some parameters turned out to be far from ideal. The paper you mentioned sounds interesting, I’ll have a look at that when I have the time. :slightly_smiling_face:
1 Like

Dear Jungpen,

I had tried this before - still the outcome differed in v3.0 and v3.1. I also tried to init nuts with “nuts” and all the other options. Adaptive step size seems to be independent of initialization.
Thanks for clarifying the geometry issue again.

Dear Adrian,
your prior philosophy sounds very reasonable, iff one really conducts the prior sensitivity analysis you suggest and demonstrate in your notebook. However, it is a modeling-perspective that is fundamentally different from what “average” people are used to who were primed on frequentist statistics. What I (and reviewers potentially) want to see as a posterior is in fact only the data distribution. Hence, I initially turned to the “weakly informative prior” idea: specifying a very unspecific distribution, possibly weighting the prior low and practically only setting the “family”, ie. expected shape.

I’m still relatively inexperienced with pymc, but is there a way to weigh the prior distribution? Is that even necessary with mcmc sampling? (the “nu” parameter in R’s MCMCglmm package implies it is, yet that toolbox is completely weird and uninstructive in its prior definition and I would not trust it).

Both prior sensitivity analysis and weakly informative priors are valid and complementary approaches. I think the first is less common because, I have the impression, it is less well documented and understood. Maybe you can provide some tutorial on it? I did not search if there is one, but your workbook seems an excellent basis!

The discussion also has to do with human perception bias and good scientific philosophy. Take my four example animals: we observed that, at least in some cases, individuals would obviously run faster. If we had fed this expectancy to the model, we might have influenced the outcome. It turns that non-hierarchical approaches (we actually tested your OLS suggestion before) confirm this expectation, which I would now judge as a false positive. Those models are just wrong. The character we observe (locomotor speed) is inherently hierarchical: Individuals’ locomotor capacities are most appropriately described as a sample from a population distribution of capacities; whereas the measured speed on each run is then drawn from an individual distribution in a replicative manner. Taking the appropriate hierarchical model of reality allows you to visualize the probabilities on all levels (cf. Iknayan 2014), which is an enormous leap forward compared to box plots and hypothesis tests. Considering what Thomas described here (robust regression), bayesian models which are hierarchical and have an accurate “fat tail” prior are just excellent for many applications in Biology (the “science of exceptions”).



Iknayan, K. J., Tingley, M. W., Furnas, B. J., and Beissinger, S. R. (2014). Detecting diversity: emerging methods to estimate species diversity. Trends in ecology & evolution, 29(2):97–106.

You are right - the tuning of NUTS is also recently changed so it might give different behaviour compared to 3.0. If you are sure you can set the tune=0 with init='advi', then you will be using the mass matrix estimated from advi for NUTS (warning: ADVI is known to underestimated variance for parameters, and in many cases the wrong estimation for parameters).

As for the prior choice, I agree with you that in many application (e.g., Psychology or behavioural science) where your design matrix is experiment manipulation, and you don’t want to do feature selection, a weakly informed prior is usually the way to go. In this case, you are essentially saying “all the parameters are of interested and I want to put more weight on the data as much as possible”.
I am not sure what you mean about “weight the prior distribution”, but if you meant having different prior distribution and weighted them afterward, you can try model averaging in PyMC3 (fitting different model first, and then do prediction using a weighted version of all the fitted model).

Dear Junpeng,

in R’s “MCMCgllm” toolbox there is one argument “degree of belief parameter (nu)” in some prior configurations.
Taken from one of the tutorials in this series, I had remembered that (at least in analytical Bayesian calculations) the posterior is kind of a “mixture” (plain average or with mixing ratios) of prior and data. A higher amount of data pulls the posterior more towards it, but this could also be achieved by weighting the prior down (low “degree of belief” in my prior).

For example, I could just duplicate my data set (count each observation twice) and should get narrower posteriors. Right?


Hmm it looks like the degree of belief nu is just the degrees of freedom for the inverse-Wishart prior for covariance matrix. I have never use MCMCglmm so I could be wrong.

I dont think it is a good idea. In fact if it is done right you are just updating with the same data, which would give you the same posterior: P(A|B,B) = P(A|B)

Dear all,
sorry for reviving this old issue**.
I just wanted to let you know that the data and the model have finally made it to publication and can be found here:
(@aseyboldt asked for it above)
Thank you again for all the helpful comments!


** even worse: under a new account, because my previous one was linked to github and I deleted that recently.