Divergences with unused variables

Newbie question here… Is it normal to get divergences even when just sampling from two distributions? What does it mean?

Here is the code I used:

import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=20)
    sigma = pm.HalfNormal("sigma", sigma=20)

    idata = pm.sample()

and this is the output I get (sampler warnings reporting divergences)

Am I doing something wrong, or is this to be expected? Even after raising target_accept to 0.95 I still get at least one divergence.

I already checked this question, but the answer was to use the problematic distribution as a prior (rather than leaving it “unused”, which is confusing to me: I don’t see why we shouldn’t be able to sample from a “free” distribution) and to set the mu and sigma parameters to push it further from zero. Does that mean that every time I have high probability mass on one value, e.g. 0, I will get divergences?

Thanks in advance for any comment!

See these comments by @aseyboldt. Priors aren’t necessarily easier to sample than posteriors via MCMC.

This example is quite interesting though. I think it’s just the half-normal; the fact that there are two independent distributions is (I think) not relevant. I guess the boundary of the half-normal isn’t easy for NUTS? This model samples fine:

import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=20)
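    # sample an unconstrained Normal and fold it onto the positive half-line
    # with abs(), so NUTS never has to approach a hard boundary at zero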
    signed_sigma = pm.Normal("signed_sigma", sigma=20)
    sigma = pm.Deterministic('sigma', pm.math.abs(signed_sigma))

    idata = pm.sample()

Yeah, the problem is the mode/large mass at the edge of the support space of the HalfNormal.

Thank you very much for your answers @ricardoV94 and @jessegrabowski, they make perfect sense. But I wonder: having a mode at the boundary (zero in this case) for a prior/posterior distribution should be a pretty common case. Take for example the non-centered 8-schools formulation, where tau is sampled from a HalfCauchy distribution:

and these are the posteriors of the parameters (I guess you can ignore variables other than tau in the picture)

Shouldn’t I end up with divergences then in this case (and all other similar cases)?

Looks like you did end up with divergences. Even a single divergence is indicative of problems. Sometimes it may look fine, but if you sample for longer, run more chains, or use a different random seed, they show up.

In most cases those kinds of priors are fine as long as the posterior doesn’t have a lot of mass at the boundary, since what’s being sampled is the posterior, not the prior.
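For instance, here is a minimal sketch of that point, using made-up data whose true scale is well away from zero; once there is an observed variable, the same HalfNormal prior on sigma typically samples cleanly because the posterior no longer piles mass at the boundary:

import numpy as np
import pymc as pm

# made-up data with a true scale well away from zero (illustration only)
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=3.0, size=200)

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=20)
    sigma = pm.HalfNormal("sigma", sigma=20)  # same prior as before
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)

    # the posterior of sigma now concentrates near 3, far from the boundary
    idata = pm.sample()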

OK, I see. Then I guess one can try HalfCauchy/HalfNormal priors and, if divergences happen because the posterior has a mode at the boundary, start looking into boundary-avoiding priors. Thanks to your comment, I also noticed that both the Stan and PyMC documentation on the non-centered formulation keep getting “false positive” divergences. I guess something else I could do is check whether the same applies in my case.

Let me know if I am overthinking/overcomplicating this, but it starts to make sense. Thanks again for your comments!


I think you’re on the right track.

However, what do you mean by false positive divergences?

I’m not a stats expert by any means, but I always thought the half-cauchy priors on variance were silly and overkill. Same goes for exponential. Have you ever actually drawn samples from these? They’re nuts.

I’ve been defaulting to Gamma(2, x) for strictly positive variables for a while now, and I haven’t felt the need to look back. It avoids these problems by explicitly ruling out zero. Sometimes when that’s not possible, or when I know ahead of time the variance is going to be close to zero, I’ve had good success switching to modeling the precision instead of the variance. I don’t know exactly what PyMC does under the hood to convert between parameterizations, but the sampler seems to be happier going after big numbers in the tail and then flipping them.
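As a minimal sketch of the two alternatives above (the Gamma hyperparameters and the data here are made up purely for illustration):

import numpy as np
import pymc as pm

data = np.random.default_rng(0).normal(size=100)

# Option 1: Gamma(2, x) prior on the scale, which has zero density at 0
with pm.Model() as scale_model:
    mu = pm.Normal("mu", mu=0, sigma=20)
    sigma = pm.Gamma("sigma", alpha=2, beta=1)  # beta=1 is an arbitrary rate
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    idata_scale = pm.sample()

# Option 2: put the prior on the precision instead; pm.Normal accepts tau directly
with pm.Model() as precision_model:
    mu = pm.Normal("mu", mu=0, sigma=20)
    tau = pm.Gamma("tau", alpha=2, beta=1)
    pm.Normal("obs", mu=mu, tau=tau, observed=data)
    idata_precision = pm.sample()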


I have also switched to this. The idea of nearly-zero scale parameters doesn’t seem sensible as a general rule.


From here

As shown above, the effective sample size per iteration has drastically improved, and the trace plots no longer show any “stickiness”. However, we do still see the rare divergence. These infrequent divergences do not seem to concentrate anywhere in parameter space, which is indicative of the divergences being false positives.

and similarly here :

These infrequent divergences do not seem to concentrate anywhere in parameter space, which is indicative of the divergences being false positives. As expected of false positives, we can remove the divergences entirely by decreasing the step size.
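In PyMC, “decreasing the step size” in practice usually means raising target_accept when calling pm.sample; the 0.99 below is just an illustrative value, not a recommendation:

import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0, sigma=20)
    sigma = pm.HalfNormal("sigma", sigma=20)

    # a higher target_accept makes NUTS take smaller steps,
    # which can remove false-positive divergences
    idata = pm.sample(target_accept=0.99)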


Yes, I tried sampling from the half-Cauchy and I was surprised as well! :sweat_smile: Thanks a lot for the comment, super interesting!

We should probably avoid such language in the example notebooks (divergences always indicate that the sampler is having issues, but sometimes you might care more or less about these issues). Just the sort of thing we hope to work on at this week’s docathon! Feel free to join in if you’re interested/available.
