Good results but many divergent samples: what, me worry?

My model is giving lots of divergent samples: about 60% of all samples diverge. And yet the resulting distributions for the target variables are quite accurate; even with only 600 samples, the standard deviations are quite tight and the true values are well within 1 sigma everywhere. And everything else looks good: fuzzy caterpillars, good exploration of the parameter space, etc.

@_eigenfoo says “if you have even one divergent chain, you should be worried”. Given the accurate results—and the fact that everything else looks good—should I worry?

Some details: hierarchical model with lots of beta distributions. NUTS. target_accept=0.95


That is definitely a cause for worry, and quite unusual as well.
It's hard to tell from a distance, without the model or data, what the problem might be. Do you have anything in your model that might not be differentiable (e.g. a changepoint)?
You can have a look at a parallel plot to try to find regions of the posterior where the divergences happen:


Hi David,
In addition to Adrian's advice, you could also try running your chains for more draws. Maybe then they will diverge even more clearly, and the results won't look as good as they do now. That would tell you more about the regions and parameters where the divergences occur.
Hope this helps :vulcan_salute:


Everything is differentiable. (Famous last words.)

az.plot_parallel() is not so useful in this situation, as the many divergent traces hide the non-divergent ones behind them. For example:

I tried finding the RVs with the biggest difference between divergent and non-divergent samples, normalized by the mean and standard deviation of the non-divergent samples. Here is a scatterplot of two RVs with large Z-score differences: 0.6 for winnable_p and 0.43 for naive_perceived.

The grey points are the non-divergent samples and the green are the divergent. The greens tend to be northwest of the greys, but only by a bit.

The right graphic shows the divergent leapfrogs, in the style of Diagnosing Biased Inference with Divergences. (Only the first 198 divergent leaps are shown, as only the first 198 are stored in _chain_warnings.)

There are leaps northwest, southwest, southeast and northeast. No obvious pattern.
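The screen described above boils down to something like the following (a simplified sketch, not my exact code; `divergence_z_diff` is a made-up name, and the real version loops over every RV in the trace):

```python
import numpy as np

def divergence_z_diff(samples, diverging):
    """Difference between the divergent and non-divergent sample means for
    one RV, normalized by the non-divergent standard deviation: a Z-score-like
    measure of how far the divergent samples sit from the rest."""
    non_div = samples[~diverging]
    div = samples[diverging]
    return (div.mean() - non_div.mean()) / non_div.std()
```

Ranking the RVs by the absolute value of this score points at the candidates worth plotting.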


Thanks Alex.

Sampling with 4 chains, 4000 draws, and 1000 tuning steps (instead of 2/600/500) improved the results a bit, but still produced many divergences.

Many of the RVs are beta distributions, which are of course defined on [0, 1]. It looks like most of the divergent leapfrogs are leaping to either 0 or 1. For example, here are two RVs with a lot of red dots on either the west border or the north border.

Could a leap to an extreme point in the beta range cause a divergence?
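To see what a leap onto the boundary does, here is a hand-rolled Beta log-density (illustrative alpha/beta values, written out with lgamma rather than any PyMC machinery):

```python
import math

def beta_logpdf(x, a, b):
    """log of the Beta(a, b) density at x, with log B(a, b) via lgamma."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x) - log_norm

beta_logpdf(0.5, 3.06, 0.25)   # finite in the interior

# At x = 1.0 the log(1 - x) term is log(0): the log-probability is undefined,
# the Hamiltonian energy is non-finite, and the sampler flags a divergence.
try:
    beta_logpdf(1.0, 3.06, 0.25)
except ValueError:
    pass  # math domain error from log(0.0)
```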


Something related to that could be the problem.
Can you share the model?
I’d try to look at the transformed variables in this case, e.g. winnable_pursued_prop_interval__ or something like that.


Following up on a suggestion you made elsewhere @aseyboldt, I took a look at the divergence warnings, and in particular at whether the logodds of the beta RVs were overflowing or underflowing.

Here are the beta values and corresponding logodds of the minimum value generated for any beta RV in the first ten divergences:

And here are the beta values and corresponding logodds of the maximum value generated for any beta RV in the same first ten divergences:

The maxes look suspiciously similar, almost identical. Is it really possible that they would all be almost the same? (I think these maxes are not all from the same RV, but I have not yet checked that suspicion.)

In case it is useful to someone in the future spelunking the warnings, here is my code to find the mins and maxes:
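Roughly, it boils down to the following sketch, assuming each divergence point is available as a plain dict mapping RV names to values (a hypothetical structure standing in for however the sampler warnings actually store the points):

```python
import math

def logodds(p):
    """logit of p, assuming 0 < p < 1."""
    return math.log(p) - math.log1p(-p)

def extreme_beta_values(divergence_points, beta_rv_names):
    """For each divergence point, the min and max value taken by any of the
    named beta RVs, together with the corresponding logodds.

    `divergence_points` is a list of dicts mapping RV name -> value; this is
    an illustrative stand-in for the real warning structure."""
    rows = []
    for point in divergence_points:
        vals = [point[name] for name in beta_rv_names if name in point]
        lo, hi = min(vals), max(vals)
        rows.append((lo, logodds(lo), hi, logodds(hi)))
    return rows
```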


That looks like the right place to look.
At what point the beta dist logp starts overflowing also depends on the alpha and beta parameters. You can try with something like

pm.Beta.dist(alpha=??, beta=??).logp(value).eval()

If that in fact turns out to be the problem, I think you’ll have to rethink your model a bit. First: are probabilities this large (or small) really realistic in your context? E.g. 1e-84 is not a probability I would expect to see outside of combinatorics problems.
So why does your model seem to think that that can happen?
Instead of using beta distributions, you could for example model the logit prob directly, and use normal distributions on the logit scale.
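As a plain-numpy sketch of that reparameterization (illustrative mu and sigma; in a real model the normal draw would be a prior and the sigmoid a deterministic transform):

```python
import numpy as np

rng = np.random.default_rng(0)

# Put the distribution on the logit scale and map it back through the
# inverse logit, instead of sampling p from a Beta directly.
logit_p = rng.normal(loc=0.0, scale=1.5, size=10_000)
p = 1.0 / (1.0 + np.exp(-logit_p))

# For any logit value a Normal(0, 1.5) will realistically produce, p stays
# strictly inside (0, 1), so log(p) and log1p(-p) remain finite.
```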


1e-84: point taken.

Interestingly, the problem is not with the minimums. Those absurd minimums were all from a single RV. I reparameterized as you suggested, albeit in a different way than you proposed. The offending beta was eliminated, resulting in far more reasonable minimums:

But the new model still diverges for 60+% of the samples.

I am now pursuing the unreasonable maximums, all very close to 1.0, and suspiciously similar in value: all with logodds greater than 18.85 and less than 19.0.

The problem is in fact with the maximums. I found the biggest offender, the RV that most frequently had a logodds close to 19.0, and reparameterized it as the probability of the event not occurring instead of the probability of it occurring. As I understand it, that changes a logodds of, say, 18.9 to a logodds of -18.9.
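The sign flip is just the identity logit(1 - p) = -logit(p), e.g.:

```python
import math

def logit(p):
    """logodds of p, assuming 0 < p < 1."""
    return math.log(p) - math.log1p(-p)

p = 0.9999999938          # logodds ~ 18.9, right in the danger zone
assert logit(p) > 18.0
# Modeling the complementary event negates the logodds:
assert math.isclose(logit(1.0 - p), -logit(p), rel_tol=1e-9)
```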

The workaround was successful: the divergences are reduced by an order of magnitude, and some of those that remain involve other beta RVs that see logodds close to 19.0.

But this workaround is deeply unsatisfying. Why does a beta RV with alpha=3.06 and beta=0.25 (e.g.) sometimes generate a divergent sample, while a beta RV with alpha=0.25 and beta=3.06 never diverges? Why does a logodds of 19.0 cause a divergence while a logodds of -19.0 never does?

Is there a bug in pm.Beta() that is causing this behavior? Or perhaps I am misunderstanding something?

I isolated the problem and submitted an issue.


This does seem to be a theano issue at the root:
The sigmoid implementation is a bit strange:

            return """%(z)s = %(x)s < -709.0 ? 0.0 : %(x)s > 19.0 ? 1.0 : 1.0 /(1.0+exp(-%(x)s));""" % locals()

Why 19? That expression evaluates to something different from 1 until x ≈ 36.7.
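The saturation point is easy to check in plain float64:

```python
import math

# exp(-19) is about 5.6e-9, so sigmoid(19) is still clearly below 1:
assert 1.0 / (1.0 + math.exp(-19.0)) < 1.0

# The result only rounds to exactly 1.0 once exp(-x) falls below the float64
# rounding threshold near 2**-53, i.e. around x = 53 * ln 2, about 36.7:
assert 1.0 / (1.0 + math.exp(-36.0)) < 1.0
assert 1.0 / (1.0 + math.exp(-37.0)) == 1.0
```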
And then, on top of that, the stability-enhancing rewrite doesn't seem to fire in the case of log1p(-sigmoid(x)):

import numpy as np
import theano
import theano.tensor as tt

x_ = tt.dscalar('x')
x_.tag.test_value = np.array(1., dtype='d')
f = theano.function([x_], tt.log(tt.nnet.sigmoid(x_)))
g = theano.function([x_], tt.log1p(-tt.nnet.sigmoid(x_)))

theano.printing.debugprint(f)
# Elemwise{Composite{(-scalar_softplus((-i0)))}} [id A] ''   0
#  |x [id B]
theano.printing.debugprint(g)
# Elemwise{Composite{log1p((-scalar_sigmoid(i0)))}} [id A] ''   0
#  |x [id B]

Even though both supposedly should be using softplus:

log1p_neg_sigmoid = gof.PatternSub(
    (tensor.log1p,
     (tensor.neg, (sigmoid, 'x'))),
    (tensor.neg, (softplus, 'x')),
    values_eq_approx=values_eq_approx_remove_inf,
    allow_multiple_clients=True)
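A quick numpy check of why that missing rewrite matters once the sigmoid saturates (using np.logaddexp(0, x) as softplus):

```python
import numpy as np

x = 40.0
sig = 1.0 / (1.0 + np.exp(-x))      # rounds to exactly 1.0 in float64

with np.errstate(divide="ignore"):
    naive = np.log1p(-sig)          # log1p(-1.0) -> -inf

# What the PatternSub is supposed to produce: log1p(-sigmoid(x)) == -softplus(x)
stable = -np.logaddexp(0.0, x)      # stays finite, about -40.0
```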

On top of that, the pymc impl of invlogit doesn’t seem that great either:

def invlogit(x, eps=sys.float_info.epsilon):
    """The inverse of the logit function, 1 / (1 + exp(-x))."""
    return (1.0 - 2.0 * eps) / (1.0 + tt.exp(-x)) + eps

Theano has its own ideas about what to do with this:

h = theano.function([x_], pm.math.invlogit(x_))
theano.printing.debugprint(h)
# Elemwise{Composite{(i0 + (scalar_sigmoid(i1) * i2))}} [id A] ''   0
#  |TensorConstant{2.22044604..250313e-16} [id B]
#  |x [id C]
#  |TensorConstant{0.9999999999999996} [id D]

We should fix the theano sigmoid and use that in invlogit directly I think…


Aha! 19.0 rears its ugly head.


Fixed in Theano-PyMC.


Thanks for the fix @bridgeland :champagne: :tada: Perfect illustration of the spirit of open-source :wink: