Long chain diverges right at the transition between warmup and sampling

Hi,

I am puzzled by how a relatively long chain (4000 warmup + 4000 samples) diverges right after warmup. I am not talking about the formal divergences (there are many, around 500, so I just masked them here), but you can see from the trace that, while relative (at least visual) stability sets in after 100 iterations or so, the transition from warmup to sampling brings a shock that makes the chain fail catastrophically:


I know I have to reparameterize that model, or at least increase the acceptance target to something closer to 1, but I would nevertheless like to understand this phenomenon!
I suppose that what's happening is that the constantly adapting step size during warmup prevents a complete breakdown, which then becomes a runaway failure once the step size is fixed. Would it be OK (though far from ideal) to just use the warmup draws (throwing away the first ~100) with the adaptive step size, or does a variable step size introduce a fundamental bias?

Thanks for shedding some light on this!

I don’t think that things can “fail” during tuning. The sampler is exploring, and as long as nothing catastrophic happens (e.g., wandering into portions of the parameter space that yield invalid log posteriors), it happily continues to explore. But once it starts sampling, it needs to nail down the sampling parameters, and if tuning has not gone well, things will go poorly. You can see that tuning is not going well for 3 of your 4 parameters because, despite exploring a range of values, the chains are not well mixed, even during tuning. Once the sampling parameters are fixed, the variability in the sampled values decreases, and you can more easily see that the chains are sitting in different parts of the parameter space.

That’s just my intuition. Someone else might have a more technical take on what you’re seeing.
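
If you want to see this concretely, one option (a minimal sketch, assuming a standard PyMC model sampled with NUTS; `model` stands in for your actual model) is to keep the warmup draws and look at the step-size trace:

```python
import arviz as az
import pymc as pm

with model:  # `model` is a stand-in for your actual pm.Model
    # Keep the warmup draws so the adaptation itself can be inspected
    # (they still shouldn't be used for inference)
    idata = pm.sample(draws=4000, tune=4000, discard_tuned_samples=False)

# NUTS records the (adapting) step size at each iteration; if it hasn't
# flattened out by the end of warmup, adaptation never settled
az.plot_trace(idata.warmup_sample_stats, var_names=["step_size"])
```

If the step size is still bouncing around at iteration 4000, that would fit your intuition that adaptation was papering over the problem rather than solving it.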

Thanks @cluhmann for sharing your insights. I find it hard to comprehend from an algorithmic point of view, though, because I thought tuning and sampling were doing nearly the same thing, the only difference being the adaptive step size. But apparently that’s what makes all the difference between exploring and getting stuck. I didn’t mention in my original question that the curves are not different chains but different coordinates being sampled simultaneously. Unfortunately, something is preventing me from running multiple chains (maybe the custom step for one of the parameters). I’m still curious: were I able to run multiple chains, and were they well mixed, could I use the well-behaved part of the warmup chain, or does the varying step size introduce some bias in itself?
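
(Side note on the multiple-chains problem: if the failure is multiprocessing-related, which custom step methods sometimes trigger via pickling, running the chains sequentially may work as a stopgap. A sketch:)

```python
with model:
    # cores=1 runs the requested chains one after another in the
    # main process, sidestepping pickling of the custom step method
    idata = pm.sample(chains=4, cores=1)
```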

Ya, unfortunately it’s best not to use the warmup chain(s) for estimation. The fact that the step size is adapting during warmup breaks the properties that help ensure the sampler is approximating the posterior. Gelman et al. put it this way:

As with MCMC tuning more generally, any adaptation can go on during the warm-up period, but adaptation performed later on, during the simulations that will be used for inference, can cause the algorithm to converge to the wrong distribution. For example, suppose we were to increase the step size ε after high-probability jumps and decrease ε when the acceptance probability is low. Such an adaptation seems appealing but would destroy the detailed balance (that is, the property of the algorithm that the flow of probability mass from point A to B is the same as from B to A, for any points A and B in the posterior distribution) that is used to prove that the posterior distribution of interest is the stationary distribution of the Markov chain.

p. 304 of Bayesian Data Analysis

Your best course of action is to reparameterize and also try to locate the part of the parameter space that is causing your chain to blow up. I found this notebook really helpful for trying to understand funky chain behavior:
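
Separately, since “reparameterize” can mean a lot of things: with scale parameters involved, the usual first move is a non-centered parameterization. A minimal sketch, with purely illustrative names (mu, sigma, x):

```python
import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    # Centered version, prone to divergences when sigma gets small:
    # x = pm.Normal("x", mu=mu, sigma=sigma, shape=10)

    # Non-centered version: sample a standardized variable, then shift
    # and rescale it, which decouples x's geometry from sigma
    x_raw = pm.Normal("x_raw", 0.0, 1.0, shape=10)
    x = pm.Deterministic("x", mu + sigma * x_raw)
```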


That’s what I feared :wink:

I came across a version of this notebook but am only starting to look at it in detail. Great idea to go and pin down where the divergences occur (in my case that won’t be easy: dimensionality, AR(1) processes, and LKJCholesky are involved, among other things!). If I find anything interesting, I’ll report back here. Otherwise I’d say the question as originally formulated is resolved.
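
In case it’s useful to whoever lands here, this is the sort of thing I’m starting from to locate the divergences (a sketch; `idata` is the InferenceData returned by pm.sample, and the variable names are placeholders):

```python
import arviz as az

# Boolean mask of divergent transitions (chain x draw)
div = idata.sample_stats["diverging"]
print(int(div.sum()), "divergent transitions")

# Pair plot with divergent draws highlighted; restrict var_names
# (placeholders here) to a couple of suspect parameters to keep
# the dimensionality manageable
az.plot_pair(idata, var_names=["sigma", "x"], divergences=True)
```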

Thanks for your replies @daniel-saunders-phil and @cluhmann.
