Chains converge to local optima?

Hi everyone,
I’m running a fairly complex model that works for most of my input data, but sometimes I get output similar to this post, where two chains converge to different results than the other two. The two deviating chains also look quite correlated, possibly because they use the same initial values for some of the parameters.

The model is not inherently multi-modal (i.e. there is no multimodality in the prior, as in the example). My question is whether there is any intuition behind this behavior in the HMC/NUTS setting. In standard Metropolis-Hastings I’d suspect that two of the chains got stuck in local optima. Can this happen with NUTS as well? Clearly, two of the chains are in a “better place” when looking at the logp values (right plot). Do you have any suggestions on how to approach this? I tried reducing the target acceptance rate (0.9 in the plot; I tried going down to 0.8) but the problem persists…

Thanks a lot,
Arthus

Yes. Or at least conceptually it’s the same idea.

This would be the opposite of what you probably want. You want to increase the acceptance rate so that the sampler can more easily explore the space; decreasing it makes exploration more difficult.

Here is a nice animation of how NUTS behaves in a multi-modal setting. You can play with target_accept by adjusting the “leapfrog \Delta t” parameter (a higher target_accept implies a lower step size).

I would also check the log probability of your data under each of the modes in your posterior. If they are roughly equal, you have an identification problem. If they’re not, it might be something else. The lack of divergences makes me pretty confident it’s identification though.
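To make the comparison concrete, one way to check is to average the stored log-posterior per chain. A minimal sketch with synthetic numbers standing in for `idata.sample_stats.lp` (the made-up values roughly mimic the situation described above, not your actual model):

```python
import numpy as np

# Synthetic stand-in for idata.sample_stats.lp: shape (chains, draws).
# Two chains sit near one mode, two near another (values are made up).
rng = np.random.default_rng(0)
lp = np.vstack([
    30800 + rng.normal(0, 5, size=(2, 1000)),  # chains near the "good" mode
    30500 + rng.normal(0, 5, size=(2, 1000)),  # chains near the other mode
])

per_chain = lp.mean(axis=1)
print(per_chain)
gap = per_chain.max() - per_chain.min()
print(f"gap between chains: {gap:.0f} nats")
```

If the per-chain averages cluster into clearly separated groups like this, the chains really are sitting in different modes rather than just mixing slowly.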

Thank you for your replies! I played with the (very nice) animation, but with the naive NUTS I couldn’t really tell whether a larger or smaller \Delta t explores the multimodality better. From MH I would expect the opposite of what @cluhmann stated, though: a smaller step size (= higher target acceptance) means getting stuck more easily. Am I wrong here?

Edit: With the efficient NUTS it was clearer that a higher target acceptance and smaller step size lead to better exploration of the multiple modes. I still don’t understand why though…

Regarding the log posterior: the chain plots above are the logp values from the inference_data output. Is this what you meant? I have no good intuition for what “roughly equal” means here, but since these are log values I wouldn’t think ~30500 and ~30800 are “roughly equal”…

I think target accept and multi-modality are largely (not entirely) separate issues. Small step sizes just explore difficult posteriors better in general. They adhere to the surface of the posterior better so chains can crawl through low-probability regions and make it over to a new mode.

I wouldn’t get too hung up on that though. It’s a losing battle to fix multi-modality by fiddling with the sampler’s parameters. The best option is to think mechanically about what the two modes represent in terms of your target system. Do both modes make sense given domain knowledge? If not, you might try devising priors or changes to the model’s structure that force the posterior away from the implausible mode. One thing that sometimes helps in case of multi-modality is an order constraint.
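For the common case where the modes come from label switching in a mixture (the likelihood is invariant to permuting components), here is a numpy sketch of why an order constraint helps, shown as post-hoc relabeling of draws (a hypothetical two-component example, not your model — in practice you’d build the constraint into the model itself):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior draws of two mixture-component means. Half the
# chains labelled the components one way, half the other, producing two
# symmetric modes: (-2, 3) and (3, -2).
draws_a = rng.normal([-2.0, 3.0], 0.1, size=(500, 2))
draws_b = rng.normal([3.0, -2.0], 0.1, size=(500, 2))
draws = np.vstack([draws_a, draws_b])

print("component means before relabeling:", draws.mean(axis=0))
# Averaging across the two symmetric modes gives ~0.5 for both: nonsense.

# Enforce the order constraint mu_1 < mu_2 by sorting each draw.
ordered = np.sort(draws, axis=1)
print("component means after relabeling: ", ordered.mean(axis=0))
# One mode, sensible estimates near (-2, 3).
```

Sorting after the fact only works for pure permutation symmetry; the in-model order constraint does the same job while also removing the redundant mode the sampler would otherwise have to explore.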

The second best option is to switch inference algorithms. I don’t have a ton of expertise here, but SMC is often mentioned. A surprising amount of the blackjax sampling book is about algorithms that work for multi-modal posteriors: The Sampling Book Project.

p.s. yeah I’d agree two of your chains are better and it’s a fairly wide gap.
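To put a number on “wide”: a quick back-of-envelope using the ~30500 vs ~30800 values from above. A 300-nat gap means the lower mode’s density is smaller by a factor so large it only fits in log space:

```python
import math

lp_high, lp_low = 30800.0, 30500.0
gap = lp_high - lp_low                 # ~300 nats

# exp(300) overflows a float, so convert the gap to base 10 instead.
log10_ratio = gap / math.log(10)
print(f"density ratio ~ 10^{log10_ratio:.0f}")
```

That’s a ratio around 10^130, so by any reasonable standard the two groups of chains are not “roughly equal”.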


Thanks everyone! Even though I don’t understand exactly what is going on, resorting to much stronger priors seems to solve the issue.


There are two things going on. One, your posterior has a local mode. This isn’t the fault of the sampler—it’s something going on with the combination of model and data.

There are two situations here we can distinguish. In the first, there are two modes that both contribute non-negligible probability mass. That is, if we took independent draws, we’d get draws from both modes. In the second, there are “minor modes” that contribute negligible probability mass and would never show up with independent samples from the posterior. Those you want to get rid of somehow by tightening the model (prior and/or likelihood) or using a more robust sampler.

First, nothing works on multi-modal posteriors in general—it’s an NP-hard problem.

SMC will just remove minor modes during resampling. This can be a big problem for SMC if the modes are unbalanced, e.g. one with 99% of the posterior mass and one with 1%. In that case it’s hard to keep the 1% mode from disappearing without a whole lot of particles. And you still need to find the modes to initialize, either with very careful annealing or explicit initialization. You won’t find widely separated modes in SMC during updates, as it relies on ordinary local sampling at that point.
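To illustrate how easily a 1% mode dies in resampling: with N particles and multinomial resampling, the minor mode survives a single resampling step with probability 1 − 0.99^N, and it has to survive every stage. A quick calculation (not tied to any particular SMC implementation; the repeated-stage figure uses a crude independence approximation):

```python
w_minor = 0.01  # posterior mass in the minor mode

for n_particles in (100, 1000, 10000):
    # Probability that at least one resampled particle lands in the minor mode.
    survive_once = 1 - (1 - w_minor) ** n_particles
    # Rough chance it is still represented after 20 resampling stages,
    # treating stages as independent (an optimistic simplification).
    survive_20 = survive_once ** 20
    print(f"N={n_particles:>5}: one stage {survive_once:.3f}, "
          f"20 stages {survive_20:.3f}")
```

With 100 particles the minor mode is almost certainly gone after a handful of resampling stages; pushing survival near 1 takes thousands of particles, which is the “whole lot of particles” point above.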
