Part of the reason I ask is that my research group is working on similar methods, and we’re curious how much users of PyMC want and/or have access to methods which run many parallel chains, since for many problems it can lead to very large wall-clock time speed ups (e.g. if burn-in time is small compared to the length of the chain).
PyMC has the DEMetropolis and DEMetropolisZ step samplers, which I think are in the spirit of ensemble samplers. I’m thinking of emcee, for example, which can also be used with PyMC if you just use PyMC to get a logp function – see here for an out-of-date example that still rhymes with the modern API if you replace theano with pytensor throughout.
Do you have a sense of what types of problems ensemble sampling outperforms HMC on? That kind of info would be useful for guiding users toward one or the other. Everything I’ve read basically says HMC is king, but admittedly I haven’t read much.
Thanks! Actually I’m imagining ensemble HMC settings (e.g. An Adaptive-MCMC Scheme for Setting Trajectory Lengths in Hamiltonian Monte Carlo). That is, you run N HMC (or other) chains in parallel, where N is the number of effective samples you want. Then all you need is to take a single sample from each chain after burn-in. If burn-in is fast, this is of course very fast in wall-clock time, but it seems like the stats community hasn’t really embraced these methods yet, and I’m trying to work out what the obstacles/downsides are.
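To make the regime concrete, here’s a toy numpy sketch (pure illustration, not PyMC code) of the many-chains/one-draw idea: run thousands of vectorized random-walk Metropolis chains on a stand-in target, and after burn-in treat the final state of each chain as one approximately independent draw.

```python
import numpy as np

def logp(x):
    # standard-normal target, a stand-in for a real posterior
    return -0.5 * x**2

rng = np.random.default_rng(0)
n_chains, n_burnin = 4096, 500
x = rng.normal(size=n_chains) * 5.0           # overdispersed initialization
for _ in range(n_burnin):
    prop = x + rng.normal(size=n_chains)      # one vectorized MH step for every chain
    accept = np.log(rng.uniform(size=n_chains)) < logp(prop) - logp(x)
    x = np.where(accept, prop, x)
draws = x                                      # one draw per chain after burn-in
```

The wall-clock cost is just burn-in plus one step, regardless of how many effective samples you want, as long as you can afford the parallel chains.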
With Stan, this kind of parallelization is not straightforward, but in e.g. JAX it is, so if you can fit N copies of your chain on a GPU, you get huge speedups basically for free (and in theory you can also be clever about tuning by sharing info between chains during warmup).
Ah yeah this is super interesting. I’ve seen this paper which is thinking about how to assess convergence under this regime of many chains, few draws (and which cites the ChESS-MCMC paper you linked). Not sure how cloning tuned chains fits into everything – it seems like an obvious(ly good) idea, so it’s a bit shocking it’s not something that’s widely talked about. @aseyboldt or @bob-carpenter would have more learned opinions on the matter.
A good way to trigger a long response from an academic is to ask them about what they’re currently working on!
I followed Charles Margossian’s work on convergence analysis starting from when Charles was an intern with Matt Hoffman (the developer of NUTS). On the good news front, Charles finished his postdoc with us at Flatiron and is starting a faculty job at UBC in Vancouver in the fall. And he’ll be looking for grad students and postdocs interested in computational Bayesian statistics.
This all started when Matt dove into parallel GPU processing and observed that the optimal wall time comes from running massively parallel chains and then taking one draw from each. There’s a very nice summary of this line of thinking lined up for the 2nd edition of the MCMC Handbook, which is available as a draft on arXiv (it’s also a tutorial on JAX and has working code examples):
The parallelization bottleneck is convergence, so using cross-chain information, as in Matt and Pavel Sountsov’s ChEES and MEADS samplers, is great if it can improve convergence speed.
Using cross-chain info the way Matt and Pavel did has a fundamental drawback: all the tuning is global, whereas complicated Bayesian models have varying geometry (the Hessian [matrix of second derivatives] isn’t constant the way it is for a multivariate normal), so no single global tuning fits everywhere. You might also want to dive into one of the earlier samplers, the affine invariant sampler of Goodman and Weare. The best reference I know is this new paper from Yifan Liu (formerly a postdoc at NYU and starting a faculty job in the fall at UCLA):
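For readers who haven’t seen it, the core of the Goodman & Weare sampler is the affine-invariant “stretch move”. Here’s a minimal numpy sketch (my own illustration, not code from the paper):

```python
import numpy as np

def logp(x):
    # deliberately anisotropic Gaussian target; affine invariance means
    # the sampler needs no knowledge of these scales
    return -0.5 * (x[0]**2 / 100.0 + x[1]**2)

def stretch_move(walkers, logp, rng, a=2.0):
    """One Goodman & Weare stretch-move sweep over all walkers (naive loop)."""
    n, d = walkers.shape
    lp = np.array([logp(w) for w in walkers])
    for k in range(n):
        j = rng.integers(n - 1)
        if j >= k:
            j += 1                                       # complementary walker, j != k
        z = ((a - 1.0) * rng.uniform() + 1.0) ** 2 / a   # z ~ g(z) proportional to 1/sqrt(z)
        prop = walkers[j] + z * (walkers[k] - walkers[j])
        lp_prop = logp(prop)
        if np.log(rng.uniform()) < (d - 1) * np.log(z) + lp_prop - lp[k]:
            walkers[k], lp[k] = prop, lp_prop
    return walkers
```

The z^(d−1) factor in the acceptance accounts for the volume change of the stretch; emcee implements essentially this move, vectorized over half the ensemble at a time.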
What’s really nice is that it doesn’t even need a mass matrix, and Yifan extends it to HMC.
I think we’ve cracked the step-size adaptation problem. We’re about to release a revision of our GIST sampler that moves from tuning the step size once per MCMC trajectory to tuning it at every leapfrog step. The initial paper is out, but a better one is on its way to arXiv next week. This is very much how ODE integrators work, but most of them are not symplectic (therefore not volume preserving, so the Jacobian determinant isn’t one and you need a more complicated acceptance step, which makes them less great for MCMC).
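For context, the baseline these schemes build on is the leapfrog integrator, which is symplectic and hence volume preserving, so the Metropolis acceptance only needs the energy difference. A minimal numpy sketch (standard leapfrog, not the GIST step-size scheme itself) showing reversibility and near-conservation of energy:

```python
import numpy as np

def grad_U(q):
    return q            # potential U(q) = q^2 / 2, i.e. a standard-normal target

def leapfrog(q, p, eps, n_steps):
    """Leapfrog integrator: half kick, alternating drifts and kicks, half kick.
    Symplectic and volume preserving, so no Jacobian correction is needed
    in the Metropolis acceptance."""
    p = p - 0.5 * eps * grad_U(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p - eps * grad_U(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)
    return q, p

q0, p0 = 1.0, 0.5
q1, p1 = leapfrog(q0, p0, eps=0.1, n_steps=20)
qb, pb = leapfrog(q1, -p1, eps=0.1, n_steps=20)   # flip momentum and integrate back
# reversibility: (qb, -pb) lands back at (q0, p0), up to floating point
```

The energy error along the trajectory stays bounded (order eps squared here), which is what keeps acceptance rates high.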
I think how to tune mass matrices locally is a problem we’re turning to next, but I don’t have any good leads on this. What I’ve implemented just these last couple of weeks is a version that uses our variable step size, but with a continuous version of @aseyboldt’s Nutpie sampler (rather than working in blocks, it exponentially discounts the past and tunes the discount factor to be similar to what NUTS did, without the blocking). The code is in good shape on this branch:
and the paper should be out on arXiv this week. The plan is to wrap the code up like Adrian did for Nutpie and release it so that it’ll work for both Stan and PyMC models. Adrian’s initialization with absolute gradients and tuning using both draws and scores works really well, as those who have tried Nutpie have seen. This is, unfortunately, also a global adaptation of the mass matrix.
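The exponential-discounting idea is easy to sketch. Here’s an illustrative numpy version of a discounted Welford mean/variance estimator (my own toy, not the actual Nutpie or branch code): old draws fade out smoothly instead of being dropped at block boundaries.

```python
import numpy as np

class DiscountedVariance:
    """Running mean/variance with exponential discounting of the past.
    A continuous alternative to block-based (windowed) adaptation:
    the weight of a draw decays geometrically with its age."""
    def __init__(self, dim, discount=0.99):
        self.discount = discount
        self.w = 0.0                  # total discounted weight
        self.mean = np.zeros(dim)
        self.m2 = np.zeros(dim)       # discounted sum of squared deviations
    def update(self, x):
        self.w = self.discount * self.w + 1.0
        self.m2 *= self.discount
        delta = x - self.mean
        self.mean += delta / self.w
        self.m2 += delta * (x - self.mean)
    def variance(self):
        return self.m2 / self.w
```

The discount factor plays the role the block length did: closer to 1 means a longer memory, and it can itself be tuned during warmup.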
What we really need is local adaptation of the mass matrix. But we can’t afford cubic or really even quadratic time in high dimensions. Hessian-vector products are linear, so there’s some hope, or maybe something low rank plus diagonal? It’s an open research problem if anyone has any ideas or wants to join me, Nawaf Bou-Rabee, Sifan Liu, and Tore Kleppe in figuring this out.
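To illustrate why Hessian-vector products give some hope: you can get Hv with one extra gradient evaluation, never forming the quadratic-sized Hessian. A quick numpy sketch using a finite difference of the gradient (forward-over-reverse autodiff does the same job exactly):

```python
import numpy as np

def grad_logp(x):
    # gradient of a toy log density with a non-constant Hessian:
    # logp(x) = -x^2/2 - x^4/8 componentwise
    return -x - 0.5 * x**3

def hvp(grad_f, x, v, eps=1e-5):
    """Hessian-vector product via a central finite difference of the gradient:
    H(x) v is approximately (grad_f(x + eps v) - grad_f(x - eps v)) / (2 eps).
    Cost: two gradient evaluations, i.e. linear in the dimension."""
    return (grad_f(x + eps * v) - grad_f(x - eps * v)) / (2 * eps)

x = np.array([1.0, -2.0])
v = np.array([0.3, 0.7])
exact = (-1.0 - 1.5 * x**2) * v    # the Hessian here is diag(-1 - 1.5 x^2)
approx = hvp(grad_logp, x, v)
```

That gives matrix-free access to curvature along any direction, which is the kind of primitive a low-rank-plus-diagonal local metric could be built from.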
P.S. It’s “cheese” MCMC not “chess”—Matt’s into food-based sampler names.
P.P.S. If anyone knows other references we should be looking at, please let us know.
I see the Fisher HMC normalizing flow sampler as exactly this: a way to get a locally changing mass matrix, as in Riemannian HMC, while at the same time keeping the leapfrog step cheap. We just never actually store the local mass matrix; it’s implied by the transformation. We could compute it by looking at the pullback of the metric through the normalizing flow if we wanted to.
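As a sanity check on that last point, here’s a tiny numpy example (my own, affine special case, not the flow sampler itself): for a linear “flow” z = L⁻¹(x − μ) that maps a Gaussian posterior with covariance Σ = LLᵀ to a standard normal, the pullback of the Euclidean metric on z-space is JᵀJ, which recovers exactly the inverse-covariance metric Riemannian HMC would want.

```python
import numpy as np

# Posterior covariance and its Cholesky factor: a perfect affine "flow"
# maps x to z = L^{-1} (x - mu), where Sigma = L L^T.
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])
L = np.linalg.cholesky(Sigma)
J = np.linalg.inv(L)        # Jacobian dz/dx of the flow (constant here)

# Pullback of the Euclidean metric on z-space: M(x) = J^T J.
# For this affine flow, M equals Sigma^{-1} exactly.
M = J.T @ J
```

With a nonlinear flow, J depends on x and the same formula gives a genuinely position-dependent metric, without ever storing it during sampling.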
Great point, @aseyboldt. I keep forgetting about more general methods than mass matrices. Not to mention that you have one in nutpie! If the flow has enough capacity and can be trained, a flow could perfectly condition a distribution back to standard normal.
Have you ever evaluated the condition number of the pullback applied to reference draws from a posterior? I think that’d give interesting insight into whether it’s succeeding in conditioning well everywhere. I say “reference draws” because you’d probably want to generate the draws independently of nutpie. I should have time to try this out myself in a few weeks; just wrapping up a bunch of stuff now.
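A minimal version of that diagnostic (toy example with a linear transform standing in for the flow, not nutpie code): push reference draws through the transform and look at the condition number of their covariance. Near 1 means the flow conditions well where those draws live.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[9.0, 0.0],
                  [0.0, 1.0]])
x = rng.multivariate_normal([0.0, 0.0], Sigma, size=5000)  # "reference draws"

def diagnose(transform, draws):
    """Condition number of the transformed draws' covariance:
    close to 1 means the transform whitens well in that region."""
    z = draws @ transform.T
    cov = np.cov(z, rowvar=False)
    return np.linalg.cond(cov)

perfect = np.diag([1.0 / 3.0, 1.0])   # exact whitening for this Sigma
none = np.eye(2)                      # no preconditioning at all
```

For a nonlinear flow you’d use its local Jacobians rather than one global matrix, and could report the diagnostic per region to see where conditioning fails.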
Have you ever evaluated the posterior generated from the flow itself? If it’s a perfect preconditioner, shouldn’t it generate independent draws from the posterior?
Whenever we’ve used flows for variational inference, we’ve followed it up with importance sampling, which has worked well to remove some of the flow artifacts that pop up (like narrow high-energy bridges between modes with a unimodal standard normal source). But I’m really wondering these days if unadjusted methods may be good enough in a lot of cases. Importance sampling only works if you’re pretty close or if there are just a few unwanted regions that need to be removed or heavily reweighted.
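For anyone unfamiliar with the flow-plus-importance-sampling recipe, here’s a self-normalized importance sampling sketch in numpy, with a deliberately imperfect Gaussian “flow” standing in for a trained normalizing flow:

```python
import numpy as np

def log_target(x):
    return -0.5 * x**2                               # stand-in posterior: standard normal

def log_proposal(x, scale=1.5):
    # imperfect "flow" approximation: too wide, which is the safe direction
    return -0.5 * (x / scale)**2 - np.log(scale)

rng = np.random.default_rng(2)
scale = 1.5
x = rng.normal(size=20000) * scale                   # draws from the approximation
logw = log_target(x) - log_proposal(x, scale)
w = np.exp(logw - logw.max())                        # stabilize before normalizing
w /= w.sum()
est = np.sum(w * x**2)   # self-normalized IS estimate of E[x^2] under the target
```

The reweighting only fixes things when the weights are well behaved; if the flow misses mass badly (the “pretty close” caveat above), the weight distribution degenerates and the estimate falls apart.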