Interruptible MCMC

I sometimes have to run models where inference takes a very long time. Even if I were willing to let the PC run for 50+ hours, computational environments aren’t arbitrarily stable and will occasionally crash at random. This got me thinking: how about an interruptible implementation of MCMC? The sampler could save its state periodically and let the user resume where it left off in the event of an interruption. I think it would also be very handy for long sessions, where it may be more convenient to occasionally “pause” sampling when system resources are needed elsewhere, and resume later.

P.S. I’ve never written an MCMC implementation myself, so I’ve no idea how easy or realistic this is, but based on the Markov property it seems possible.

You can achieve some of these things by using McBackend in combination with its ClickHouseBackend.

However, restarting an MCMC run is much harder than most people think, because you have to take care of random number generator states and sampler tuning if you want to do it right. (Contributions welcome!)
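To make the random number generator point concrete, here is a minimal sketch of a checkpointable sampler. It is a toy random-walk Metropolis chain written with plain NumPy, not PyMC's sampler, and the names `run`, `initial_state` and the layout of the state dictionary are made up for illustration. The idea it shows is that the checkpoint must contain the generator state as well as the current position if the resumed chain is to match an uninterrupted one.

```python
import pickle
import numpy as np

def log_prob(x):
    # Standard normal target, up to an additive constant.
    return -0.5 * x * x

def run(state, n_steps, step_size=1.0):
    """Advance a random-walk Metropolis chain by n_steps from `state`."""
    rng = np.random.default_rng()
    rng.bit_generator.state = state["rng_state"]  # restore the generator exactly
    x = state["x"]
    draws = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.normal()
        if np.log(rng.uniform()) < log_prob(proposal) - log_prob(x):
            x = proposal
        draws.append(x)
    # Everything needed to continue: current position and generator state.
    new_state = {"x": x, "rng_state": rng.bit_generator.state}
    return np.array(draws), new_state

def initial_state(seed):
    # Hypothetical helper: start at 0.0 with a freshly seeded generator.
    return {"x": 0.0, "rng_state": np.random.default_rng(seed).bit_generator.state}

# Uninterrupted reference run.
full, _ = run(initial_state(123), 1000)

# Same run, "interrupted" after 400 steps and resumed from a pickled checkpoint.
first, checkpoint = run(initial_state(123), 400)
blob = pickle.dumps(checkpoint)  # in practice this would be written to disk
second, _ = run(pickle.loads(blob), 600)

resumed = np.concatenate([first, second])
print(np.array_equal(full, resumed))
```

If the generator state is dropped from the checkpoint and a fresh seed is used on resume, the chain is still a valid Markov chain, but it no longer reproduces the uninterrupted run draw for draw.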

Can you be more specific about what causes crashes in your model?

They are somewhat random: broken pipe errors, device disconnects, etc. This is intended more as a discussion of a potential feature than as a report of some specific issue; these issues were just the inspiration. As I mentioned, I think this feature would be handy even for large models that are slow to sample. One could, for example, leave the computer to run MCMC when it is not otherwise occupied. Especially in research applications, people usually run inference on PCs or laptops.

I don’t know much about how MCMC is implemented under the hood, but from what I understand tuning concerns the mass matrix, which should be serializable. As for the random number generator states, I’d personally be willing to ignore them and give up reproducibility. Maybe that could be an optional feature.

In Lattice Quantum Field Theory, restarting an MCMC is perfectly standard. It is always done that way, because you send one- or two-day jobs to the batch system, but generating your trace is a project which usually takes a few months or even years.

We often even restart on different supercomputers and different job geometries, therefore with different random generator states.

But we don’t have a mass matrix or any other auto-tuning. That’s where the complication is. We just restart from the same input file with the same integrator parameters.

So it’s not about general properties of MCMC but rather about the PyMC implementation.

Interesting. Well, my knowledge of MCMC is pretty much limited to M. Betancourt’s (primarily descriptive) paper, but from my understanding the mass matrix has to be estimated, at least for statistics applications. Still, the mass matrix is just a matrix, so it can very well be serialized, pickled, etc. Maybe we could allow interruption only after tuning has finished, if that’s the issue?
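As a rough illustration of the "serialize it after tuning" idea, the sketch below saves and restores a set of tuned quantities with NumPy. The field names and values are hypothetical and are not PyMC's internal layout; a real sampler may keep more state than this (for example, adaptation windows that are still in progress).

```python
import io
import numpy as np

# Hypothetical quantities a finished tuning phase might leave behind
# for a 3-parameter model (values are made up for illustration).
tuned = {
    "step_size": np.float64(0.12),
    "mass_matrix_diag": np.array([1.5, 0.3, 2.0]),
    "position": np.array([0.1, -0.4, 2.2]),
}

buffer = io.BytesIO()  # stands in for a file such as "checkpoint.npz"
np.savez(buffer, **tuned)

# Later, possibly in a new process: read everything back.
buffer.seek(0)
with np.load(buffer) as data:
    restored = {name: data[name] for name in data.files}

print(np.array_equal(restored["mass_matrix_diag"], tuned["mass_matrix_diag"]))
```

A dense mass matrix would be stored the same way as a 2-D array. Combined with a saved random number generator state, this would cover the fixed-parameter restart described for lattice simulations, under the assumption that no adaptation continues after the checkpoint.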