Interruptible MCMC

Thops · July 12, 2023, 7:11am

I sometimes have to run models where inference takes a very long time. Even if I were willing to let the PC run for 50+ hours, unfortunately computational environments aren’t arbitrarily stable and will occasionally crash randomly. This got me thinking: how about an interruptible implementation of MCMC? Maybe allow the sampler to save its state periodically and allow the user to resume where it left off in the event of an interruption. I think it would also be very handy for long sessions, where it may be more convenient to occasionally “pause” sampling when system resources are needed elsewhere, and resume later.

P.S. I’ve never written an MCMC implementation myself, so I’ve no idea how easy or realistic this is, but based on the Markov property it seems possible.

michaelosthege · July 12, 2023, 12:16pm

You can achieve some of these things by using McBackend in combination with its ClickHouseBackend.

However, the restarting of MCMCs is much harder than most people think, because you’ll have to take care of random number generator states and sampler tuning if you want to do it right. (Contributions welcome!)
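To make the random-number-generator point concrete, here is a minimal sketch of a checkpointable sampler. It is not PyMC or McBackend code: it is a toy random-walk Metropolis chain in plain Python with no tuning at all, and the names `initial_state`, `run_metropolis`, `save_checkpoint` and `load_checkpoint` are hypothetical. The assumption is that position, step size and generator state are everything this chain needs; a tuned sampler would have to carry more.

```python
import math
import os
import pickle
import random


def log_prob(x):
    """Log-density of a standard normal target (illustrative only)."""
    return -0.5 * sum(v * v for v in x)


def initial_state(seed, dim=2, step_size=0.5):
    """Everything the toy chain needs in order to continue, as a plain dict."""
    return {
        "position": [0.0] * dim,
        "step_size": step_size,
        "rng_state": random.Random(seed).getstate(),
    }


def run_metropolis(state, n_steps):
    """Advance the chain by n_steps; return the draws and the new state."""
    rng = random.Random()
    rng.setstate(state["rng_state"])  # restore the generator exactly
    x = list(state["position"])
    step = state["step_size"]
    draws = []
    for _ in range(n_steps):
        proposal = [v + step * rng.gauss(0.0, 1.0) for v in x]
        # Accept with probability min(1, p(proposal) / p(x)).
        if math.log(1.0 - rng.random()) < log_prob(proposal) - log_prob(x):
            x = proposal
        draws.append(list(x))
    new_state = {"position": x, "step_size": step, "rng_state": rng.getstate()}
    return draws, new_state


def save_checkpoint(path, state):
    """Write to a temporary file first so a crash never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

With the generator state saved, a run that is interrupted after 100 steps and resumed for another 100 yields the same draws as an uninterrupted 200-step run. Dropping `rng_state` and reseeding on resume would still give a valid chain, just not a reproducible one.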

Can you be more specific about what causes crashes in your model?

Thops · July 12, 2023, 1:10pm

They are somewhat random: broken pipe errors, device disconnects, etc. This is intended more as a discussion of a potential feature than a report of a specific issue; those crashes were just the inspiration. As I mentioned, I think this feature would be handy even for large models that are simply slow to sample. One could, for example, leave the computer to run MCMC whenever it is not otherwise occupied. Especially in research applications, people usually run inference on PCs / laptops.

I don’t know much about how MCMC is implemented under the hood, but from what I understand tuning concerns the mass matrix, which should be serializable. As for the random number generator states, I’d personally be willing to ignore them and their reproducibility. Maybe it could be an optional feature.

julien · July 12, 2023, 6:42pm

In Lattice Quantum Field Theory, restarting an MCMC is perfectly standard. It is always done that way, because you send one- or two-day jobs to the batch system, but generating your trace is a project which usually takes a few months or even years.

We often even restart on different supercomputers and different job geometries, therefore with different random generator states.

But we don’t have a mass matrix or any other auto-tuning. That’s where the complication is. We just restart from the same input file with the same integrator parameters.

So it’s not about general properties of MCMC but rather about the PyMC implementation.

Thops · July 13, 2023, 7:43am

Interesting. Well, my knowledge of MCMC is pretty much limited to M. Betancourt’s (primarily descriptive) paper, but from my understanding the mass matrix has to be estimated, at least for statistics applications. It is still just a matrix, though, so it can very well be serialized, pickled, etc. Maybe sampling could be made interruptible only after tuning, if that’s the issue?
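A rough sketch of that "interruptible only after tuning" idea, under loud assumptions: this is a toy HMC sampler in plain Python, not PyMC's NUTS, the mass matrix is diagonal and estimated from warmup variances in a single pass, and the step size is fixed rather than adapted. The names `tune`, `sample`, `hmc_step` and the state keys are hypothetical. A real sampler's tuning state is likely richer than this.

```python
import json
import math
import random

SCALES = [1.0, 5.0]  # illustrative target: independent normals with these std devs


def log_prob(x):
    return -0.5 * sum((v / s) ** 2 for v, s in zip(x, SCALES))


def grad_log_prob(x):
    return [-v / s ** 2 for v, s in zip(x, SCALES)]


def hmc_step(x, step_size, n_leapfrog, inv_mass, rng):
    """One HMC transition with a diagonal mass matrix M = diag(1 / inv_mass)."""
    p = [rng.gauss(0.0, 1.0) / math.sqrt(im) for im in inv_mass]  # p ~ N(0, M)

    def energy(q, mom):
        return -log_prob(q) + 0.5 * sum(im * m * m for im, m in zip(inv_mass, mom))

    h0 = energy(x, p)
    q, g = list(x), grad_log_prob(x)
    for _ in range(n_leapfrog):  # leapfrog integration
        p = [m + 0.5 * step_size * gi for m, gi in zip(p, g)]
        q = [qi + step_size * im * m for qi, im, m in zip(q, inv_mass, p)]
        g = grad_log_prob(q)
        p = [m + 0.5 * step_size * gi for m, gi in zip(p, g)]
    accept = math.log(1.0 - rng.random()) < h0 - energy(q, p)
    return q if accept else list(x)


def tune(x, n_tune, rng, step_size=0.5, n_leapfrog=10):
    """Warm up with an identity mass matrix, then freeze the estimated variances."""
    inv_mass = [1.0] * len(x)
    draws = []
    for _ in range(n_tune):
        x = hmc_step(x, step_size, n_leapfrog, inv_mass, rng)
        draws.append(x)
    cols = list(zip(*draws))
    means = [sum(c) / n_tune for c in cols]
    variances = [
        max(sum((v - m) ** 2 for v in c) / (n_tune - 1), 1e-8)
        for c, m in zip(cols, means)
    ]
    # Everything needed to resume after tuning is plain numbers and lists.
    return {"position": x, "step_size": step_size,
            "n_leapfrog": n_leapfrog, "inv_mass_diag": variances}


def sample(state, n_draws, rng):
    """Draw with the frozen tuning parameters; return draws and the updated state."""
    x = state["position"]
    draws = []
    for _ in range(n_draws):
        x = hmc_step(x, state["step_size"], state["n_leapfrog"],
                     state["inv_mass_diag"], rng)
        draws.append(x)
    return draws, dict(state, position=x)


tuned = tune([0.0, 0.0], n_tune=500, rng=random.Random(1))
blob = json.dumps(tuned)  # this string is what would be written to disk
draws, state = sample(json.loads(blob), 200, random.Random())  # fresh RNG on resume
```

Once tuning is frozen, the checkpoint is just the last position plus a handful of numbers, and resuming with a fresh generator continues the same Markov chain without bitwise reproducibility. Interrupting during tuning is the harder case, because the adaptation's running estimates would need to be saved as well.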
