pm.sample Parameters and Optimization

I’ve been using the framework for some time now, but I can’t find certain information in the forum or the examples section. Specifically, I need guidance on using pm.sample correctly.

I recently listened to the Learning Bayesian Statistics podcast with Charles Margossian, where they discussed MCMC and Variational Inference. They mentioned things like the warmup phase and initializations for the ODE, but I’m unsure how to optimize tune, draws, and chains in pm.sample. I am referring to the default MCMC method (NUTS).

Additionally, they talked about running more chains instead of longer iterations. Does this make sense with NUTS?

I’m working on large inference problems where millions of parameters are estimated. Any advice would be greatly appreciated.

Millions of parameters is a lot. That is larger than any model I’ve ever seen fit successfully with PyMC or Stan (those were maybe in the large 10,000s or small 100,000s of parameters).
And I don’t think playing a bit with the number of tuning steps will be all that helpful.
Before fitting a model that big I’d try to find ways to reduce the number of parameters. Can you maybe marginalize out some parameters, or approximate that marginalization somehow? I’d also first try to make it work extremely well at smaller sizes, though models often start to break in new ways as you scale them up.
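
To make “marginalize out some parameters” concrete, here is a toy sketch (not your model, and only one of many possible setups) of a random-effect style model where per-observation latent parameters can be integrated out analytically, shrinking the parameter count from N + 3 to 3:

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(42)
y = rng.normal(loc=1.0, scale=1.5, size=500)  # toy data

# Version 1: one latent effect per observation -> N extra parameters to sample.
with pm.Model() as explicit:
    mu = pm.Normal("mu", 0, 10)
    sigma_u = pm.HalfNormal("sigma_u", 1)
    sigma_e = pm.HalfNormal("sigma_e", 1)
    u = pm.Normal("u", 0, sigma_u, shape=len(y))  # N latent parameters
    pm.Normal("y", mu + u, sigma_e, observed=y)

# Version 2: the latent u's integrated out analytically.
# If u_i ~ Normal(0, sigma_u) and y_i | u_i ~ Normal(mu + u_i, sigma_e),
# then marginally y_i ~ Normal(mu, sqrt(sigma_u**2 + sigma_e**2)).
# (In this toy setup only the sum sigma_u**2 + sigma_e**2 is identified;
# a real model would have more structure, e.g. repeated observations per group.)
with pm.Model() as marginalized:
    mu = pm.Normal("mu", 0, 10)
    sigma_u = pm.HalfNormal("sigma_u", 1)
    sigma_e = pm.HalfNormal("sigma_e", 1)
    pm.Normal("y", mu, pm.math.sqrt(sigma_u**2 + sigma_e**2), observed=y)

with marginalized:
    idata = pm.sample()  # far fewer parameters for NUTS to explore
```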

Thank you for your answer.
I can reduce the problem size to the small 100,000s of parameters, but my question regarding sampling parameters and diagnostics still remains.
Specifically, I’d like to know how to optimize tune, draws, and chains in pm.sample, especially when using the NUTS method. Additionally, does running more chains instead of longer iterations make sense in this context?

Typically there isn’t really a good reason to change the defaults; if they don’t work, most of the time the model is the problem, not those parameters.

But in general: the number of tuning steps needs to be large enough that the sampler has time to first reach the typical set and then learn the mass matrix and the step size. If you sample using nutpie with store_mass_matrix=True, you can watch how it adapts the mass matrix and, for instance, see whether it converges quickly enough. Increasing the number of tuning steps (hopefully) improves the mass matrix the sampler uses after tuning, which should lead to a higher number of effective samples per draw.
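
For concreteness, a minimal sketch on a toy model of where those arguments go and what to look at afterwards. The nutpie part follows the store_mass_matrix note above; the exact names of the stored adaptation statistics may differ between versions, so the code just lists what was recorded.

```python
import arviz as az
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=200)

with pm.Model() as model:
    mu = pm.Normal("mu", 0, 5)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu, sigma, observed=y)

    # Defaults are tune=1000, draws=1000; raising tune gives the sampler more
    # time to reach the typical set and adapt the mass matrix and step size.
    idata = pm.sample(tune=2000, draws=1000, chains=4, target_accept=0.9)

# Effective sample size per draw is what better tuning should improve.
print(az.summary(idata, var_names=["mu", "sigma"])[["ess_bulk", "ess_tail", "r_hat"]])

# With nutpie you can additionally store the mass matrix adaptation:
import nutpie

compiled = nutpie.compile_pymc_model(model)
trace = nutpie.sample(
    compiled, tune=2000, draws=1000, chains=4, store_mass_matrix=True
)
# The adaptation history ends up in the sample stats; inspect what got stored:
print(list(trace.sample_stats.data_vars))
```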

And about more chains vs more draws: I think that is mostly a question of robustness and some computational concerns.
Having more chains improves your diagnostics a bit, I think: for the same total number of draws, splitting them across more chains rather than fewer (i.e. 500 draws each from 8 chains vs 1000 draws each from 4 chains) gives you more opportunities to see whether some of the chains might be failing. If the autocorrelation is high compared to the number of draws per chain, I don’t know the consequences for sure, but I’d be skeptical of results that come from a large number of chains that each may not really have converged; after all, they might still have been initialized from similar points.

And computationally, you are paying the tuning cost separately for each chain, but the advantage of more chains is that they are trivial to parallelize.
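
A hedged illustration of that trade-off, again on a toy model rather than your actual problem: the same total number of posterior draws, split across different numbers of chains, compared with the usual ArviZ diagnostics.

```python
import arviz as az
import numpy as np
import pymc as pm

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=100)

with pm.Model() as toy:
    mu = pm.Normal("mu", 0, 5)
    pm.Normal("y", mu, 1.0, observed=y)

    # Same total of 4000 posterior draws, split across chains differently.
    # Each chain pays its own tuning steps, but with cores >= chains the
    # chains run in parallel, so wall-clock time stays close to one chain's.
    idata_4x1000 = pm.sample(draws=1000, chains=4, cores=4)
    idata_8x500 = pm.sample(draws=500, chains=8, cores=8)

# More chains give R-hat more independent trajectories to compare, but each
# chain contributes fewer draws, so per-chain convergence rests on less data.
for name, idata in [("4 x 1000", idata_4x1000), ("8 x 500", idata_8x500)]:
    print(name)
    print(az.summary(idata, var_names=["mu"])[["ess_bulk", "ess_tail", "r_hat"]])
```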
