I noticed that the default initialization procedure is no longer ADVI but `jitter+adapt_diag`. Could anyone explain this in simple terms, and why it is preferred to initializing with ADVI?
With `adapt_diag` enabled, NUTS changes the (diagonal) mass matrix during tuning to match the variance of the posterior samples so far (using sliding windows so that early nonsense doesn't mess everything up). In cases where the ADVI solution is very different from the actual posterior, this can improve mixing a great deal. Basically, `adapt_diag` is more robust than ADVI.
We also noticed that running both ADVI and mass matrix adaptation isn't worth it most of the time, especially when taking into account the compilation time for ADVI. In some very large models this might be different; if so, you can set the init method to `advi+adapt_diag`.
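For anyone who wants to compare the two, here is a minimal sketch; the model itself is a made-up toy, and the `init` argument of `pm.sample` is the real knob being discussed:

```python
import pymc3 as pm

# Toy model purely to illustrate switching initialization strategies.
with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sd=1.0)
    obs = pm.Normal("obs", mu=mu, sd=1.0, observed=[0.1, -0.3, 0.2])

    # The new default: jittered start + diagonal mass matrix adaptation.
    trace_default = pm.sample(init="jitter+adapt_diag")

    # For very large models, seeding the mass matrix from ADVI's
    # variance estimate may still pay off.
    trace_advi = pm.sample(init="advi+adapt_diag")
```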
If you know of cases where the effective number of samples per unit time (the minimum of `pm.effective_n(trace)` divided by the total sampling time) decreased because of this change, we'd be interested to hear about it.
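In case it helps, a rough sketch of how one might measure that metric, assuming a `model` is already defined (`pm.effective_n` returns per-variable effective sample sizes, which can be arrays for vector-valued variables):

```python
import time
import numpy as np
import pymc3 as pm

with model:
    start = time.time()
    trace = pm.sample(init="jitter+adapt_diag")
    elapsed = time.time() - start

# Smallest effective sample size across all variables and dimensions,
# normalized by total sampling wall time.
n_eff = pm.effective_n(trace)
min_ess = min(np.min(v) for v in n_eff.values())
print(min_ess / elapsed)
```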
The `jitter` means that the starting position, for parameters where it isn't specified explicitly, is drawn from uniform(-1, 1) on the transformed space, so that different chains use different initial parameters.
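A conceptual sketch of the jitter (this is not the library's internal code, just an illustration of the idea):

```python
import numpy as np

# Each chain gets an independent uniform(-1, 1) draw for every free
# parameter on the transformed space, so no two chains start at the
# same point.
n_chains, n_free_params = 4, 3
rng = np.random.RandomState(0)
starts = rng.uniform(-1.0, 1.0, size=(n_chains, n_free_params))
print(starts)  # one row per chain
```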
Thanks for the explanation!
I can actually test whether `advi+adapt_diag` is worth it, because I'm running a very large model at the moment. I'll set another version going and report back once it's done.
Turns out that for my models at least (which are huge), it is definitely worth initializing with ADVI. In fact, without ADVI I often get the "bad initial energy" error at some point during sampling.
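For what it's worth, a quick way to start investigating "bad initial energy" errors (assuming a PyMC3 `model` is in scope) is to inspect the log-probability of each variable at the starting point; a `-inf` or `nan` entry usually points at the culprit:

```python
# Log-probability of every variable evaluated at the model's test point.
print(model.check_test_point())
```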