I’ve gotten larger effective sample sizes and fewer divergences when using jitter+adapt_full than with jitter+adapt_diag. However, I’m a bit nervous about using jitter+adapt_full because of the warning “QuadPotentialFullAdapt is an experimental feature.” Could anyone say more about QuadPotentialFullAdapt and the potential pitfalls of using this feature?
Thank you in advance!
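For anyone who wants to run the same comparison on their own model, here is a minimal sketch. The helper name compare_inits and the tune/draws/chains values are my own choices for illustration, not something from the PyMC docs; the only PyMC and ArviZ calls used are pm.sample(init=...) and az.summary.

```python
def compare_inits(model, inits=("jitter+adapt_diag", "jitter+adapt_full"), seed=1):
    """Sample `model` once per init method; report minimum bulk ESS and divergences.

    `model` is an already-built pm.Model with continuous variables only,
    so that pm.sample assigns NUTS and the `init` argument is used.
    """
    # Imported lazily so that defining the helper does not require PyMC.
    import arviz as az
    import pymc as pm

    results = {}
    for init in inits:
        with model:
            idata = pm.sample(
                init=init,
                tune=1000,
                draws=1000,
                chains=4,
                random_seed=seed,
                progressbar=False,
            )
        results[init] = {
            # Smallest bulk effective sample size over all parameters.
            "min_ess_bulk": float(az.summary(idata)["ess_bulk"].min()),
            # Total number of divergent transitions after tuning.
            "divergences": int(idata.sample_stats["diverging"].sum()),
        }
    return results
```

Calling compare_inits(my_model) returns one small dict per init string, which makes it easy to see whether the ESS gain from adapt_full holds up across seeds.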
If you want to try something more experimental, and possibly even better, have a look at jitter+adapt_diag_grad (see pymc.init_nuts in the PyMC dev documentation).
For your general question, perhaps @aseyboldt can clarify.
Full mass matrix adaptation is not dangerous in any way; maybe we should actually get rid of that warning by now. Just keep in mind that it doesn’t scale well to high-dimensional models (at least in the current implementation).
We might want to change the exact tuning behavior of that method relatively soon, though I’d hope it only gets better if we do.
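To put rough numbers on the scaling caveat, here is a back-of-the-envelope sketch. It assumes a dense float64 matrix and the textbook flop count for a Cholesky factorization; the function names are made up for illustration and say nothing about how PyMC actually stores or factorizes the matrix.

```python
def diag_bytes(n):
    """Memory for a diagonal mass matrix: one float64 per parameter."""
    return 8 * n


def dense_bytes(n):
    """Memory for a dense n x n float64 mass matrix."""
    return 8 * n * n


def cholesky_flops(n):
    """Approximate flop count (n^3 / 3) for factorizing a dense n x n matrix."""
    return n**3 // 3


for n in (10, 1_000, 10_000):
    print(
        f"n={n:>6}: diag {diag_bytes(n):>12,} B, "
        f"dense {dense_bytes(n):>14,} B, "
        f"cholesky ~{cholesky_flops(n):,} flops"
    )
```

At 10,000 parameters the dense matrix alone is 800 MB, against 80 kB for the diagonal, and it has to be estimated from a limited number of warmup draws on top of that.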
Because Ricardo mentioned it:
There are a couple of other mass matrix adaptation methods that I’m working on. For one, jitter+adapt_diag_grad. As far as I currently understand things, this is usually better than our default, at least in terms of effective sample size per gradient evaluation. Just like the default, it also scales well to high dimensions. It still only adapts a diagonal mass matrix, so it can’t really handle highly correlated posteriors much better than the default, and in a low-dimensional setting jitter+adapt_full might still be better.
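A small 2D sketch of why a diagonal mass matrix can’t fix correlations. It uses the standard view that a diagonal mass matrix amounts to rescaling each parameter by its standard deviation, while a full one amounts to whitening with a Cholesky factor of the covariance. The numbers (scales 1 and 100, correlation 0.95) and the helper names are made up for illustration.

```python
import math


def cond_2x2(a, b, c):
    """Condition number of the symmetric positive definite matrix [[a, b], [b, c]]."""
    mid = (a + c) / 2
    rad = math.hypot((a - c) / 2, b)
    return (mid + rad) / (mid - rad)


def whiten_2x2(a, b, c):
    """Entries (w11, w12, w22) of L^-1 S L^-T, where S = [[a, b], [b, c]] = L L^T."""
    l11 = math.sqrt(a)
    l21 = b / l11
    l22 = math.sqrt(c - l21**2)
    r1 = (1 / l11, 0.0)                    # first row of L^-1
    r2 = (-l21 / (l11 * l22), 1 / l22)     # second row of L^-1

    def quad(u, v):                        # u^T S v
        return u[0] * (a * v[0] + b * v[1]) + u[1] * (b * v[0] + c * v[1])

    return quad(r1, r1), quad(r1, r2), quad(r2, r2)


s1, s2, rho = 1.0, 100.0, 0.95             # hypothetical posterior scales and correlation
a, b, c = s1**2, rho * s1 * s2, s2**2      # posterior covariance entries

raw = cond_2x2(a, b, c)                                      # no adaptation
diag = cond_2x2(a / s1**2, b / (s1 * s2), c / s2**2)         # diagonal: rescale by std
full = cond_2x2(*whiten_2x2(a, b, c))                        # full: whiten with Cholesky

print(f"raw: {raw:.0f}, diagonal: {diag:.1f}, full: {full:.3f}")
```

The diagonal rescaling removes the difference in scales but leaves the correlation matrix, whose condition number is (1 + rho) / (1 - rho) = 39 here; the full adaptation brings it down to about 1.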
For the high-dimensional case there is a completely new mass matrix adaptation algorithm in covadapt (cell 14), though this one really is very experimental.
In addition, there’s also a slightly modified implementation of jitter+adapt_diag_grad in nutpie that should converge faster to the same thing that
Maybe a naive question, but why isn’t there more freedom to “mix and match” the components of initialization methods for NUTS? It seems like all the available initialization methods boil down to some method of initializing the mass matrix (initial values, initial values + jitter, or ADVI), paired with a potential function (diag, full, diag_grad). Is there any reason why one couldn’t (or shouldn’t) start with the ADVI diagonal and run adapt_diag_grad, or use SVGD or normalizing flows instead of ADVI to get an initial diagonal estimate?
In the past we used ADVI as the default, but it turned out not to work as well. So the focus now is on Stan-like window adaptation, but with different strategies for how we estimate the covariance (mass matrix) of the posterior during warmup.
But it could also be that we have not spent enough effort testing these ideas! For example, even with quasi-Newton algorithms there are some new ideas that might improve tuning (e.g., Pathfinder).
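For readers who haven’t seen the Stan-like window adaptation mentioned above: the covariance is re-estimated at the end of a sequence of warmup windows that double in length. Below is a simplified sketch of such a schedule. The defaults (1000 warmup iterations, a 75-iteration initial buffer, a 25-iteration base window, a 50-iteration final buffer) are what I recall from the Stan reference manual, so treat them as assumptions; the function name is made up, and PyMC’s own window schedule is not reproduced here.

```python
def window_ends(num_warmup=1000, init_buffer=75, term_buffer=50, base_window=25):
    """Return the warmup iterations at which each covariance-estimation window closes."""
    slow_end = num_warmup - term_buffer    # windows must finish before the final buffer
    ends = []
    start, size = init_buffer, base_window
    while start < slow_end:
        end = start + size
        # If the next (doubled) window would not fit, stretch this one to the end.
        if end + 2 * size > slow_end:
            end = slow_end
        ends.append(end)
        start, size = end, 2 * size
    return ends


print(window_ends())
```

With these defaults the windows close at iterations 100, 150, 250, 450 and 950, so the estimate is refreshed often early on and from progressively more draws later.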
I’ve had success using ADVI in models where the parameter space is somewhat constrained: I was directly modeling probabilities, all observations were small (<0.05), and I used a LogitNormal. Using jitter caused chains to fail, while ADVI ran smoothly. I’d say it’s a nice tool to have, even if it’s not a nice default.
I guess my question is more theoretical than practical – there are several ways to initialize the mass matrix, and several potential functions. Aside from wanting to avoid analysis paralysis, is there any reason why a user shouldn’t be able to try the full combinatorial range of options?
Thank you all for your detailed responses and the interesting discussion! I’ll let you know if I have any more questions.
Hi, do you have any reference recommendations on the advi+adapt_diag initialization method? I recently used jitter+adapt_diag as the initialization method, but it caused the chains to fail, whereas advi+adapt_diag works fine. I would like to know more about the advantages and disadvantages of ADVI as an initialization method. Thanks a lot!