QuadPotentialFullAdapt

aarcher · June 23, 2022, 10:03pm

Hi,

I’ve gotten larger effective sample sizes and fewer divergences when using jitter+adapt_full than with jitter+adapt_diag. However, I’m a bit nervous about using jitter+adapt_full because of the warning “QuadPotentialFullAdapt is an experimental feature.” Could anyone say more about QuadPotentialFullAdapt and the potential pitfalls of using it?

Thank you in advance!

ricardoV94 · June 24, 2022, 6:34am

If you want to try something more experimental and even better, have a look at jitter+adapt_diag_grad (see pymc.init_nuts in the PyMC dev documentation).

For your general question, perhaps @aseyboldt can clarify.

aseyboldt · June 24, 2022, 7:52am

Full mass matrix adaptation is not dangerous in any way; maybe we should actually get rid of that warning by now. Just keep in mind that it doesn’t scale well to high-dimensional cases (at least in the current implementation).
We might want to change the exact tuning behavior of that method relatively soon, however. I’d hope that it only gets better if we do so, though.
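To give a rough feel for the scaling issue, here is a small back-of-the-envelope sketch. It only counts how many numbers a diagonal versus a dense mass matrix has to estimate, plus the leading-order cost of one dense Cholesky factorization. The function name `mass_matrix_cost` is made up for illustration, and the cost model is textbook linear algebra, not a measurement of the PyMC implementation.

```python
def mass_matrix_cost(n_params):
    """Rough size/cost comparison of diagonal vs. full mass matrices.

    Hypothetical helper for illustration only; not part of PyMC.
    """
    # A diagonal mass matrix needs one variance per parameter.
    diag_entries = n_params
    # A full symmetric matrix has n * (n + 1) / 2 distinct entries.
    full_entries = n_params * (n_params + 1) // 2
    # Leading-order flop count of a dense Cholesky factorization (~n^3 / 3),
    # assuming one factorization per mass matrix update.
    cholesky_flops = n_params**3 // 3
    return diag_entries, full_entries, cholesky_flops


for n in (10, 100, 1000, 10_000):
    diag, full, flops = mass_matrix_cost(n)
    print(f"n={n:>6}: diag={diag:>6}  full={full:>10}  cholesky~{flops:.1e} flops")
```

With only a few hundred to a few thousand tuning draws, the number of entries to estimate in the full case quickly exceeds the number of draws available, which is one way to see why the dense version becomes unattractive in high dimensions.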

Because Ricardo mentioned it:
There are a couple of other mass matrix adaptation methods that I’m working on. For one, jitter+adapt_diag_grad. As far as I currently understand things, this is usually better than our default, at least in terms of effective sample size per gradient evaluation. Just like the default, it also scales well to high dimensions. It still only adapts a diagonal mass matrix, so it can’t really solve cases with high correlations much better than the default, and in a low-dimensional setting jitter+adapt_full might still be better.
For the high-dimensional case there is a completely new mass matrix adaptation algorithm in covadapt (cell 14), though this really is very experimental.
In addition, there’s also a slightly modified implementation of jitter+adapt_diag_grad in nutpie that should converge faster to the same thing that jitter+adapt_diag_grad does.
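To illustrate why a diagonal mass matrix can’t fix strong correlations, here is a toy sketch for a 2D Gaussian posterior. It compares the condition number of the covariance with no preconditioning, after a perfect diagonal rescaling (which leaves the correlation matrix), and after a perfect full adaptation (which leaves the identity). The helper names and the example standard deviations are assumptions for illustration, and the condition number is used only as a rough proxy for how hard the geometry is for HMC.

```python
import math


def cond_2x2(a, b, c):
    """Condition number of the symmetric positive definite matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2
    half_gap = math.hypot((a - c) / 2, b)
    return (mean + half_gap) / (mean - half_gap)


def condition_numbers(rho, sd_x=1.0, sd_y=10.0):
    """Return (raw, after_diag, after_full) condition numbers.

    Hypothetical helper; assumes the adaptation recovers the exact
    variances (diagonal case) or the exact covariance (full case).
    """
    raw = cond_2x2(sd_x**2, rho * sd_x * sd_y, sd_y**2)
    # Diagonal rescaling removes the scales but keeps the correlation.
    after_diag = cond_2x2(1.0, rho, 1.0)  # equals (1 + |rho|) / (1 - |rho|)
    # Full whitening maps the covariance to the identity.
    after_full = 1.0
    return raw, after_diag, after_full


for rho in (0.0, 0.5, 0.9, 0.99):
    raw, diag, full = condition_numbers(rho)
    print(f"rho={rho:<5} raw={raw:9.1f}  diag={diag:7.1f}  full={full:.1f}")
```

As the correlation approaches 1, the diagonal case degrades like (1 + rho) / (1 - rho), while the full case stays at 1 in this idealized setting.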

jessegrabowski · June 24, 2022, 9:02am

Maybe a naive question, but why isn’t there more freedom to “mix and match” the components of initialization methods for NUTS? It seems like all the available initialization methods boil down to some method of initializing the mass matrix (initial values, initial values + jitter, or ADVI), paired with a potential function (diag, full, diag_grad). Is there any reason why one couldn’t (or shouldn’t) start with the ADVI diagonal and run adapt_diag_grad, or use SVGD or Normalizing Flows instead of ADVI to get an initial diagonal estimate?
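To make the “mix and match” grid concrete, here is a small sketch that enumerates the start/adaptation combinations and checks them against the init strings of that form that I believe are built in. The `BUILT_IN` set is my recollection of the pymc.init_nuts docs and is an assumption; it may differ between PyMC versions, and it leaves out options such as "advi" or "map" that don’t fit the start+adapt pattern.

```python
from itertools import product

# Assumption: built-in init strings of the form "[start+]adapt", as I recall
# them from the pymc.init_nuts docs. Check your installed version.
BUILT_IN = {
    "adapt_diag",
    "jitter+adapt_diag",
    "advi+adapt_diag",
    "adapt_full",
    "jitter+adapt_full",
    "jitter+adapt_diag_grad",
}

STARTS = ["", "jitter", "advi"]  # "" means plain initial values
ADAPTS = ["adapt_diag", "adapt_full", "adapt_diag_grad"]


def init_name(start, adapt):
    """Build the init string for a (start, adaptation) pair."""
    return f"{start}+{adapt}" if start else adapt


grid = {init_name(s, a): init_name(s, a) in BUILT_IN for s, a in product(STARTS, ADAPTS)}
missing = sorted(name for name, available in grid.items() if not available)

for name, available in grid.items():
    print(f"{name:<24} {'built in' if available else 'not built in'}")
```

Under that assumption, three of the nine combinations have no ready-made init string, including the ADVI start with adapt_diag_grad.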

junpenglao · June 24, 2022, 9:19am

In the past we used ADVI as the default, but it turned out not to work as well. So the focus now is more on Stan-like window adaptation, but with different strategies for how we estimate the covariance (mass matrix) of the posterior during warmup.
But it could also be that we have not spent enough effort testing these ideas! For example, even with quasi-Newton algorithms there are some new ideas that might improve tuning (e.g., Pathfinder).
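For reference, here is a minimal sketch of the kind of window schedule Stan uses during warmup, as I understand it: a fast initial buffer, a series of doubling “slow” windows at the end of which the mass matrix is re-estimated, and a fast terminal buffer. The function name is made up, and the defaults (75, 50, 25) are Stan’s documented defaults for 1000 warmup steps. This is not PyMC’s actual tuning code, and it skips Stan’s rescaling of the buffers for very short warmups.

```python
def stan_like_windows(num_warmup=1000, init_buffer=75, term_buffer=50, base_window=25):
    """Return the (start, end) slow adaptation windows of a Stan-style warmup.

    Hypothetical helper for illustration; ends are exclusive.
    """
    slow_end = num_warmup - term_buffer  # the terminal buffer starts here
    windows = []
    start, size = init_buffer, base_window
    while start < slow_end:
        end = start + size
        # If the next (doubled) window would not fit before the terminal
        # buffer, stretch the current window to the end of the slow phase.
        if end + 2 * size > slow_end:
            end = slow_end
        windows.append((start, end))
        start = end
        size *= 2
    return windows


for start, end in stan_like_windows():
    print(f"re-estimate mass matrix from draws {start}..{end - 1} ({end - start} draws)")
```

With the defaults this gives windows of 25, 50, 100, 200 and 500 draws. The different strategies mentioned above would change what is estimated inside each window, not necessarily the schedule itself.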

jessegrabowski · June 24, 2022, 9:44am

I’ve had success using ADVI in models where the parameter space is somewhat constrained, for example when directly modeling probabilities where all observations were small (<0.05) with a LogitNormal. Using jitter caused chains to fail, while ADVI went smoothly. I’d say it’s a nice tool to have, even if it’s not a nice default.

I guess my question is more theoretical than practical – there are several ways to initialize the mass matrix, and several potential functions. Aside from wanting to avoid analysis paralysis, is there any reason why a user shouldn’t be able to try the full combinatorial range of options?

aarcher · June 24, 2022, 3:24pm

Thank you all for your detailed responses and the interesting discussion! I’ll let you know if I have any more questions.

qipengchen · July 2, 2022, 2:46am

Hi, do you have any reference recommendations on the advi+adapt_diag initialization method? I recently used jitter+adapt_diag as the initialization method; however, it caused the chains to fail, whereas advi+adapt_diag works fine. I would like to know more about the advantages and disadvantages of ADVI as an initialization method. Thanks a lot!
