What are the differences between NUTS and ADVI?

Adding my two cents, having used the PyMC3 implementations of both on a range of nonstandard problems:

  1. When it is said that ADVI is “less accurate”, this is usually a general comment about the inability of variational inference methods to properly characterize posterior variance. For example, the ADVI posterior for a regression coefficient will often be too narrow or overly concentrated. This is because VI methods minimize the reverse KL divergence, a loss that asymmetrically penalizes placing probability mass where the true posterior has little, and so favors too-narrow approximations (see the first sketch after this list).

  2. I have often found, when working with weird models whose posterior geometries are so difficult that all MCMC methods fail, that initializing with ADVI first can help get the samplers to work. I suspect this is because the default initial values for the MCMC sampler may be poorly chosen, leading to horrible numerical issues for Hamiltonian Monte Carlo; in many cases ADVI appears to be somewhat more robust (see the second sketch after this list).

  3. Some problems are just too big for NUTS (even with a GPU), and ADVI is the only option for fitting the model. I’ve used ADVI + GPU to train deep convolutional autoencoders with 10 million+ parameters and Bayesian regularization, using the minibatched implementation of ADVI within PyMC3 (see the third sketch after this list).
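To make point 1 concrete, here is a minimal sketch in PyMC3 that fits the same toy regression (the data and model are made up purely for illustration) with both NUTS and ADVI, then compares the posterior spread of the coefficients. On correlated or otherwise non-trivial posteriors, the ADVI standard deviations typically come out smaller:

```python
import numpy as np
import pymc3 as pm

# Hypothetical toy regression data, for illustration only
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.5, size=100)

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("obs", mu=pm.math.dot(X, beta), sigma=sigma, observed=y)

    # Full MCMC posterior via NUTS
    trace_nuts = pm.sample(1000, tune=1000)

    # Mean-field variational approximation via ADVI
    approx = pm.fit(n=30000, method="advi")
    trace_advi = approx.sample(1000)

# ADVI's posterior standard deviations are often the narrower ones
print("NUTS beta sd:", trace_nuts["beta"].std(axis=0))
print("ADVI beta sd:", trace_advi["beta"].std(axis=0))
```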
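For point 2, PyMC3 exposes ADVI-based initialization directly through the `init` argument of `pm.sample`, so the warm-start described above is a one-liner (shown here reusing a model context like the one in the first sketch):

```python
with model:
    # Run ADVI first and use its result to initialize NUTS:
    # the starting point and mass matrix come from the variational
    # approximation instead of the default jittered initialization.
    trace = pm.sample(1000, tune=1000, init="advi+adapt_diag")
```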
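And for point 3, a minimal sketch of minibatched ADVI on a large dataset; the array shapes and the simple regression model here are placeholders (this is not the autoencoder itself), but the `pm.Minibatch` + `total_size` pattern is the same:

```python
import numpy as np
import pymc3 as pm

# Placeholder "large" dataset, for illustration only
X_big = np.random.normal(size=(1_000_000, 10))
y_big = np.random.normal(size=1_000_000)

# Minibatch wrappers stream a random subset of rows at each step
X_mb = pm.Minibatch(X_big, batch_size=500)
y_mb = pm.Minibatch(y_big, batch_size=500)

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=10)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    # total_size rescales the minibatch likelihood to the full dataset
    pm.Normal("obs", mu=pm.math.dot(X_mb, beta), sigma=sigma,
              observed=y_mb, total_size=len(y_big))

    approx = pm.fit(n=50000, method="advi")
```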

As a short summary: I think the worst use case for ADVI is a small dataset with a complicated model structure, while it is at its best on very large models and datasets that require minibatched computation.
