Adaptive MC w/ normalizing flows

A new paper turned up in PNAS about normalizing flows in Monte Carlo.

May be of interest?

Cool! It seems related (in spirit) to the NeuTra HMC paper from a few years ago ([1903.03704] NeuTra-lizing Bad Geometry in Hamiltonian Monte Carlo Using Neural Transport). My 2 cents: attacking difficult posterior distributions this way is really helpful when you have moderate dimensionality (100 < d < 10000) and high correlations. My (limited) understanding is that the compute cost incurred by the NF is at least linear in d, so the payoff might not always be worth it.
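
For anyone unfamiliar with NeuTra, the idea is roughly: instead of sampling x directly, you sample z and push it through a flow x = f(z), so HMC targets the pulled-back density log p(f(z)) + log |det df/dz|. Here is a minimal sketch of that pullback, with a trivial affine map standing in for a trained flow (the names are illustrative, not any library’s API):

```python
import numpy as np

def neutra_log_prob(log_p, f, log_det_jac):
    """Pulled-back target for NeuTra-style sampling: run HMC/NUTS on z,
    where x = f(z) and the flow contributes its log-Jacobian."""
    def log_prob_z(z):
        return log_p(f(z)) + log_det_jac(z)
    return log_prob_z

# Toy example: an affine "flow" x = mu + sigma * z stands in for a trained NF.
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 3.0])
f = lambda z: mu + sigma * z
log_det_jac = lambda z: np.sum(np.log(sigma))
log_p = lambda x: -0.5 * np.sum(((x - mu) / sigma) ** 2)  # toy unnormalized target

log_prob_z = neutra_log_prob(log_p, f, log_det_jac)
print(log_prob_z(np.zeros(2)))  # in z-space the target is standard normal (up to a constant)
```

If the flow is trained well, the z-space geometry is close to a standard normal, which is exactly what HMC likes.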

Unfortunately, this is also kind of a mismatch with the problem space many advanced modelers face: either you have something with d >> 10000, like regression or Bayesian machine learning with many groups/covariates, or you have small d with extremely difficult posterior geometry from highly nonlinear models (ODEs/PDEs from physics, ecology, etc.), in which case you can use Riemannian HMC.

Thanks for the good remarks. I agree that it’s not a giant hammer to squash all problems. But I suspect that in complex models it could help in cases where reparametrizations are difficult or nonexistent.

Slightly tangential question, but how often do you think sampling difficulty arises from multi-modality, as opposed to difficult or degenerate posterior geometry? In my own work (economics) I see the latter much more often than the former, although I admittedly don’t work much with mixture models.

Multimodality is definitely less of an issue than it used to be, since features like the ordered transform and the mixture classes have made label switching much less common. Variational inference has also been pretty helpful, since it typically just picks a single mode for the approximate posterior, though that’s more or less sweeping the issue under the rug so it doesn’t show up in the convergence diagnostics.
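
For concreteness, the ordered transform just reparameterizes, say, the mixture component locations as a strictly increasing vector, which removes the label-switching symmetry. A minimal sketch of the idea (not any particular library’s implementation):

```python
import numpy as np

def ordered_transform(z):
    """Map unconstrained reals z to a strictly increasing vector.

    The first element is free; each later element adds exp(z[i]) > 0,
    so the output is always sorted. Constraining mixture locations this
    way removes the label-switching modes from the posterior.
    In a real model you also add the log-Jacobian, which is sum(z[1:]).
    """
    out = np.empty_like(z, dtype=float)
    out[0] = z[0]
    out[1:] = z[0] + np.cumsum(np.exp(z[1:]))
    return out

print(ordered_transform(np.array([0.5, -1.0, 0.0])))  # e.g. [0.5, 0.868, 1.868]
```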

Marylou was a postdoc here at Flatiron Institute in the same department as me, so I got to see this work evolve up close. The basic idea is that it samples around the modes using something like HMC or MALA, then uses the draws to train a normalizing flow, which is then used to generate Metropolis proposals.
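
In rough, self-contained Python, that loop looks something like the sketch below. To keep it runnable, a fitted multivariate Gaussian stands in for the trained normalizing flow and plain random-walk Metropolis stands in for HMC/MALA, but the structure is the same: local exploration around the modes, fit a global proposal q to the pooled draws, then independence Metropolis with acceptance ratio p(x_new) q(x_old) / (p(x_old) q(x_new)).

```python
import numpy as np
from scipy import stats

def local_mcmc(log_p, x0, n_steps=500, step=0.1, rng=None):
    """Local exploration around a mode (random-walk Metropolis standing in for HMC/MALA)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    draws = []
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal(x.shape)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        draws.append(x.copy())
    return np.array(draws)

def global_metropolis(log_p, draws, n_steps=1000, rng=None):
    """Independence Metropolis with a proposal q fitted to the local draws.

    A multivariate Gaussian stands in for the trained normalizing flow; the
    acceptance ratio p(x') q(x) / (p(x) q(x')) is the same in either case.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = draws.shape[1]
    q = stats.multivariate_normal(draws.mean(axis=0), np.cov(draws.T) + 1e-6 * np.eye(d))
    x = draws[-1]
    lp, lq = log_p(x), q.logpdf(x)
    samples = []
    for _ in range(n_steps):
        prop = q.rvs(random_state=rng)
        lp_prop, lq_prop = log_p(prop), q.logpdf(prop)
        if np.log(rng.uniform()) < (lp_prop - lp) - (lq_prop - lq):
            x, lp, lq = prop, lp_prop, lq_prop
        samples.append(np.copy(x))
    return np.array(samples)

# Toy usage: bimodal target, one local chain per known mode, then global sampling.
log_p = lambda x: np.logaddexp(-0.5 * np.sum((x + 3.0) ** 2), -0.5 * np.sum((x - 3.0) ** 2))
pooled = np.vstack([local_mcmc(log_p, m) for m in ([-3.0, -3.0], [3.0, 3.0])])
samples = global_metropolis(log_p, pooled)
```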

@jessegrabowski: Marylou works in molecular dynamics where the different modes typically consist of different stable configurations of molecules. Although a molecule might have hundreds of atoms with three degrees of freedom each (plus momenta), in reality the dimensionality is typically reduced to something manageable like the angle between two functional binding sites in a molecule.

@ckrapu: This is not particularly related to Matt’s work on normalizing flows. Matt was using the flows as a preconditioner. I was very keen on putting this work into Stan and then visited Matt, who told me it only worked due to graduate student ascent and couldn’t easily be generalized to new problems.

Your comment about ordering applies to building easy mixture models, for which the posterior won’t be multimodal. This doesn’t help for sampling multimodal posteriors. High-dimensional Gaussian mixture models are also incredibly multimodal both in terms of data density and posterior density over parameters. Neural network regression has the same issue.

P.S. The most impressive work I know of on actually fitting normalizing flows to posteriors (at which point you don’t need MCMC, only a bit of importance sampling) is by Agrawal, Sheldon, and Domke: [2006.10343] Advances in Black-Box VI: Normalizing Flows, Importance Weighting, and Optimization

They go over the tricks you need to actually make ADVI work. We haven’t gotten around to putting these in Stan yet. I’m way more excited about Justin’s work on normalizing flow-based fits. He was here visiting Flatiron for 5 months and we didn’t find a model it couldn’t fit. For centered parameterizations of high-dimensional regressions (like a hierarchical IRT 2PL model with funnels as well as additive and multiplicative non-identifiability), it outperformed NUTS. The importance sampling matters because the flows themselves tend to have artifacts, especially if you’re fitting something multimodal with a single flow.
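
For anyone who wants the gist of the importance-sampling step, it is conceptually simple: weight each flow draw by the ratio of the (unnormalized) target density to the flow’s own density, then self-normalize and resample. A minimal sketch with illustrative argument names, not the implementation from either paper:

```python
import numpy as np

def flow_importance_resample(log_p_vals, log_q_vals, flow_samples, n_out=1000, rng=None):
    """Correct flow artifacts by importance weighting the flow's draws.

    log_p_vals: unnormalized target log density at each flow draw
    log_q_vals: the flow's own log density at the same draws
    Returns draws resampled in proportion to self-normalized weights p/q.
    """
    rng = np.random.default_rng() if rng is None else rng
    log_w = log_p_vals - log_q_vals
    log_w = log_w - log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    w = w / w.sum()
    idx = rng.choice(len(flow_samples), size=n_out, replace=True, p=w)
    return flow_samples[idx]
```

If the flow misses mass somewhere, the weights become heavy-tailed, which is the usual signal that the fit (or a single flow for a multimodal target) isn’t good enough.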
