Marylou was a postdoc here at the Flatiron Institute in the same department as me, so I got to see this work evolve up close. The basic idea is to sample around the modes using something like HMC or MALA, use those draws to train a normalizing flow, and then use the flow to generate Metropolis proposals.
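To make the last step concrete, here's a minimal sketch of an independence-Metropolis update where the trained flow acts as the proposal. A fitted Gaussian stands in for the flow (the `sample`/`log_prob` method names are just illustrative, not from any particular flow library), and the target is a toy bimodal mixture:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # toy bimodal target: equal mixture of N(-3, 1) and N(3, 1), up to a constant
    return np.logaddexp(-0.5 * (x - 3.0) ** 2, -0.5 * (x + 3.0) ** 2)

class GaussianProposal:
    """Stand-in for a trained flow: anything with sample() and log_prob()."""
    def __init__(self, mean, std):
        self.mean, self.std = mean, std
    def sample(self):
        return rng.normal(self.mean, self.std)
    def log_prob(self, x):
        # log density up to an additive constant (cancels in the MH ratio)
        return -0.5 * ((x - self.mean) / self.std) ** 2 - np.log(self.std)

def flow_metropolis(x, proposal, n_steps=5000):
    draws = []
    lp_x = log_target(x)
    for _ in range(n_steps):
        y = proposal.sample()
        lp_y = log_target(y)
        # independence-Metropolis ratio: p(y) q(x) / (p(x) q(y))
        log_alpha = lp_y + proposal.log_prob(x) - lp_x - proposal.log_prob(y)
        if np.log(rng.uniform()) < log_alpha:
            x, lp_x = y, lp_y
        draws.append(x)
    return np.array(draws)

# a proposal wide enough to cover both modes lets the chain jump between them
draws = flow_metropolis(0.0, GaussianProposal(0.0, 4.0))
```

Because the proposal is independent of the current state, the chain can hop directly between modes — which is exactly what local HMC or MALA steps can't do on their own.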
@jessegrabowski: Marylou works in molecular dynamics where the different modes typically consist of different stable configurations of molecules. Although a molecule might have hundreds of atoms with three degrees of freedom each (plus momenta), in reality the dimensionality is typically reduced to something manageable like the angle between two functional binding sites in a molecule.
@ckrapu: This is not particularly related to Matt’s work on normalizing flows. Matt was using the flows as a preconditioner. I was very keen on putting this work into Stan and then visited Matt who told me it only worked due to graduate student ascent and couldn’t be generalized easily to new problems.
Your comment about ordering applies to building identifiable mixture models, for which the posterior won’t be multimodal. It doesn’t help for sampling posteriors that are multimodal. High-dimensional Gaussian mixture models are incredibly multimodal, both in the data density and in the posterior density over the parameters. Neural network regression has the same issue.
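The parameter-space multimodality comes from label switching: permuting the component labels leaves the likelihood unchanged, so every mode has K! mirror images. A tiny numpy check (toy data, made-up parameter values) shows the invariance:

```python
import numpy as np

rng = np.random.default_rng(1)
# toy data from a two-component mixture
y = np.concatenate([rng.normal(-2, 1, 50), rng.normal(2, 1, 50)])

def mixture_log_lik(y, mu, sigma, pi):
    # log sum_k pi_k * Normal(y_n | mu_k, sigma_k), summed over observations n
    comp = (np.log(pi) - np.log(sigma)
            - 0.5 * ((y[:, None] - mu) / sigma) ** 2)
    return np.logaddexp.reduce(comp, axis=1).sum()

mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.5])
pi = np.array([0.3, 0.7])

ll = mixture_log_lik(y, mu, sigma, pi)
# swap the labels of both components consistently
ll_swapped = mixture_log_lik(y, mu[::-1], sigma[::-1], pi[::-1])
# ll == ll_swapped: two symmetric posterior modes from one relabeling
```

An ordering constraint on `mu` picks one of these mirror images, which is why it makes the model identifiable without helping with genuinely multimodal targets.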
P.S. The most impressive work I know of on actually fitting normalizing flows to posteriors (at which point you don’t need MCMC, only a bit of importance sampling) is by Agrawal, Sheldon, and Domke: Advances in Black-Box VI: Normalizing Flows, Importance Weighting, and Optimization (arXiv:2006.10343).
They go over the tricks you need to actually make ADVI work. We haven’t gotten around to putting these in Stan yet. I’m way more excited about Justin’s work on normalizing-flow-based fits. He was here visiting Flatiron for 5 months and we didn’t find a model it couldn’t fit. For centered parameterizations of high-dimensional regressions (like a hierarchical IRT 2PL model with funnels as well as additive and multiplicative non-identifiability), it outperformed NUTS. The importance sampling is important because the flows themselves tend to have artifacts, especially if you’re fitting something multimodal with a single flow.
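The importance-sampling cleanup step is simple to sketch: draw from the fitted flow q, weight each draw by p/q, and self-normalize. Here a deliberately imperfect Gaussian stands in for the fitted flow (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def log_p(x):
    # unnormalized log target density: standard normal, up to a constant
    return -0.5 * x ** 2

# imperfect approximation q, standing in for a fitted flow
mu_q, sd_q = 0.5, 1.5
x = rng.normal(mu_q, sd_q, 20_000)
log_q = -0.5 * ((x - mu_q) / sd_q) ** 2 - np.log(sd_q)

# self-normalized importance weights w ∝ p(x) / q(x),
# computed stably on the log scale
log_w = log_p(x) - log_q
w = np.exp(log_w - log_w.max())
w /= w.sum()

# weighted draws correct the flow's bias: E_p[x] ≈ sum_i w_i x_i
post_mean = np.sum(w * x)
```

Since both densities are only needed up to a constant, this works with an unnormalized posterior; in practice you'd also smooth or truncate the weights (e.g. Pareto smoothing) when the flow has heavy-tailed artifacts.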