Minibatch and NUTS


#1

Am I correct in saying that the minibatch mode doesn’t work with NUTS?


#2

I am actually not sure if plain minibatching would work with any MCMC method for that matter. Detailed balance would likely be messed up if different parts of the data are exposed to the sampler at different iterations.
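One way to spell out the concern, as a rough sketch rather than a proof: assume a random-walk Metropolis sampler with a symmetric proposal. The symbols $\pi$ (full-data posterior), $\pi_B$ (the target implied by minibatch $B$), $N$ (number of data points), and $P$, $P_B$ (transition kernels) are notation introduced here for illustration only.

```latex
% Full-data Metropolis: this acceptance rule makes the kernel P reversible w.r.t. \pi.
\alpha(\theta \to \theta') = \min\!\left(1, \frac{\pi(\theta')}{\pi(\theta)}\right),
\qquad
\pi(\theta)\, P(\theta \to \theta') = \pi(\theta')\, P(\theta' \to \theta).

% Naive minibatching: the ratio is computed from the batch B only.
\alpha_B(\theta \to \theta') = \min\!\left(1, \frac{\pi_B(\theta')}{\pi_B(\theta)}\right),
\qquad
\pi_B(\theta) \propto p(\theta) \prod_{i \in B} p(y_i \mid \theta)^{N/|B|}.

% Each P_B is reversible w.r.t. its own \pi_B, not w.r.t. \pi.
% The sampler actually run is the mixture over batches,
P = \mathbb{E}_B\!\left[ P_B \right],
% and a mixture of kernels with different invariant distributions
% does not in general leave \pi invariant.
```

So every individual move can be reversible and the chain can still target something other than the full-data posterior.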

That being said, what can definitely work is step-by-step updating of the posterior with small subsets of the data. Starting with a prior, you expose a small subset of the data to get an intermediate posterior. This intermediate posterior then becomes your prior when the next subset of data is seen, and so on. The practical problem is specifying the intermediate posterior as a prior: it is hard to do so without making assumptions. One potential way is to assume that the new prior lies in the same family as the original prior; then you can compute moments of the intermediate posterior and use them to specify the new prior. I do not know if PyMC3 has an easy way of doing this; there was a similar discussion with ADVI here:
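As a minimal sketch of the posterior-becomes-prior idea, here is the one case where no approximation is needed: a conjugate Beta prior with Bernoulli data. The data, batch size, and function name are made up for illustration. For a non-conjugate model the intermediate posterior would have to be approximated (for example by moment matching), and that step is not shown here.

```python
import random


def update_beta(alpha, beta, batch):
    """Conjugate update of a Beta(alpha, beta) prior with 0/1 observations."""
    successes = sum(batch)
    return alpha + successes, beta + len(batch) - successes


random.seed(0)
# Hypothetical data: 1000 coin flips with success probability 0.3.
data = [1 if random.random() < 0.3 else 0 for _ in range(1000)]

# Sequential updating: each intermediate posterior is the prior for the next batch.
alpha, beta = 1.0, 1.0
batch_size = 100
for start in range(0, len(data), batch_size):
    alpha, beta = update_beta(alpha, beta, data[start:start + batch_size])

# Reference: a single update using all of the data at once.
full_alpha, full_beta = update_beta(1.0, 1.0, data)

posterior_mean = alpha / (alpha + beta)
```

In this conjugate setting the batched result and the full-data result coincide exactly; the difficulty described above only appears once the intermediate posterior has no closed form.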


#3

There is some work in the literature on mini-batch MCMC, but suffice it to say you’d need to write your own inference algorithm for that. Within PyMC, your best bet for a large data set is minibatch variational inference.
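The key ingredient of minibatch variational inference is rescaling the minibatch log-likelihood by N / batch_size so that it estimates the full-data log-likelihood without bias. As far as I know, this is what the `total_size` argument on an observed variable is for in PyMC3. The plain-Python sketch below only illustrates that rescaling on a made-up one-coefficient logistic regression; it is not PyMC code, and all names and numbers are assumptions.

```python
import math
import random


def log_lik(w, xs, ys):
    """Bernoulli log-likelihood of a one-coefficient logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-w * x))
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


random.seed(1)
n, batch_size = 1000, 50
# Hypothetical data simulated with a true coefficient of 0.8.
xs = [random.gauss(0.0, 1.0) for _ in range(n)]
ys = [1 if random.random() < 1.0 / (1.0 + math.exp(-0.8 * x)) else 0 for x in xs]

w = 0.5  # arbitrary parameter value at which to evaluate the likelihood
full = log_lik(w, xs, ys)

# Rescale each minibatch log-likelihood by n / batch_size.
scaled = [
    (n / batch_size) * log_lik(w, xs[i:i + batch_size], ys[i:i + batch_size])
    for i in range(0, n, batch_size)
]
# Averaged over all batches, the rescaled estimates recover the full value.
mean_scaled = sum(scaled) / len(scaled)
```

Any single rescaled batch is noisy, but its expectation over batches matches the full-data term, which is enough for stochastic gradient optimisation of a variational objective.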

What I was referring to in the linked post is something more like Streaming Variational Bayes. However that’s not available in any of the major probabilistic programming languages. The problem with SVB is that the inner loop (run on each mini batch) requires convergence of your optimiser, which in practice is not as fast as just training an SGD optimiser against the entire dataset.

Can you share some details about the model you’re trying to fit?


#4

Speaking of this, I remember I had this Twitter conversation with Dan Simpson last year:

Also, pseudomarginal methods tend to take a geometrically ergodic algorithm and make it no longer geometrically ergodic. And not GE = not useful because there’s no central limit theorem. (That’s not strictly true. But it’s almost true. Most of the time you don’t luckily land in one of the bigger classes of non-GE Markov chains that still satisfy a CLT)

The papers recommended were: Noisy Markov Chain Monte Carlo: https://arxiv.org/abs/1403.5496. Some theoretical ideas are hiding here: https://arxiv.org/abs/1205.6857


#5

I don’t see why subsampling would break detailed balance. Surely as long as individual moves are reversible, we’re OK in this regard?

The system is a hierarchical logistic regression model. I have a lot of data (about 200k samples, a single random effect, i.e. a grouping variant, and quite a few fixed effects). Right now I’m using ADVI, but I’m aware that it ignores correlations in the posterior distribution.
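To make "ignores correlations" concrete, here is a small sketch under a strong simplifying assumption: the true posterior is a 2D Gaussian with unit variances and correlation `rho` (a hypothetical value). For a Gaussian target, the best fully factorised Gaussian under KL(q || p), the direction mean-field ADVI minimises, matches the diagonal of the precision matrix, so the marginal variances come out too small. A real hierarchical logistic regression posterior is not Gaussian, so this shows the mechanism only.

```python
rho = 0.9  # assumed posterior correlation between two parameters

# Target covariance is [[1, rho], [rho, 1]]; invert it by hand (2x2 case).
det = 1.0 - rho ** 2
precision = [
    [1.0 / det, -rho / det],
    [-rho / det, 1.0 / det],
]

true_marginal_var = 1.0
# Optimal mean-field variance is the inverse of the precision diagonal.
mean_field_var = 1.0 / precision[0][0]  # equals 1 - rho**2

shrinkage = mean_field_var / true_marginal_var
```

With `rho = 0.9` the mean-field variance is roughly a fifth of the true marginal variance, which is why posterior intervals from mean-field ADVI can look overconfident when parameters are strongly correlated.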


#6

Hi erlendd - have you looked at FullRankADVI, or normalizing flows to add structure to your posterior? You can do something like NFVI with some number of Householder transforms (plus scale/loc) if you don’t want to go all the way to FullRankADVI (or if you want something even more complex).
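A rough illustration of why a Householder transform adds correlation on top of a scale/loc (mean-field) approximation: the reflection H = I - 2 v v^T / (v^T v) is orthogonal, so applying it to a diagonal Gaussian rotates the covariance to H D H^T, which generally has off-diagonal terms. This is plain Python with made-up scales and reflection vector, not the PyMC3 flow implementation.

```python
def householder(v):
    """Return the reflection matrix H = I - 2 v v^T / (v^T v) as nested lists."""
    norm_sq = sum(x * x for x in v)
    n = len(v)
    return [
        [(1.0 if i == j else 0.0) - 2.0 * v[i] * v[j] / norm_sq for j in range(n)]
        for i in range(n)
    ]


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    return [list(row) for row in zip(*a)]


# Mean-field part: independent components with standard deviations 2 and 1.
scales = [2.0, 1.0]
diag_cov = [[scales[0] ** 2, 0.0], [0.0, scales[1] ** 2]]

# One Householder reflection with a hypothetical vector v.
H = householder([1.0, 0.5])
flow_cov = matmul(matmul(H, diag_cov), transpose(H))

# Orthogonality check: H H^T should be the identity.
identity_check = matmul(H, transpose(H))
```

Here `flow_cov` comes out as approximately [[2.08, 1.44], [1.44, 2.92]]: the total variance (trace) is unchanged, but the two components are now correlated. Stacking more reflections lets the approximation reach more general covariance structures.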


#7

Actually I’ve always had problems with FullRankADVI - it runs and about 1/3 of the way through starts giving NaNs.


#8

I’ve found that reducing the learning rate and/or examining the scaling of the parameters can help with the -inf/NaN. Agreed in general though - it often goes off track, and that’s why I tend to use NFVI with some number of HH transforms to move between totally correlation-free posteriors and full rank ones.
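A toy illustration of why both remedies can matter, using plain gradient descent on a quadratic rather than ADVI itself: when one parameter has much larger curvature than another, a learning rate that is fine for the flat direction blows up in the steep one and eventually produces NaN. The curvatures, learning rates, and function name below are all made up for the example.

```python
import math


def gradient_descent(curvatures, learning_rate, steps=1000):
    """Minimise 0.5 * sum(c_i * x_i**2) from x_i = 1 with plain gradient descent."""
    x = [1.0 for _ in curvatures]
    for _ in range(steps):
        # The gradient of 0.5 * c * x**2 with respect to x is c * x.
        x = [xi - learning_rate * (c * xi) for xi, c in zip(x, curvatures)]
    return x


badly_scaled = [100.0, 1.0]  # one direction is 100x steeper than the other

# Too large a step for the steep direction: overflows to inf, then NaN.
diverged = gradient_descent(badly_scaled, learning_rate=0.05)

# Remedy 1: reduce the learning rate.
smaller_rate = gradient_descent(badly_scaled, learning_rate=0.005)

# Remedy 2: rescale the parameters so both directions have similar curvature.
rescaled = gradient_descent([1.0, 1.0], learning_rate=0.05)
```

Reducing the learning rate fixes the divergence but makes the flat direction converge slowly, while rescaling lets the original learning rate work for both directions.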


#9

Do you have some examples of using NFVI? The other day I was just discussing with @ferrine that we don’t have a good practical example of such.


#10

Hi @junpenglao, I don’t have any illustrative examples that would be really ideal for teaching people about the areas that NFVI is useful in. Mostly just really domain-specific use-cases :S Sorry.

Often I find myself just using it in a situation kinda like the OP’s - where I want to use ADVI, but FullRank is unstable and I anticipate only needing a ‘bit’ of correlation (in some hazy, unspecified way), or I feel wild and adventurous and want to try using planar flows to model bimodality (note: this never really works). I imagine a good tutorial might have to be some kinda standard problem (GLM? Hierarchical model?) that we engineer back-to-front to be mildly ill-suited to vanilla ADVI, and walk people through the differing complexity of approximations available and show the way it moves us toward a more exact inference like NUTS.


#11

This is off base from the main discussion here, but I did a few basic experiments of my own for mean-field vs full-rank ADVI earlier this year: https://gist.github.com/ColCarroll/d673a3af7169bd713bcbdb9445d4a543

Dan Lee, and then Dan Simpson, hopped in a discussion about it as well, which I found helpful: https://twitter.com/colindcarroll/status/967078763384201216


#12

Re: NaNs with FullRank. I think this is because the Cholesky decomposition is not stabilised in the current implementation. A notebook with a failing case like that would be great, so that I can confirm that is indeed the issue.
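For readers unfamiliar with the failure mode, here is a small sketch of one common kind of stabilisation: adding a small "jitter" to the diagonal before factorising. Whether this is the fix that would apply to the FullRank implementation is an assumption on my part; the matrix, jitter value, and helper names are invented for illustration.

```python
import math


def cholesky(a):
    """Plain Cholesky factorisation; raises if a pivot is not positive."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = a[i][i] - s
                if pivot <= 0.0:
                    raise ValueError("matrix is not numerically positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


def add_jitter(a, eps):
    """Return a copy of the matrix with eps added to the diagonal."""
    n = len(a)
    return [[a[i][j] + (eps if i == j else 0.0) for j in range(n)] for i in range(n)]


# A perfectly correlated (rank-deficient) covariance matrix.
cov = [[1.0, 1.0], [1.0, 1.0]]

try:
    cholesky(cov)
    plain_ok = True
except ValueError:
    plain_ok = False  # the unstabilised factorisation fails here

# With a small diagonal jitter the factorisation goes through.
stabilised = cholesky(add_jitter(cov, 1e-6))
```

In a library that works in floating point without the explicit pivot check, the same situation would typically surface as a square root of a non-positive number, which is one way NaNs can enter the optimisation.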