I’d like to use minibatch to train some models that are essentially an ‘outer product’ of two vectors (shapes n and m respectively), where the observation is a giant sparse Poisson count matrix.
The dimensions m and n are very large, so I can’t hold the whole thing in memory.
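For concreteness, the structure I mean looks like this (a toy numpy illustration with made-up sizes and priors; the real n and m are far too large to materialise the rate matrix like this):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 4                       # tiny stand-ins for the huge real dims
u = rng.gamma(2.0, 1.0, size=n)   # length-n positive factor
v = rng.gamma(2.0, 1.0, size=m)   # length-m positive factor
rate = np.outer(u, v)             # (n, m) rate matrix: the 'outer product'
y = rng.poisson(rate)             # the (sparse, giant) observed count matrix
```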
- Has anyone gotten minibatch to work for simple MAP estimation? I imagine the find_MAP() function won’t work here, and we’d need a MAP estimator that uses the variational API with, e.g., SGD.
- Do we have any good examples of using minibatch approaches on “wide” models? My intuition is that results will be highly sensitive to the batching shape in such models.
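The kind of SGD-based estimator I have in mind can be sketched in plain numpy (a toy sketch, not PyMC code: it does maximum likelihood on randomly subsampled entries of the count matrix, so no gradient step ever touches the full (n, m) matrix; add log-prior gradients for a true MAP, and all sizes and priors here are made up):

```python
import numpy as np

# y_ij ~ Poisson(u_i * v_j); optimise log(u) and log(v) by SGD on
# randomly subsampled entries. Add log-prior terms for a true MAP.
rng = np.random.default_rng(0)
n, m = 100, 150
u_true = rng.gamma(2.0, 0.5, size=n)
v_true = rng.gamma(2.0, 0.5, size=m)
y = rng.poisson(np.outer(u_true, v_true))  # stand-in for the sparse data

log_u = np.zeros(n)
log_v = np.zeros(m)
lr, batch = 0.02, 256
mse0 = np.mean((np.exp(log_u)[:, None] * np.exp(log_v)
                - np.outer(u_true, v_true)) ** 2)
for _ in range(3000):
    i = rng.integers(0, n, size=batch)          # sampled row indices
    j = rng.integers(0, m, size=batch)          # sampled column indices
    rate = np.exp(log_u[i] + log_v[j])
    g = y[i, j] - rate            # d/d log(u_i) of [y*log(rate) - rate]
    np.add.at(log_u, i, lr * g)   # accumulate gradients per index
    np.add.at(log_v, j, lr * g)
mse1 = np.mean((np.exp(log_u)[:, None] * np.exp(log_v)
                - np.outer(u_true, v_true)) ** 2)
assert mse1 < mse0  # fit improved over the flat initialisation
```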
One idea I had is to somehow use hierarchical parameters to model the “mean field” effect of the parameters and observations that are “not in memory” at a given moment. It’s not obvious to me how to do the modeling and also how to do the inference with the machinery that we have in place currently.
If you have a hyper-prior in your model for the two vectors, then it should capture the effect of the parameters and observations that are “not in memory” at a given moment.
However, I am not sure it is easy to use the minibatch to index the random sample, as you will have a different batch size for the two vectors, right? Maybe it is easier to create your own minibatch generator (example here).
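A custom generator along those lines could look like this (a hypothetical sketch in plain numpy: it samples row and column indices independently, so the two vectors get different batch sizes, and yields the matching sub-block of the observation matrix):

```python
import numpy as np

def minibatch_generator(counts, rows_per_batch, cols_per_batch, seed=None):
    """Yield (row_idx, col_idx, block) triples, sampling rows and columns
    independently so the two vectors can have different batch sizes."""
    rng = np.random.default_rng(seed)
    n, m = counts.shape
    while True:
        row_idx = rng.choice(n, size=rows_per_batch, replace=False)
        col_idx = rng.choice(m, size=cols_per_batch, replace=False)
        # np.ix_ builds the open mesh, so we get the full sub-block.
        yield row_idx, col_idx, counts[np.ix_(row_idx, col_idx)]

# Usage: 2 rows and 3 columns per batch from a 4x5 matrix.
counts = np.arange(20).reshape(4, 5)
gen = minibatch_generator(counts, rows_per_batch=2, cols_per_batch=3, seed=0)
rows, cols, block = next(gen)
assert block.shape == (2, 3)
```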
I don’t understand your setup yet. What exactly is so large that it doesn’t fit in memory? If it is just the observations, then you could also move the likelihood function into a custom op that calls out to MPI or similar to do the computation in a distributed manner.
I guess a minibatch approach would be faster, but I don’t think we have a minibatch algorithm for find_MAP yet (not sure how difficult it would be to add this).
It is primarily the observations that need minibatching. However, the model has intermediate quantities with large dimensions (e.g. the outer product of the m- and n-dimensional vectors) that are also big and benefit from minibatching.
I see. If you want to do that with minibatches, then you’d probably have to do some work on your own. Maybe @ferrine has an idea how we could reuse some of the optimisation code from ADVI for this.
Have you tried to avoid computing the outer product? In many cases you can get around that by using associativity of matrix multiplications.
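For example (a generic numpy illustration, with made-up sizes): if the outer product only ever appears multiplied against something else, regrouping the product means the (m, n) matrix is never materialised:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1000, 2000, 3
u = rng.standard_normal(m)
v = rng.standard_normal(n)
X = rng.standard_normal((n, k))

# Naive: materialise the (m, n) outer product, then multiply: O(m*n) memory.
naive = np.outer(u, v) @ X
# Regrouped: (u v^T) X = u (v^T X); only an (m,) and a (k,) vector are formed.
fast = u[:, None] * (v @ X)[None, :]
assert np.allclose(naive, fast)
```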