The model involves $5 \times 10^6$ iterations of matrix multiplications. I could use fewer iterations if I could sub-sample and provide a stochastic gradient to pymc (but I have heard that NUTS can't use stochastic gradients). I could also trade accuracy for speed by approximating the likelihood. On a GPU with 4GB of memory, I could potentially fit $2 \times 10^5$ replicas of the model concurrently if I use 2GB to hold the data… Since all replicas are independent, the overall log-likelihood is the sum of the log-likelihoods of all the replicas.
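For context, here is a minimal sketch of how the summed log-likelihood falls out of batching in PyMC3, assuming a toy Normal likelihood and hypothetical sizes `n_replicas` / `n_obs` (the real model would swap in its own likelihood):

```python
import numpy as np
import pymc3 as pm

n_replicas = 1000  # stand-in for the 2e5 target; hypothetical size
n_obs = 50         # observations per replica; hypothetical

# Fake data: one row per independent replica (placeholder for the real data).
data = np.random.randn(n_replicas, n_obs)

with pm.Model() as model:
    # One parameter per replica, batched along the first axis.
    mu = pm.Normal("mu", mu=0.0, sigma=1.0, shape=(n_replicas, 1))
    # PyMC3 sums the elementwise log-likelihood over every entry of
    # `observed`, which is exactly the sum over independent replicas.
    pm.Normal("obs", mu=mu, sigma=1.0, observed=data)
    trace = pm.sample()  # NUTS sees one big joint model
```

Because `observed` holds one row per replica and PyMC3 sums the elementwise log-probability over every entry, NUTS sees a single joint density whose gradient can be computed for all replicas in one batched pass.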
Is tree building the main obstacle to implementing NUTS on a GPU?
Can I provide methods that dynamically scale the number of replicas fitted by NUTS? I was thinking about wrapping Chainer calculations in a Theano operation and using a buffer to hold the growing/shrinking set of replicas (rough sketch below).
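A sketch of the wrapping idea, assuming a hypothetical callable `external_logp` that stands in for the Chainer forward pass; this is not a complete solution, since NUTS also needs `grad()` defined on the Op (e.g. a second Op delegating to Chainer's backward pass):

```python
import numpy as np
import theano
import theano.tensor as tt

class ReplicaLogp(theano.Op):
    """Sketch: defer the batched log-likelihood to an external backend
    (e.g. Chainer) behind a Theano Op, reusing a buffer that is
    re-allocated only when the replica count changes."""

    itypes = [tt.dmatrix]  # (n_replicas, n_params)
    otypes = [tt.dscalar]  # summed log-likelihood over all replicas

    def __init__(self, external_logp):
        # `external_logp` is an assumed callable: ndarray -> float,
        # e.g. a Chainer forward pass followed by float(y.data).
        self.external_logp = external_logp
        self._buffer = None  # grows/shrinks with the replica count

    def perform(self, node, inputs, output_storage):
        (params,) = inputs
        # Re-allocate only when the number of replicas has changed.
        if self._buffer is None or self._buffer.shape != params.shape:
            self._buffer = np.empty_like(params)
        np.copyto(self._buffer, params)
        output_storage[0][0] = np.asarray(self.external_logp(self._buffer))
```

Usage with a dummy log-likelihood, just to show the Op compiles into a graph:

```python
params = tt.dmatrix("params")
logp = ReplicaLogp(lambda p: float((p ** 2).sum()))(params)
f = theano.function([params], logp)
```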