I wouldn’t expect multi-GPU support to work well without spending quite a bit of time profiling, and only if the problem is sufficiently large and has a structure that makes it easy to divide the logp gradient evaluations cleanly. Even in the single-GPU case it is not at all obvious that NUTS will run faster on a GPU than on a CPU; it really depends on the model.
Documentation for using Theano with multiple GPUs is here. PyMC3 variables are subclasses of Theano variables, so you can call var.transfer on them the same way you would on ordinary Theano vars (at least as far as I know; I don’t have multiple GPUs to test this). You can decide where the observations are stored by creating a theano.shared with target='whatever'. A rough sketch is below.
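Something along these lines might work; this is an untested sketch (again, I don’t have multiple GPUs), and it assumes you have already mapped two GPU contexts in your Theano flags, e.g. THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1". The context names dev0/dev1 are just the ones from the Theano docs.

```python
import numpy as np
import theano
import theano.tensor as tt

# Put the observed data on the first GPU context via target=.
data = theano.shared(np.random.randn(1024, 1024).astype('float32'),
                     target='dev0')

# A second shared variable living on the other GPU.
weights = theano.shared(np.random.randn(1024, 1024).astype('float32'),
                        target='dev1')

x = tt.matrix('x')

# .transfer() moves a variable (or an intermediate result) to another
# context inside the graph. PyMC3 variables are Theano variables, so the
# same method should apply to them.
y0 = tt.dot(x, data)                       # computed on dev0
y1 = tt.dot(x.transfer('dev1'), weights)   # computed on dev1

f = theano.function([x], [y0, y1])
```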