I thought the Hamiltonian should be the negative of the log-likelihood, but in this post there is no need for the negative sign:
pm.Potential("M2 ", pm.Binomial.dist(p = beta_p_M1, n = N).logp(y_obs))
Does using pm.Potential significantly slow down the model when I use NUTS on the GPU?
PyMC3 takes the negative during HMC, so you can keep the log-likelihood as it is.
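For example, here is a minimal self-contained sketch of that pattern (the Beta prior, N, and the synthetic y_obs are made up purely to illustrate the sign convention):

```python
import numpy as np
import pymc3 as pm

# made-up data, purely for illustration
N = 100
y_obs = np.random.binomial(n=N, p=0.3, size=50)

with pm.Model() as model:
    beta_p_M1 = pm.Beta("beta_p_M1", alpha=1.0, beta=1.0)
    # pass the log-likelihood with its usual (positive) sign;
    # the sampler negates it internally when building the Hamiltonian
    pm.Potential("M2", pm.Binomial.dist(p=beta_p_M1, n=N).logp(y_obs).sum())
    trace = pm.sample(1000, tune=1000)
```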
Also, you should not use NUTS on the GPU in general, as the sampling itself is done on the CPU (leapfrog, tree building, slice sampling). So unless you have expensive matrix computations in your model, using the CPU only is faster.
The model has 5 \times 10^6 iterations of matrix multiplications. I could use fewer iterations if I could sub-sample and provide a stochastic gradient to PyMC3 (but I heard that NUTS can’t use stochastic gradients). I could also trade throughput for speed by approximating the likelihood. If I use the GPU, I could potentially fit 2 \times 10^{5} replicas of the model concurrently on a GPU with 4GB, using 2GB to hold the data… Since all the models are independent, the overall log-likelihood is the sum of the log-likelihoods of all the replicas.
Is tree building the main obstacle to implementing NUTS on the GPU?
Can I provide methods that dynamically scale the number of replicas that I fit with NUTS? I was thinking about wrapping Chainer calculations in a Theano op and using a buffer to hold the growing/shrinking number of replicas.
You need careful error correction to do stochastic-gradient HMC; I have no experience of how difficult or useful it would be.
Not sure I understand your use case: what do you mean by all models?
The problem is that tree building is recursive and difficult to implement in Theano (which builds a static graph), so we moved it into Python land and thus it runs on the CPU.
What do you mean by replicas?
I will use fitting Y = X\beta as an example.
- Both X and Y are data.
- \beta is the parameter that I am trying to find.
- \mathcal{L}(\beta) is the log-likelihood function of the model.
My goal is to find a value of \beta that maximizes \mathcal{L}(\beta).
I can use pymc3 to fit one replica of the model:
- Define one scalar prior R for \beta
- Give the theano operation for the log-likelihood \mathcal{L}(\beta) to NUTS.
- Run NUTS to sample \beta.
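A minimal sketch of what I mean, using a Normal likelihood with known noise as a stand-in for my real log-likelihood (the data and noise scale are made up for illustration):

```python
import numpy as np
import pymc3 as pm

# synthetic stand-in data, just for illustration
rng = np.random.RandomState(0)
X = rng.normal(size=1000)
Y = 2.5 * X + rng.normal(scale=0.1, size=1000)

with pm.Model() as one_replica:
    # one scalar prior R for beta
    beta = pm.Normal("beta", mu=0.0, sd=10.0)
    # hand the theano expression for L(beta) to NUTS via a Potential
    loglike = pm.Normal.dist(mu=beta * X, sd=0.1).logp(Y).sum()
    pm.Potential("loglike", loglike)
    trace = pm.sample(1000, tune=1000)
```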
Alternatively, I can use pymc3 to fit ten replicas of the model:
- The i^{\text{th}} replica is a Y = X\beta_{i} model with the same data X and Y.
- The ten priors R_{i} for the ten replicas have the same distribution.
- All ten replicas have the same log-likelihood function \mathcal{L}(\beta_{i}).
- All ten replicas are independent. If I have a super-model that consists of all ten replicas, the log-likelihood function of the super-model is the sum of the log-likelihoods of all ten replicas, \mathcal{S}(\beta_{1}, \ldots, \beta_{10}) = \sum_{i} \mathcal{L}(\beta_{i}). This should be the same as using an array of log-likelihoods (NUTS and array of log-likelihoods).
- Give the theano operation for the log-likelihood \mathcal{S}(\beta_{1}, \ldots, \beta_{10}) of the super-model to NUTS.
- Use NUTS to sample all ten \beta_{i} simultaneously.
- Combine the ten posterior distributions for all ten \beta_{i} to form one posterior distribution.
The end result should be the same as fitting one replica of the model, except that I get higher throughput. I plan to combine the matrix-multiplication step of all the replicas into one big matrix multiplication on the GPU.
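A sketch of the ten-replica super-model, again with the stand-in Normal likelihood; the broadcasted multiplication here stands in for the one big matrix multiplication:

```python
import numpy as np
import pymc3 as pm

# the same synthetic stand-in data, shared by all replicas
rng = np.random.RandomState(0)
X = rng.normal(size=1000)
Y = 2.5 * X + rng.normal(scale=0.1, size=1000)

n_replicas = 10

with pm.Model() as super_model:
    # ten independent priors R_i with the same distribution
    beta = pm.Normal("beta", mu=0.0, sd=10.0, shape=n_replicas)
    # one big broadcasted multiplication instead of ten separate ones
    mu = beta[None, :] * X[:, None]          # shape (1000, n_replicas)
    # S(beta_1, ..., beta_10) = sum_i L(beta_i)
    loglike = pm.Normal.dist(mu=mu, sd=0.1).logp(Y[:, None]).sum()
    pm.Potential("loglike", loglike)
    trace = pm.sample(1000, tune=1000)

# pool the draws of all ten beta_i into one combined posterior sample
pooled_beta = trace["beta"].reshape(-1)
```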
OK… NUTS doesn’t know this. The mass matrix would be 10^{2} times larger. Hopefully Theano can schedule that on the GPU.