Sorry if this is too silly a question; I’m a newbie in pymc3 and Bayesian statistics. I’m just wondering whether the results of sample can be combined to get a richer estimate of the posterior distribution. I’m asking because my experiment is very slow (a high-dimensional Bayesian logistic regression on categorical variables with many levels), so I was wondering if I could run it on several CPUs in parallel and then combine the results of each sample call afterwards.
Thanks
Patrick
Maybe this can help?
As long as each CPU is running the same model with the same data (& same RNG, I suppose), it is equivalent to running multiple chains on the same CPU. So if you can spin up a bunch of boxes, you can sample many chains simultaneously, each one of which (ideally) represents samples from the same posterior.
You’ll have to manually stitch these together into a MultiTrace if you want to check convergence statistics through pymc3, though this isn’t too difficult.
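For what it’s worth, here is a minimal sketch of what that stitching could look like, assuming each box has pickled the MultiTrace returned by its own sample() call. The filenames are hypothetical, and merge_traces will refuse traces whose chain numbers overlap, so each box would need to sample with a distinct chain_idx offset (if your PyMC3 version exposes that argument):

```python
import pickle
import pymc3 as pm

# Hypothetical filenames: each box pickled the MultiTrace returned by its
# own pm.sample() call (same model, built from the same data, on every box).
with open("trace_box0.pkl", "rb") as f:
    trace0 = pickle.load(f)
with open("trace_box1.pkl", "rb") as f:
    trace1 = pickle.load(f)

# merge_traces raises an error if chain numbers overlap, hence the need for
# distinct chain_idx offsets at sampling time (e.g. 0 and 2 for two boxes
# running 2 chains each).
combined = pm.backends.base.merge_traces([trace0, trace1])

print(combined.nchains)      # total number of chains across both boxes
print(pm.summary(combined))  # convergence statistics over all chains
```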
Thank you both for your help, @Hamster_on_wheels and @chartl; it’s very much appreciated. I’m a little bit confused, though, as this seems to say that I can partition my data by hand (chain 1 on CPU 1, chain 2 on CPU 2, etc.) and recombine the results afterwards, but after reading the abstract of the Expectation Propagation paper, the authors seem to say it’s not that simple:
(…) the idea of distributed modeling and inference has both conceptual and computational appeal, but from the Bayesian perspective there is no general way of handling the prior distribution: if the prior is included in each separate inference, it will be multiply-counted when the inferences are combined; but if the prior is itself divided into pieces, it may not provide enough regularization for each separate computation, thus eliminating one of the key advantages of Bayesian methods.
In my comment, “chain” refers to a single sequence of n NUTS draws from the posterior, starting at some initial (random) point. Re-setting to a different starting point and generating a new sequence therefore generates a new chain.
By default, these are done in parallel in PyMC3, with one core of a CPU devoted to sampling from a single chain.
This can be expanded to multiple CPUs by providing each CPU with a copy of the entire dataset.
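For concreteness, here is a minimal sketch of that default behaviour, assuming a PyMC3 3.x version where pm.sample returns a MultiTrace; the toy data and model below are just placeholders standing in for the actual regression in this thread:

```python
import numpy as np
import pymc3 as pm

# Toy stand-in for the real data (the actual problem here is a
# high-dimensional logistic regression on categorical predictors).
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X.dot(np.array([1.0, -0.5, 0.25])) + rng.randn(200) > 0).astype(int)

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=3)
    p = pm.math.sigmoid(pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)

    # 4 chains, one per core, each starting from a different random point;
    # every chain works on a copy of the full dataset.
    trace = pm.sample(draws=1000, tune=500, chains=4, cores=4, random_seed=123)

print(trace.nchains)  # 4
```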
Splitting the dataset up across multiple CPUs (“batching”) is another matter entirely, and is an active area of research generally. See here for some ideas:
Thanks again for your help @chartl, I understand the subtlety now. Could you tell me one last thing? The way I understand it, there is not much gain in using more than 4 chains. With this in mind, on a cluster of, say, 100 cores, would it be better to parallelize the calls to sample?
So instead of doing sample(10000) on CPU 1 (chain 1), …, sample(10000) on CPU 4 (chain 4), would it be possible to instead do sample(100, chains=4) on CPU 1, and so on up to sample(100, chains=4) on CPU 100, in order to exploit all the CPUs in my cluster? Maybe that would require making sure all the sample calls begin from the same starting point?
By the way, what does RNG mean (in your first answer)?
Cheers,
Patrick
@chartl, sorry, I think I have conflated samples and chains: samples are not independent, but chains are, so I should parallelize over chains, not over samples, right?
So the gain from doing this on a big cluster is not that large (if my assertion about the number of chains in my last message is true). Am I correct?
It’s better to parallelize over chains, up to a point.
Each chain begins with a sequence of tuning samples (by default 500 draws), which allows the sampler to converge towards the posterior and the step size to adapt to the desired acceptance rate (this is the tune= parameter of pm.sample). So if you go whole-hog and draw 1 sample per CPU, many cycles are wasted on the tuning phase. A good balance might be to do sample(3*K, tune=K) and distribute it over 10000/(3*K) cores (for K=500, that’s about 7 cores).
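Spelling out that arithmetic as a rough sketch (the 10,000-draw budget and K=500 are just the numbers from above):

```python
# Total posterior draws wanted, and the tuning cost paid by each worker.
total_draws = 10000
K = 500                       # tune=K, draws=3*K per worker

draws_per_worker = 3 * K      # 1500 retained draws per worker
n_workers = round(total_draws / float(draws_per_worker))
print(n_workers)              # 10000 / 1500 ~= 7 workers

# Each worker would then run something along the lines of
#   pm.sample(draws=3*K, tune=K, chains=1, cores=1)
# so only K / (4*K) = 25% of each worker's iterations go to tuning,
# instead of tuning dominating when the draws per worker are tiny.
```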
RNG is Random Number Generator
Sorry @chartl, I’m not sure I understand (probably because I am still ignorant of the principles behind MCMC). Are you saying that each core would do (for example):
sample(1500,tune=500,chains=4)
and that afterwards I can combine the results into one MultiTrace? If so, doesn’t that end up parallelizing over samples?
If you ran that call on (say) 3 separate machines, you would have 12 independent chains of 1500 samples each.
Again, I’m not sure I understand, sorry @chartl. I understood from your previous answer that you were suggesting to do
sample(1500,tune=500,chains=4)
on each of 7 cores. Am I correct? If not, could you please tell me the exact call you would make on each core?
My laptop has 1 CPU with 4 cores; my desktop has 1 CPU with 4 cores. They both have the data, and they both would run
sample(1500, tune=500, chains=4, cores=4)
Each computer would run 1 chain per core (simultaneously); with 4 cores per computer, this would give 8 chains in total. They would pickle the results and save them to a central location.
Subsequently, I could load both MultiTrace objects (1 from laptop, 1 from desktop) and combine them into a single 8-chain MultiTrace.
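Here is a minimal sketch of what each machine could run, assuming both build the same model from the same data. The MACHINE_ID convention, the filename, and the toy data/model are all hypothetical, and the chain_idx offset (if your PyMC3 version exposes it) is what keeps the chain numbers from colliding when the two traces are merged later:

```python
import pickle
import numpy as np
import pymc3 as pm

MACHINE_ID = 0  # 0 on the laptop, 1 on the desktop (hypothetical convention)

# Toy stand-in for the shared dataset; both machines see the same data.
rng = np.random.RandomState(42)
X = rng.randn(500, 3)
y = (X.dot(np.array([1.0, -0.5, 0.25])) + rng.randn(500) > 0).astype(int)

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=3)
    p = pm.math.sigmoid(pm.math.dot(X, beta))
    pm.Bernoulli("obs", p=p, observed=y)

    # 4 chains on 4 cores; chain_idx offsets the stored chain numbers
    # (0-3 on the laptop, 4-7 on the desktop) so the traces can be merged,
    # and distinct seeds keep the chains independent across machines.
    trace = pm.sample(1500, tune=500, chains=4, cores=4,
                      chain_idx=4 * MACHINE_ID,
                      random_seed=[4 * MACHINE_ID + i for i in range(4)])

# Pickle the per-machine MultiTrace and ship it to the central location.
with open("trace_machine%d.pkl" % MACHINE_ID, "wb") as f:
    pickle.dump(trace, f)
```

The load-and-combine step on the central machine is then just unpickling both files and merging them (e.g. with pm.backends.base.merge_traces), as sketched earlier in the thread.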
Ok, great, thank you for your patience @chartl!