Firstly, thanks a lot for developing and contributing to PyMC3!
Straight to the point: I have been wondering whether it would be possible to run PyMC3 in HPC environments, like supercomputers with several nodes. Does anyone have experience to share? It would be of great help to me.
The thing is: I have a really hard, computationally intensive problem. I solve differential equations with a lot of parameters, so imagine how long each realization takes. So far SMC has been saving me and my group, since it runs nicely in parallel, but I have only run it successfully on a single-node machine with a lot of cores. We have a large supercomputer available to us, however, and initial tests there have failed.
I did some investigation and found that SMC sampling uses multiprocessing behind the scenes. This is a pretty nice library, but AFAIK it can’t handle multi-node machines. However, there is hope: ray and its multiprocessing implementation. With a single-line change in the code (more precisely, in the import), multiprocessing can be used on a cluster by providing a proper configuration file (which is the user’s responsibility). What do you PyMC3 devs think about it? Would it be worthwhile?
I have been thinking about contributing to PyMC3. If you like the idea above, it would be a pleasure to open a PR from my side.
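To make the "single line change in the import" idea concrete, here is a minimal sketch, not PyMC3's actual code. It assumes Ray is installed and that `ray.util.multiprocessing.Pool` behaves as a drop-in replacement for the standard library pool; `select_pool_class`, `square`, and `run_demo` are hypothetical names used only for illustration.

```python
import multiprocessing


def select_pool_class(use_ray=False):
    """Return a Pool class: Ray's drop-in replacement or the standard library one."""
    if use_ray:
        # Imported lazily so that Ray is only required when it is requested.
        # Assumption: Ray is installed and exposes this multiprocessing-style Pool.
        from ray.util.multiprocessing import Pool
        return Pool
    return multiprocessing.Pool


def square(x):
    # Hypothetical stand-in for an expensive model or likelihood evaluation.
    return x * x


def run_demo(use_ray=False, processes=2):
    """Map a function over inputs with whichever pool backend was selected."""
    pool_class = select_pool_class(use_ray)
    with pool_class(processes=processes) as pool:
        return pool.map(square, range(8))
```

Calling `run_demo()` from under an `if __name__ == "__main__":` guard uses the local pool; `run_demo(use_ray=True)` would, under the assumptions above, send the same work to Ray. Connecting to an existing multi-node cluster is left to the user's Ray configuration and is not shown here.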
I’m sure that any robust PR implementing new features would be very welcome.
I just wanted to check that you were aware of PyMC4? One of the aims seems to be better GPU usage via TensorFlow. I imagine that a byproduct of this would be better multi-node usage as well. I mention it in case it is helpful.
I am not familiar with ray. SMC needs access to all chains/batches at the end of each step for resampling, and I am not sure whether that would work with ray. If you find it useful, we would certainly welcome a PR (we can set it as a flag).
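The resampling requirement mentioned above can be illustrated with a small sketch. This is a generic tempered SMC reweight-and-resample step in plain Python, not the PyMC3 implementation; `resample_particles` and its arguments are hypothetical names. It shows why the step is a synchronization point: normalizing the weights needs the log-likelihood of every particle, wherever it was computed.

```python
import math
import random


def resample_particles(particles, log_likelihoods, delta_beta, rng):
    """One tempering update: reweight all particles, then resample them."""
    # Incremental importance weights for raising the inverse temperature by delta_beta.
    log_w = [delta_beta * ll for ll in log_likelihoods]
    max_log_w = max(log_w)  # subtract the maximum for numerical stability
    weights = [math.exp(lw - max_log_w) for lw in log_w]
    total = sum(weights)  # normalization needs every particle's weight
    probs = [w / total for w in weights]
    # Multinomial resampling draws from the whole population at once.
    return rng.choices(particles, weights=probs, k=len(particles))


# Example: five particles scored by a standard normal log-density (up to a constant).
particles = [-2.0, -1.0, 0.0, 1.0, 2.0]
log_likelihoods = [-0.5 * x * x for x in particles]
new_particles = resample_particles(particles, log_likelihoods, 0.5, random.Random(0))
```

The mutation phase between resampling steps can run independently per particle, which is the part a worker pool parallelizes; the gather before resampling is where all workers have to report back.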
Did you try to see if your solution works?
Otherwise, as @sammosummo mentioned, you can look at PyMC4/TFP, to which I recently added SMC as well: https://github.com/tensorflow/probability/tree/master/tensorflow_probability/python/experimental/mcmc/examples (there is even an ODE example).
I’m sure that any robust PR implementing new features would be very welcome.
Great! I’m working on some tests.
I just wanted to check that you were aware of PyMC4? One of the aims seems to be better GPU usage via TensorFlow. I imagine that a byproduct of this would be better multi-node usage as well. I mention it in case it is helpful.
Ah, yes! PyMC4 looks pretty interesting, and I checked it out. However, since this is very critical research (about COVID-19), PyMC4 didn’t seem appropriate, because it describes itself as an early-stage project. So we decided to use an established tool (like PyMC3) instead of a new one that is still under development. But we want to test PyMC4 in the future because of its GPU capabilities, as you mentioned. Thanks!
I am not familiar with ray. SMC needs access to all chains/batches at the end of each step for resampling, and I am not sure whether that would work with ray. If you find it useful, we would certainly welcome a PR (we can set it as a flag).
I think that ray will do alright. If multiprocessing does the right thing, ray will do it as well. The only difference is that ray has an implementation of multiprocessing that can be aware of multiple nodes, while multiprocessing can’t detect or set up workers on multiple nodes, at least as I understand it.
Did you try to see if your solution works?
I’m working on it at this moment! If everything looks good, I’ll submit a PR for you guys to check it out.
Great work, well done! However, there is no clear API at the moment, and generating results with it while working with colleagues who have no previous experience with the tool would take time we don’t have, unfortunately. But, as I mentioned before, I am interested in learning it in the future. I didn’t know that you had implemented SMC in TFP. When we began the project, you hadn’t submitted the PR yet. But I’m now watching the repo!
You are welcome! If the speed bottleneck persists and you are interested in exploring a solution in PyMC4/TFP, feel free to reach out and I can (or I will find someone who can ;-)) help you port the model and inference. FWIW, the implementation of SMC in TFP is very similar to the one in PyMC3, with the additional flexibility to use HMC as the internal mutation, which should make it scale much better to more dimensions. Also, if you are working on Covid and need to fit many time series at the same time, TFP generally gives better support for multi-batch, which means you can fit multiple copies of the same model at once.
Amazing, @junpenglao! Right now we are stuck documenting current results in a paper. But as soon as I finish this part, I will try to contact you. Actually multi-batch would be very useful since we analyze and simulate for several locations. TFP looks pretty exciting, I have to investigate it further.
Out of curiosity: I haven’t compared your SMC with the one in PyMC3. The PyMC3 one looks like an implementation of Cascading Adaptive Transitional Metropolis in Parallel (CATMIP), if I understood it right. Does your implementation in TFP follow the one in PyMC3, or is it another method? Since you commented on the mutation step (there is a similar mechanism in CATMIP, if I remember well), is it different from the one in PyMC3?
Yep, we have adaptive tuning very similar to the one in PyMC3, but with the flexibility that you can do the same for HMC. We are planning to add more tuning like the one from https://arxiv.org/pdf/1808.07730.pdf.
Hey, I am trying to do something similar – Segmentation fault (core dumped) on running pm.sample in Ubuntu.
What is the most efficient way to do this as of now (especially for large data)? I have access to an HPC environment, but I am not sure I’m using it to its fullest with PyMC3. When I run a PyMC3 program on 6 nodes, only one of the nodes is at 100% CPU and the other nodes are more or less idle, as checked with the seff command. Am I doing something wrong, or is this the expected behaviour?
From this discussion, here’s my takeaway:
Right now, PyMC3 has no support for computing on multiple nodes (though there is scope to add it), so increasing the number of nodes won’t help for now.
Increasing the number of cores does speed up the sampling process.
Best solution for now: use only a single node, but try to get the maximum number of cores on it.
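For the single-node approach above, a small helper can report how many cores the job was actually granted, so the sampler is not asked for more than that. This is a sketch under assumptions: it presumes a Slurm scheduler (the thread mentions seff), reads the `SLURM_CPUS_PER_TASK` variable that Slurm sets when `--cpus-per-task` is requested, and `available_cores` is a hypothetical name.

```python
import os


def available_cores(env=None):
    """Estimate how many CPU cores this process may use on the current node."""
    env = os.environ if env is None else env
    # Assumption: under Slurm, this variable holds the per-task core allocation.
    slurm_value = env.get("SLURM_CPUS_PER_TASK")
    if slurm_value and slurm_value.isdigit() and int(slurm_value) > 0:
        return int(slurm_value)
    # On Linux, the affinity mask reflects the cores this process may run on.
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    # Fallback: all cores the operating system reports.
    return os.cpu_count() or 1


n_cores = available_cores()
```

The result could then be passed to the sampler, for example as the `cores` argument of `pm.sample`, so the number of worker processes matches the allocation on that one node.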
Can someone confirm whether my takeaway is right?
Is there a better “best solution for now”?
My model is a probabilistic DAG, but it also follows a time series, and I want to do Bayesian inference with it. Are there any alternatives to PyMC3 that might be faster for my use case? How about Stan/PyStan: is it faster?
I realize this is an old topic, but I figure it may still be of interest to many. Ray has already been mentioned, and it has developed incredibly fast since the question was posted. I haven’t tried it myself, but it should be possible to use Ray at least via JAX: PyMC3 can use JAX as a backend (see "Using JAX for faster sampling" in the PyMC3 3.10.0 documentation), and JAX can utilize Ray. There has just been a keynote on this combination by JAX developer Matt Johnson at the Ray Summit, which is currently taking place.
Please, could you share a link to the aforementioned keynote on this combination?
A lot has changed since this question was asked. I would appreciate other ideas about running PyMC3 with large datasets on multi-node clusters (especially VI).
@junpenglao Are there other threads on this forum with some insights? Some examples from members of this group would also be helpful.