Hardware for the fastest sampling

Hi PyMC3 Community,
I’m considering buying a new computer, and I wanted to know: what aspects of the hardware make for faster sampling?
I know that with multiple cores you can sample more chains at once, but beyond the 4 or so chains needed to check convergence, there doesn’t seem to be much gain other than more robust convergence checks.
I have found that the GPU does not really boost performance, and I read (I can’t recall where) that the GPU only helps with really large matrix multiplications.
The speed of the RNG is probably a crucial bottleneck, but I’m not sure if you can spec a computer for faster RNG?
Lots of speedy RAM would help up to a point I imagine?
So I would think a decent amount of fast RAM plus 4 fast (high-GHz) CPU cores is the best that can be done for a personal computer?
Thanks!


Good hardware won’t help you that much at the moment, unfortunately. A decent (and well-cooled) CPU and good RAM help for sure, but the difference won’t be that huge. The RNG isn’t a bottleneck; we only draw a standard normal from time to time during NUTS sampling. I’m still hoping that there will be quite a few problems where a GPU will help, but that is not well explored, and I doubt it will for models with small data sizes.
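
For reference, chains and cores are set directly in pm.sample; here is a minimal sketch with a toy model (the model itself is just a placeholder for illustration):

import pymc3 as pm

with pm.Model():
    x = pm.Normal("x", mu=0.0, sigma=1.0)
    # One chain per core; beyond ~4 chains the extra cores mostly buy
    # more robust convergence checks rather than faster sampling.
    trace = pm.sample(draws=1000, chains=4, cores=4)
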
Most of the speed improvements you can get are the ones you can’t buy with hardware: good models, good parametrizations, and care and experimentation with the theano graph. Profiling also helps a lot. 🙂
You can get a summary from theano:

import numpy as np

# `model` is your pymc3 model, e.g. from `with pm.Model() as model: ...`
func = model.logp_dlogp_function(profile=True)
func.set_extra_values({})  # no extra (shared) variables in this example
x = np.random.randn(func.size)
for _ in range(1000):
    func(x)
func.profile.summary()
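
The summary breaks the time down per Op in the graph, so you can see which operations dominate the logp and gradient evaluation.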

Also, in many models it helps to make sure that you are using a decent BLAS. For an Intel processor that would usually be MKL:
https://conda-forge.org/docs/maintainer/knowledge_base.html#switching-blas-implementation
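
A quick way to check which BLAS your numpy build is actually linked against (look for mkl or openblas in the output):

import numpy as np

# Print the build/link configuration, including the BLAS/LAPACK libraries.
np.show_config()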


Ah, this is awesome, thanks!
I didn’t know about profiling. I’m sure that will help me out going forward 🙂

Yeah, aseyboldt gave you as good a rundown as you could expect.

One thing that he didn’t explicitly mention is that, judging from random posts around here and elsewhere on the internet, really complicated models (and I mean really complicated ones) can call for running 10+ chains at a time, and in those cases multiple cores do become necessary.

However, even if you do reach that point, it’s easier to set up a Google Compute account (or any other cloud service), request a VM with a high core count, and work in that, instead of making a dubious investment in a personal Bayesian computation machine.

aseyboldt, by the way, I am using OpenBLAS as my BLAS since it’s easier to install; is MKL known to be better?


About MKL vs OpenBLAS:
On AMD, OpenBLAS is sometimes a bit faster; on Intel, MKL is sometimes significantly faster (I’ve seen factors of ~3x). I’d generally go for MKL as a default, especially on Intel. I’ve also seen cases where MKL dealt better with oversubscription: if you have several chains running in parallel, each using several cores for BLAS parallelism, you can easily end up with more threads than cores unless you set MKL_NUM_THREADS etc. to a lower value.
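
A sketch of how you can pin the BLAS thread pool per process (the exact variable depends on the BLAS: MKL_NUM_THREADS for MKL, OPENBLAS_NUM_THREADS for OpenBLAS; either way they have to be set before numpy/theano are imported):

import os

# Keep chains * BLAS threads <= physical cores to avoid oversubscription.
# These must be set before numpy/theano are first imported.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np  # the BLAS now respects the limits above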

If you have a model that does lots of linear algebra, just try both. Switching has gotten much easier in the last couple of conda-forge releases.
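
For example, one rough way to compare is to time a large matrix product under each BLAS build (the sizes here are arbitrary; pick something close to your model’s actual linear algebra):

import time
import numpy as np

# Time a batch of large matmuls, the kind of kernel BLAS parallelizes.
a = np.random.randn(2000, 2000)
start = time.time()
for _ in range(10):
    a @ a
print(f"10 matmuls: {time.time() - start:.2f}s")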


Huh, it’ good to hear that. I always looked at the general benchmarks for OpenBLAS vs mkl, and they always did rather evenly, but I guess those benchmarks are for general linear algebra operations, and MCMC is a bit more complex and hard to wrangle, so mkl does better. Will try to switch, thanks!