NUTS uses all cores

#1

I am running pymc3 on a machine with a large number of cores (>32), and when sampling with NUTS all of the cores are being utilized. I have set the following environment variables:

MKL_NUM_THREADS=8
OMP_NUM_THREADS=8

but I still observe this behavior with NUTS, though Metropolis is well behaved. Are there any other variables I should set to limit the core usage?
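(A minimal sketch of this setup, with one caveat worth checking: these variables only take effect if they are in the environment before numpy/theano are first imported.)

```python
import os

# Thread limits must be in the environment before numpy/theano are first
# imported, e.g. at the very top of the script (or exported in the shell
# before launching Python / the notebook server).
os.environ["MKL_NUM_THREADS"] = "8"
os.environ["OMP_NUM_THREADS"] = "8"

import pymc3 as pm  # import only after the limits are in place
```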

#2

Some of the Theano ops will use all the cores, for example in the GP module. I think @bwengals knows a bit more about this.

#3

It is strange, because NUTS was working fine on some of the problems I was working on yesterday, but now something seems to have changed.

If there is a way to limit NUTS to only use a specific number of cores that would be very helpful.

#4

The reason it worked before was user error. I was editing a Jupyter notebook, and some variables were stored that I didn’t recognize.

#5

Do you use the MvNormal distribution in your model?

#6

Yes, I am.

#7

Yeah, from a multidimensional Gaussian process.

The matrix operations Theano uses here are multithreaded, so running multiple chains simultaneously bogs things down.

I am not sure how to limit it to a single thread per chain, though.
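For concreteness, a minimal sketch of that situation (a hypothetical model, assuming a recent pymc3; not the actual code from this thread): an MvNormal likelihood whose dense covariance comes from a GP kernel, so its logp has to factorize an n × n matrix at every leapfrog step, and that factorization is a multithreaded BLAS/LAPACK call.

```python
import numpy as np
import pymc3 as pm

n = 200
X = np.linspace(0.0, 10.0, n)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(n)

with pm.Model():
    ls = pm.Gamma("ls", alpha=2.0, beta=1.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    # Dense n x n covariance from a GP kernel plus observation noise.
    K = pm.gp.cov.ExpQuad(1, ls=ls)(X) + sigma ** 2 * np.eye(n)
    # The Cholesky factorization inside MvNormal's logp is a BLAS/LAPACK
    # call, so each chain can spread across many cores on its own.
    obs = pm.MvNormal("obs", mu=np.zeros(n), cov=K, observed=y)
    trace = pm.sample(cores=2)
```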

#8

Okay, it isn’t an issue at the moment, but it would be helpful if that were something we could set in the future.

#9

Hey, I found the same issue with HMC and NUTS. It is really troublesome, since I am working on a shared machine where I am not allowed to occupy all the CPUs. Is there a way to limit the cores a sampler can use? Metropolis works well when setting cores=1.

#10

There are three reasons why NUTS and HMC might use several cores (see the sketch after this list):

  • Some Theano ops use BLAS, which is usually multithreaded. There are several implementations of BLAS, and which one we use depends on which one numpy uses (you can check with np.__config__.show()). If you are using MKL, you can control the number of threads by setting the environment variable MKL_NUM_THREADS. The same variable should also work for OpenBLAS. If you are using ATLAS, you are out of luck, as that one must be configured at compile time.
  • Some Theano ops use openmp explicitly. You can switch that off entirely by setting a config option in ~/.theanorc: http://deeplearning.net/software/theano/library/config.html#config.openmp. And you can control the number of threads with OMP_NUM_THREADS.
  • By default we use multiprocessing to run several chains in parallel. You can control the number of cores we use there with the cores kwarg, as in pm.sample(cores=4). So the maximum number of cores you might be using is cores * max(MKL_NUM_THREADS, OMP_NUM_THREADS).
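A minimal sketch pulling those three knobs together (it assumes an MKL or OpenBLAS numpy and a pymc3 version that accepts the cores kwarg; the THEANO_FLAGS line is just an alternative to editing ~/.theanorc):

```python
import os

# 1. BLAS threads: must be in the environment before numpy/theano load.
os.environ["MKL_NUM_THREADS"] = "1"  # should also work for OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"

# 2. Theano's explicit openmp ops (same effect as openmp = False in ~/.theanorc).
os.environ["THEANO_FLAGS"] = "openmp=False"

import numpy as np
np.__config__.show()  # check which BLAS numpy is actually linked against

import pymc3 as pm

with pm.Model():
    x = pm.Normal("x", 0.0, 1.0)  # trivial stand-in model
    # 3. Parallel chains: at most cores * max(MKL_NUM_THREADS, OMP_NUM_THREADS)
    # cores should be busy; here that is 4 * 1 = 4.
    trace = pm.sample(cores=4)
```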
#11

So cores is not the correct argument name…

#12

FYI, by setting MKL_NUM_THREADS=1 and turning off Theano’s config.openmp, the sampler actually seems to run faster. Maybe this has something to do with the overhead induced by inter-process communication.