Maybe it helps to explain a bit how the parallelization works:

When you specify `cores > 1` in `pm.sample`, PyMC will start one new process for each chain. The main process then tells `cores` of those processes to start sampling; the others just wait and do nothing. When one of the sampling processes is finished, one of the waiting processes is told to start sampling. There will never be more than `cores` processes working at the same time. If you have `n` cores in your computer, it makes sense for most models to set the `cores` argument to that number so that all of them are working.
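As a small sketch of that default choice (the `pm.sample` call is shown as a comment since it needs a model and an installed PyMC; the variable names here are just illustrative):

```python
import os

# Number of hardware threads the OS reports; a reasonable default
# for the `cores` argument on most models.
n_cores = os.cpu_count() or 1

# Hypothetical usage inside a model context:
# import pymc3 as pm
# with model:
#     trace = pm.sample(chains=n_cores, cores=n_cores)
```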
Things get a bit more complicated if the model uses very large arrays somewhere, or if it involves a lot of BLAS operations (matrix-vector and matrix-matrix multiplications and some other dense linear algebra, e.g. in a model involving large GPs). Each of the chain processes might then start additional workers on its own: with large arrays, Theano will start a thread pool using OpenMP, whose size can be configured in the `.theanorc` or with the `OMP_NUM_THREADS` environment variable. Depending on the BLAS implementation, the number of threads it uses is controlled with `MKL_NUM_THREADS` or `OPENBLAS_NUM_THREADS`.
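These environment variables are read when the relevant library is first loaded, so a common pattern is to set them at the very top of the script, before importing Theano or anything that pulls in BLAS. A minimal sketch (the value `"1"` disables the extra threading entirely; pick a larger number if you want some BLAS parallelism per chain):

```python
import os

# Must be set BEFORE theano / numpy first load the BLAS library;
# changing them later in the same process usually has no effect.
os.environ["OMP_NUM_THREADS"] = "1"       # size of theano's OpenMP thread pool
os.environ["MKL_NUM_THREADS"] = "1"       # if numpy/theano link against MKL
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # if they link against OpenBLAS

# import theano  # only import after the variables are set
```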
Unfortunately, those three sources of parallelism do not know anything about each other. So it can easily happen that you start 8 processes with `pm.sample(cores=8)`, and each of those starts 8 BLAS threads. This gives you 64 workers in total, which will really slow things down. The operating system will do its best to distribute the processes over the available cores/hardware threads, but if there are not enough available, things will slow down because the processes fight over resources like the cache, and because you pay the costs of parallelization without any benefit. In cases like that you need to decrease either `cores` or the number of BLAS/OpenMP threads.
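A rough way to budget this (just back-of-the-envelope arithmetic, not a PyMC API): keep the product of sampling processes and BLAS/OpenMP threads per process at or below the number of hardware threads.

```python
import os

hardware_threads = os.cpu_count() or 1

# Total workers = sampling processes * BLAS/OpenMP threads per process.
sample_cores = 4  # illustrative value for pm.sample(cores=...)
blas_threads = max(1, hardware_threads // sample_cores)

# e.g. on a 8-thread machine: 4 chains * 2 BLAS threads = 8 workers,
# so nothing is oversubscribed.
total_workers = sample_cores * blas_threads
```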