Regarding the use of multiple cores

I was hoping someone could clarify a few things for me when it comes to the use of multiple CPU cores in models.

Reading the docs, it states:

chains : int

The number of chains to sample. Running independent chains is important for some convergence statistics and can also reveal multiple modes in the posterior. If None, then set to either cores or 2, whichever is larger.

cores : int

The number of chains to run in parallel. If None, set to the number of CPUs in the system, but at most 4.

But on the forum I found this thread, which describes how, for that user, NUTS uses all the cores on the machine and how more cores can help with some operations, like those in the GP module. My question is whether there should be noticeable performance gains from using more than 4 cores. And is it possible to use more cores to speed up the sampling of a few chains? Say I have 2 chains I want to sample and 8 cores.

I am asking because I am trying to run a model using the latent variable implementation of GPs, and there seems to be no difference whether I use 4 or 8 cores. I can also specify more cores than I have in my system (say cores=20) and NUTS will still run, but again with no performance improvement. I think my confusion comes from the fact that cores seems to be limited by the number of CPU threads and not physical CPU cores, but I am not sure how this works. The forum thread states that the maximum number of cores that can be used is cores * max(MKL_NUM_THREADS, OMP_NUM_THREADS), but is that still limited to 4 at most?

Thank you for your time.

Maybe @aseyboldt knows a bit more on this.

Maybe it helps to explain a bit how the parallelization works:

When you specify cores>1 in pm.sample, pymc will start one new process for each chain. The main process then tells as many of those processes as the cores argument allows to start sampling; the others just wait and do nothing. When one of the processes is finished, one of the waiting processes is told to start sampling. There will never be more than cores processes working at the same time. If you have n cores in your computer, it makes sense for most models to set the cores argument to that number so that all of them are working.
Things get a bit more complicated if the model uses very large arrays somewhere or if it involves a lot of BLAS operations (those are matrix-vector and matrix-matrix multiplications and some other dense linear algebra routines, e.g. in a model involving large GPs). Each of the processes might then start additional workers on its own. With large arrays, theano will start a thread pool using OpenMP; the array size can be configured in the .theanorc, and the thread pool size with the OMP_NUM_THREADS environment variable. Depending on the BLAS implementation, the number of threads it uses is controlled with MKL_NUM_THREADS or OPENBLAS_NUM_THREADS.
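As a rough illustration of the environment variables mentioned above, here is a minimal sketch of capping the OpenMP and BLAS thread pools from Python. The helper name `limit_native_threads` is made up for this example, and it assumes the variables are set before numpy, theano or pymc are imported, since these libraries typically read them when they are first loaded.

```python
import os


def limit_native_threads(n_threads):
    """Cap the OpenMP / BLAS thread pools (hypothetical helper).

    Assumption: this runs before numpy, theano or pymc are imported.
    """
    value = str(n_threads)
    for name in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
        os.environ[name] = value
    # Return what was set so the caller can log or inspect it
    return {name: os.environ[name] for name in
            ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")}


# One native thread per sampling process, so only `cores` controls parallelism
settings = limit_native_threads(1)
print(settings)

# Afterwards, import the numerical libraries and sample as usual, e.g.:
#   import pymc3 as pm
#   with model:
#       trace = pm.sample(chains=4, cores=4)
```

Setting the same variables in the shell before starting Python should have the same effect.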

Unfortunately, those three sources of parallelism do not know anything about each other. So it can easily happen that you start 8 processes with pm.sample(cores=8), and each of those starts 8 BLAS threads. This gives you 64 threads in total, which will really slow things down. The operating system will do its best to distribute them across the available cores/hardware threads, but if there are not enough available, things will slow down because the workers fight over resources like the cache and because you pay the costs of parallelization without any benefit. In cases like that you need to decrease either cores or the number of BLAS/OpenMP threads.
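The arithmetic behind this can be sketched in a few lines. The function names below are made up for illustration. The defaults follow the docs quoted in the question, and the budget rule (hardware threads divided by the number of chains sampling at once) is just one simple heuristic, not something pymc does for you.

```python
import os


def resolve_defaults(chains=None, cores=None, n_cpus=None):
    """Apply the documented defaults for chains and cores."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if cores is None:
        cores = min(n_cpus, 4)   # number of CPUs, but at most 4
    if chains is None:
        chains = max(cores, 2)   # cores or 2, whichever is larger
    return chains, cores


def peak_threads(chains, cores, native_threads):
    """Worst-case number of busy threads at any one time."""
    active = min(chains, cores)  # never more than `cores` chains sample at once
    return active * native_threads


def native_thread_budget(chains, cores, n_cpus):
    """Heuristic: split the hardware threads among the active chains."""
    active = min(chains, cores)
    return max(1, n_cpus // active)


print(peak_threads(chains=8, cores=8, native_threads=8))    # the 8 * 8 case
print(native_thread_budget(chains=2, cores=8, n_cpus=8))    # 2 chains on 8 cores
```

For the 2-chains-on-8-cores case from the question, this suggests trying 4 BLAS/OpenMP threads per chain. Whether that actually speeds up sampling depends on how much of the model's time is spent in large array or BLAS operations.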


Thanks @aseyboldt for the informative answer. I now have a clear grasp of how the work is distributed for the sampler. I will definitely try changing the number of threads and test whether it helps performance. Will the number of cores specify the number of processes for each core by default, like in your 8 x 8 = 64 BLAS example?