Regarding the use of multiple cores

Bobrepoman · December 22, 2019, 9:52pm

I was hoping someone could clarify a few things for me when it comes to the use of multiple CPU cores in models.

reading the docs it states

chainsint

The number of chains to sample. Running independent chains is important for some convergence statistics and can also reveal multiple modes in the posterior. If None, then set to either cores or 2, whichever is larger.

coresint

The number of chains to run in parallel. If None, set to the number of CPUs in the system, but at most 4.

But on the forum i found this thread which describes how for the user NUTS uses all the cores on the machine and how more cores can help with some operations like those in GP module. My question is if there should be noticeable performance gains from using more than 4 cores? And if it is possible to use more cores to speed up the sampling of a few chain, say i have 2 chains i want to sample and 8 cores. I am asking because i am trying to run a model using the latent variable Implementation of GP’s and there seems to be no difference if i use 4 or 8 cores. I can also specify more cores than i have in my system (say cores=20) and NUTS will still run, but again no performance improvements. I think my confusion comes from the fact that cores seems to be limited by the number of CPU threads and not physical CPU cores, but i am not sure how this works, the forum thread states the max number of cores that can be used is cores * max(MKL_NUM_THREADS, OMP_NUM_THREADS), but that is still limited to 4 at max?

Thank you for your time.

junpenglao · December 23, 2019, 10:36am

maybe @aseyboldt knows a bit more on this.

aseyboldt · December 23, 2019, 7:02pm

Maybe it helps to explain a bit how the parallelization works:

When you specify cores>1 in pm.sample, pymc will start one new process for each chain. The main process then tells cores of those processes to start sampling, the others will just wait and do nothing. When one of the processes is finished, one of the waiting processes is told to start sampling. There will never be more than cores processes working at the same time. If you have n cores in your computer, it makes sense for most models to set the cores argument to that number so that all of them are working.
Things get a bit more complicated if the models is using very large arrays somewhere or if it involves a lot of BLAS operations (those are matrix-vector and matrix-matrix multiplications and some other dense linear algebra related things, eg in a model involving large GPs). Each of the processes might then start additional workers on its own: With large arrays, theano will start a thread pool using openmp, the size can be configured in the .theanorc and the thread pool size with the OMP_NUM_THREADS environment variable. Depending on the blas implementation, the number of threads those use is controlled with MKL_NUM_THREADS or OPENBLAS_NUM_THREADS.

Unfortunately, those three sources of parallelism do not know anything about each other. So it can easily happen that you start 8 processes with pm.sample(cores=g), and each of those starts 8 processes using blas. This gives you 64 processes in total, which will really slow things down. The operating system will do it’s best to distribute the processes to the available cores/hardware threads, but if there are not enough available, things will slow down because the processes fight over resources like the cache and because you pay the costs of parallelization without any benefit. In cases like that you need to either decrease cores or the number of blas/openmp threads.

Bobrepoman · December 23, 2019, 11:30pm

Thanks @aseyboldt for the informative answer. I now have a clear grasp of how the work is distributed for the sampler. I will definitely try to change the number of threads and test if it helps performance. Will the number of cores specify the number of processes for each core by default? Like in your 8X8=64 blas example?

ceebeelee · July 18, 2023, 3:32pm

Is there documentation / examples of how to set those variables for BLAS or etc?
This is a simple four-chain estimate. It uses 128 threads!! :

Crazy! And I’m not sure how efficient this is? Aren’t all those kernel processes (ie the red color in the load bars) bad news?

Topic		Replies	Views
Cores not optimally used version agnostic bug	16	106	November 26, 2024
What is the effect of running sample() with 4 chains on 16 cores? Questions	2	470	May 7, 2019
Number of cores settings in sampling Questions	3	674	October 30, 2021
Does anyone know of any guides or docs for parallel sampling in current version of PYMC (mp_ctx) v5	3	1146	July 24, 2023
Sample with multiple cores Questions	3	1474	September 10, 2020

Regarding the use of multiple cores

Related topics