Limiting the number of cores/threads used in PyMC 5.6+

There seem to be a number of queries posted here about changing the number of cores used, many of them unanswered or unresolved. And most are old. So I’m starting a new question, but correct me if this is wrong practice.

Is there documentation / examples of how to limit the number of cores used in sampling/estimation? Or how to choose a computation backend, if that’s a thing?

This is a simple four-chain estimate, yet it uses all 128 threads:

Crazy! And I’m not sure how efficient this is. Aren’t all those cycles spent in the kernel (i.e., the red color in the htop load bars) bad news?

I’m not specifying the cores= parameter in pm.sample; others have said that setting it does not help with this problem.

I’m interested in what is efficient, and also in limiting the number of cores used so that I can estimate more than one model at once and so that my server can do other things!!
My computation processes are niced.

Thanks!
c

You can try setting the environment variable OMP_NUM_THREADS=1.
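
For example (just a sketch; the variable needs to be set before the numerical libraries are imported):

```python
import os
os.environ["OMP_NUM_THREADS"] = "1"  # must be set before numpy/pymc are imported

import pymc as pm  # import only after the environment variable is set
```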

That did not make any difference!

```
!echo $OMP_NUM_THREADS
40
```

But the estimate still used 128 threads.

You can try MKL_NUM_THREADS instead

Same thing using MKL_NUM_THREADS as well: all 128 threads fully used.

Maybe worth trying to reproduce on a more conventional machine and see if the problem also crops up there?

And to be sure, did you try setting them to 1, not 40? Are those real CPU cores or virtual ones?
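
(One quick way to check is something like this; psutil is a separate package, so this is just a sketch:)

```python
import os
import psutil  # separate package: pip install psutil

print("logical CPUs  :", os.cpu_count())                  # includes hyperthreads / SMT
print("physical cores:", psutil.cpu_count(logical=False))
```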

Okay, I tried on my laptop with the max set to 2 and then with it set to 1. In both cases, all 16 threads of my laptop are used at 100%.

Just to be clear, this does not happen when my data size is small (1000) but does when it is larger (10000). With smaller samples, four threads are used (one for each chain).

How and when do you set the env variables?

From within Python, before sampling the model:

     
```python
if max_processor_threads is not None:
    os.environ["OMP_NUM_THREADS"] = str(max_processor_threads)
    print(f"Set OMP_NUM_THREADS to {max_processor_threads}")
    os.environ["MKL_NUM_THREADS"] = str(max_processor_threads)
    print(f"Set MKL_NUM_THREADS to {max_processor_threads}")
trace_filename = f"{self.basename}.nc"

print(f"Building model {modelclass} for {self.basename}")
model = self.build_model(df, modelclass, **kwargs)
with model:
    trace = pm.sample()  # return_inferencedata=True
```

Try to do it before any other imports


Thanks.
Problem still exists in PyMC 5.10.4 etc.
The trick for me was that it is numpy that must not have been imported prior to setting

```python
os.environ['OPENBLAS_NUM_THREADS'] = '1'
```

But I was working with an ipython startup shortcut which preloaded numpy automatically before I got going.
So my fix is that I now have a module with the following content, which I import first. It warns me if numpy has already been loaded before I set the OPENBLAS environment variable:

```python
import os
import sys

# On my computation server and laptop, stopping OpenBLAS multithreading speeds
# things up by stopping the wildly multithreaded numpy! Do not preload numpy in
# ipython (or in any earlier-loaded modules).
os.environ['OPENBLAS_NUM_THREADS'] = '1'
# The following seem not to matter:
# os.environ['MKL_NUM_THREADS'] = '1'
# os.environ['NUMEXPR_NUM_THREADS'] = '1'
# os.environ['OMP_NUM_THREADS'] = '1'

# Check whether numpy has already been imported anywhere (e.g. by an ipython startup file):
if 'numpy' in sys.modules:
    print('\n\nIt looks like you have loaded numpy before we have disabled OPENBLAS '
          'crazy-threading. Do not preload numpy in ipython.\n\n')
    raise ImportError("Start ipython without numpy preloaded")
else:
    print(' Successfully checked that numpy was not preloaded before setting OpenBLAS variable.')
    import numpy as np
```
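
In a script or ipython session the ordering then looks roughly like this (the module name limit_blas_threads is just an example; adapt as needed):

```python
# Import the guard module before anything that might pull in numpy,
# so that OPENBLAS_NUM_THREADS is already set when OpenBLAS gets loaded.
import limit_blas_threads  # example name for the module shown above

import pymc as pm  # numpy now comes in with OpenBLAS limited to one thread

# ... build and sample the model as usual ...
```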

Now htop looks a lot nicer! This is with 15 estimates now going in parallel, each taking 4 threads for its 4 chains, and with no OpenBLAS splitting:


Unlike before, those processes are blue, not red. 🙂


Thank you! I run into the same problem when I try to estimate the resource requirements on the login node of our HPC before submitting jobs. It would use lots of threads, which leads to vastly increased running time of pm.fit() in my case. Setting os.environ['OPENBLAS_NUM_THREADS'] = '4' (after import os) at the beginning of my scripts seems to work well to limit this.
Interestingly, limiting the threads on my laptop also makes pm.fit() run a bit faster than using all cores (which it does by default). I don’t understand enough about what’s going on behind the scenes, but it seems that the sweet spot is around 4 threads?
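
If you want to probe that sweet spot without restarting Python for every thread count, something like the sketch below might help. It assumes the threadpoolctl package is installed, and the toy model and thread counts are made up purely for illustration:

```python
import time

import numpy as np
import pymc as pm
from threadpoolctl import threadpool_limits  # separate package: pip install threadpoolctl

# Toy model purely for timing; substitute your own.
rng = np.random.default_rng(0)
y = rng.normal(size=10_000)

with pm.Model() as model:
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("obs", mu, sigma, observed=y)

for n_threads in (1, 2, 4, 8):
    # threadpool_limits caps BLAS/OpenMP threads at runtime, even after numpy is loaded
    with threadpool_limits(limits=n_threads):
        t0 = time.perf_counter()
        with model:
            pm.fit(n=10_000, progressbar=False)
        print(f"{n_threads} threads: {time.perf_counter() - t0:.1f} s")
```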

Regarding the optimal number of threads: I have seen in other (non-PyMC) applications that using fewer threads can boost performance. This happens in particular when the tasks are memory-intensive; the data-transport overhead then clogs the processor.

It can also help to try various linear algebra libraries.
But np.show_config() is known to not always be accurate about which library is actually in use.
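
One way to see which BLAS/OpenMP libraries are actually loaded at runtime (as opposed to what numpy was built against) is the threadpoolctl package; sketch below, assuming it is installed:

```python
import numpy as np
np.show_config()  # what numpy was built against; not always what is loaded at runtime

from threadpoolctl import threadpool_info  # separate package: pip install threadpoolctl
for lib in threadpool_info():
    print(lib["internal_api"], lib["filepath"], "num_threads =", lib["num_threads"])
```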

The common problem is the system trying to use more threads than are actually available; the extra threads then have to wait around until cores are released.
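
As a rough illustration of that oversubscription arithmetic (the numbers here are hypothetical, just to show the shape of the problem):

```python
# With multiprocess sampling, each chain process can let OpenBLAS spawn
# its own pool of BLAS threads on top of the chain itself.
n_chains = 4
blas_threads_per_chain = 32   # hypothetical
physical_cores = 64           # hypothetical

total_threads = n_chains * blas_threads_per_chain
print(f"{total_threads} worker threads competing for {physical_cores} cores "
      f"-> oversubscribed: {total_threads > physical_cores}")
```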