Cores not optimally used

Hi
I’m using PyMC with the default options (i.e. 4 chains, no cores argument) on a 4 core/8 thread machine.
Instead of using all cores, it puts 2 chains on each of 2 cores.

ps uax shows
472689 xx 20 0 2285228 849228 19000 R 50.0 3.5 2:48.93 python
472686 xx 20 0 2285196 849228 19000 R 49.7 3.5 2:48.73 python
472687 xx 20 0 2285196 849228 19000 R 49.7 3.5 2:49.56 python
472688 xx 20 0 2285196 849228 19000 R 49.7 3.5 2:49.31 python

Distributing the chains over all cores should allow a 2x speedup.

I also don’t see the typical core-switching that one usually sees when running CPU-intensive jobs.

Finally, I also tested on a workstation with many cores, with the same result.
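
One way to confirm which core each sampling process actually sits on is to read it from /proc. This is only a minimal sketch, assuming Linux; the helper name last_cpu is made up for illustration, and the field position (field 39, "processor") is taken from the proc(5) man page.

```python
import os


def last_cpu(pid="self"):
    """Return the index of the CPU a process last ran on (Linux only), else None."""
    path = f"/proc/{pid}/stat"
    if not os.path.exists(path):
        return None  # not Linux, or the process does not exist
    with open(path) as fh:
        stat = fh.read()
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split after the last ')'. The remainder starts at field 3.
    fields = stat.rsplit(")", 1)[1].split()
    # Field 39 is "processor", which is index 36 in the remainder.
    return int(fields[36])


print("this process last ran on CPU", last_cpu())
```

Calling last_cpu(pid) repeatedly for the four sampler PIDs should show whether two of them keep reporting the same CPU index.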

I think by default PyMC uses as many cores as chains? Did you try setting the cores argument to pm.sample explicitly?

Hi
Setting cores in pm.sample(cores=4) did not change anything.

I think the cause is deeper down the rabbit hole, in how PyMC deals with parallel jobs.

With cores=4, does PyMC still try to sample 2 chains x 2 cores only? Can you show the logging message it prints?

It runs 4 chains all right, but they are distributed over 2 cores only, not 4.

Which log would you like?

This line: pymc/pymc/sampling/mcmc.py at 5352798ee0d36ed566e651466e54634b1b9a06c8 · pymc-devs/pymc · GitHub

It reports:
Multiprocess sampling (4 chains in 4 jobs)

What does PyMC use for multithreading?

I tried OMP_NUM_THREADS=4 and OMP_NUM_THREADS=1, but saw no difference.
IPython vs python made no difference either.
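
For anyone repeating this test: these variables are normally read when the numerical libraries are first loaded, so they need to be set before importing numpy or pymc. A minimal sketch, assuming the usual variable names for OpenMP, OpenBLAS and MKL; the helper limit_native_threads is hypothetical, and which variable is honoured depends on the BLAS/OpenMP runtime the installed NumPy is linked against.

```python
import os

# Assumed set of variables; only the ones matching the installed runtime matter.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_native_threads(n=1):
    """Set the thread-count variables for this process and its children."""
    for name in THREAD_VARS:
        os.environ[name] = str(n)
    return {name: os.environ[name] for name in THREAD_VARS}


# Call this before `import numpy` / `import pymc`, otherwise it may have no effect.
print(limit_native_threads(1))
```

Note that these settings limit the number of threads inside each process. They do not decide which core a process is scheduled on, so they may well have no influence on the behaviour described here.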

Can you show a complete example that we can run locally to compare?

This uses 4 cores, as it should, on my machine with PyMC 5.18.2:

import pymc as pm
import numpy as np
import pytensor.tensor as pt

with pm.Model() as model:
    pm.Normal("x", shape=1000)

with model:
    trace = pm.sample(draws=100_000, chains=4, cores=4, blas_cores="auto")

You can also try changing the blas_cores argument. If your logp function spends a lot of time in BLAS calls (matrix multiplications etc.), that might change the behaviour, although I don’t understand why it would only use half the cores by default. This also uses 4 cores locally for me:

import pymc as pm
import numpy as np
import pytensor.tensor as pt

A = np.random.randn(1000, 1000)

with pm.Model() as model:
    x = pm.Normal("x", shape=1000)
    y = pm.Normal("y", mu=A @ x, shape=1000)

with model:
    trace = pm.sample(draws=100_000, chains=4, cores=4, blas_cores="auto")

Thanks for helping out.
I tried the first script, also changing blas_cores, with the same result as before: only 2 cores used, 50% CPU per python process.
This is with pymc 5.18.1.

I am starting to suspect that whatever library PyMC uses for this is misbehaving, also because of this odd core stickiness.
Usually jobs move around when the CPU is not fully loaded, to improve thermals. Here this doesn’t happen.

I regularly use Python’s multiprocessing library, and that works fine.
I checked whether psutil returns the right number of cores, and it does.
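
A related check is whether the process is allowed to run on all cores, since a restricted CPU affinity would look exactly like "sticky" processes sharing cores. This is a minimal sketch using only the standard library; os.sched_getaffinity is available on Linux but not on every platform, and the function name cpu_summary is made up for illustration.

```python
import os


def cpu_summary():
    """Compare the number of logical CPUs with the CPUs this process may use."""
    if hasattr(os, "sched_getaffinity"):
        # Set of CPU indices the current process (pid 0 = self) may run on.
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = None  # platform does not expose CPU affinity
    return {"logical_cpus": os.cpu_count(), "allowed_cpus": allowed}


print(cpu_summary())
```

Running this once in a fresh interpreter and once after importing pymc would show whether an import narrows the allowed set. That an import could do so is only a guess at this point, not something established in this thread.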

That’s really strange. PyMC just uses multiprocessing, there is nothing particularly special happening.

Can you try disabling the progress bar? It also uses some extra threads, and while I don’t see why it would cause something like this, it is better to disable it for debugging, I think.

Can you also check if this is using the 4 threads correctly?

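import multiprocessing as mp

def run_loop():
    # Busy loop that keeps one core at 100% until the process is killed.
    while True:
        pass

if __name__ == "__main__":
    # Start four independent workers; each should occupy its own core.
    processes = [mp.Process(target=run_loop) for _ in range(4)]

    for process in processes:
        process.start()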

Other than that, what operating system are you using? How did you install the packages? Anything else that could be unusual about your setup?

The multiprocessing script uses all cores.
Turning the progress bar off in the previous script made no difference.

The multiprocessing Pool call has an option for the number of worker processes too.
E.g. pool = Pool(processes=20)

Is this inherited from pymc?

For the rest, this is Ubuntu with Anaconda; PyMC is installed with pip.

What happens if you install via the official installation instructions?
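
To see how each package ended up in the environment, a minimal sketch along these lines may help. It relies on the INSTALLER file in the package metadata, which pip writes and, as far as I know, conda packages usually carry as well; the function name installer_report and the default package list are just assumptions for illustration.

```python
from importlib import metadata


def installer_report(packages=("pymc", "pytensor", "numpy", "scipy")):
    """Map each package name to (version, installer), or None if not installed."""
    report = {}
    for name in packages:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        # INSTALLER is optional metadata; fall back to "unknown" if it is absent.
        installer = (dist.read_text("INSTALLER") or "unknown").strip()
        report[name] = (dist.version, installer)
    return report


for name, info in installer_report().items():
    print(name, info)
```

A mix of installers in one environment (for example numpy from conda but pymc and pytensor from pip) would be one hint that the numerical libraries differ from what the official instructions provide.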

Cluhmann,
That actually fixed it.
Thank you!

(Python’s package management is very frustrating!)

I have to say your case is very puzzling.

Thanks again for the help!
I guess there is some subtle change between versions in how the multiprocessing via pipes is done…