PyMC3 multiprocessing issue on a Kubernetes cloud

I’m currently debugging a PyMC3-based script that trains a model using ADVI. The job runs on a Kubernetes node - the pod that runs the job has 8 vCPUs available (a request of 7500m specifically, without a limit). The problem is that when I check the CPU utilization graphs, the actual usage never goes beyond 4 CPUs. However, in a local Docker container it utilizes all available cores. Also worth noting: all of the virtual instances are based on Intel Xeon CPUs, and the node I am using has 8 virtual CPUs.

I have found a couple of places in pymc3 and theano where core detection comes into play. In pymc3 there is a parallel_processing script where CPU detection is now done with multiprocessing.cpu_count() (previously psutil.cpu_count()). However, this is only used in sampling and in the experimental SMC-ABC, and we are not using sampling but variational inference (from the inference module). In theano there are a couple of places and a cpuCount() function but on the
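Since core detection is in question here, a quick way to compare what the different detection methods report inside the pod versus the local container is a small script like the one below. This is only a sketch using the Python standard library: the helper names visible_cpu_counts and cgroup_cpu_limit are made up for illustration, the cgroup paths are the usual Linux locations and may differ on a given image, and nothing here is what pymc3 or theano do internally.

```python
import multiprocessing
import os


def visible_cpu_counts():
    """Collect the CPU counts that different detection methods report."""
    counts = {
        # Host-level counts; these ignore affinity masks and cgroup quotas.
        "multiprocessing.cpu_count": multiprocessing.cpu_count(),
        "os.cpu_count": os.cpu_count(),
    }
    # CPUs this process is actually allowed to run on (Linux only).
    if hasattr(os, "sched_getaffinity"):
        counts["sched_getaffinity"] = len(os.sched_getaffinity(0))
    return counts


def cgroup_cpu_limit():
    """Return the cgroup CPU quota in cores, or None if no quota is set."""
    try:
        # cgroup v2 format: "<quota> <period>" or "max <period>".
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        return None if quota == "max" else int(quota) / int(period)
    except (OSError, ValueError):
        pass
    try:
        # cgroup v1 keeps quota and period in separate files.
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        return None if quota <= 0 else quota / period
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    for name, value in visible_cpu_counts().items():
        print(f"{name}: {value}")
    print(f"cgroup CPU limit: {cgroup_cpu_limit()}")
```

Running this in both environments shows whether they actually disagree about the number of usable CPUs. With a CPU request and no limit, I would expect the cgroup quota to come back as None.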

So, my question is: how does pymc3 handle multiprocessing, and how can I control it? I see that in pymc3 it is possible to pass a dict argument to theano, but I haven’t found anything useful yet.

Another question would be: in the case of variational inference and ADVI, is it worth worrying about multiprocessing at all? Maybe the gains are so negligible that I should just forget about this and move on?

Forgive any factual errors I might have made in the area of machine learning; I am a QA Engineer trying to research a performance issue on our Kubernetes setup.

Thanks in advance for any tips and pointers.
Best and stay safe,

The source of the other threads will not be multiprocessing, I think, but either OpenMP through theano if you have big datasets somewhere, or BLAS if you are using e.g. matrix-vector products.
You can configure the theano parallelization as described here:
For BLAS it depends a bit on which implementation you are using (probably one of MKL, OpenBLAS, or BLIS). On an Intel CPU, MKL is usually the go-to implementation; you should get that automatically if it is using conda internally, but I’m not sure how that works on your base image.
You should be able to control the number of MKL threads with the environment variable OMP_NUM_THREADS.
Whether it is worth the trouble really depends on your model. If you do large matrix-vector products it might very well be.
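As a rough sketch of the environment-variable approach: the thread count has to be in the environment before the numerical libraries are loaded, because they typically read it at import time. OMP_NUM_THREADS is the variable mentioned above; MKL_NUM_THREADS and OPENBLAS_NUM_THREADS are the implementation-specific variables for MKL and OpenBLAS, added here on the assumption that one of those is the BLAS in use. The helper name set_thread_limits is made up for this example.

```python
import os

# Environment variables read by common threading backends.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",       # OpenMP; also honored by MKL and OpenBLAS
    "MKL_NUM_THREADS",       # Intel MKL specific
    "OPENBLAS_NUM_THREADS",  # OpenBLAS specific
)


def set_thread_limits(n_threads):
    """Set all thread-count variables and return what was set."""
    for name in THREAD_ENV_VARS:
        os.environ[name] = str(n_threads)
    return {name: os.environ[name] for name in THREAD_ENV_VARS}


# Call this at the very top of the script, before the heavy imports.
set_thread_limits(8)

# Only import numpy / theano / pymc3 after this point, so that the
# libraries see the values when they initialize their thread pools.
```

On Kubernetes the same variables can be set in the pod spec instead, which avoids depending on import order inside the script.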

Thanks. I have attempted setting OMP_NUM_THREADS=8 and it still only utilizes 4. config.openmp is set to True in theanorc. Is there anything else I can try?

Which BLAS is it?
I’d try disabling parallelization and then switching it on one piece at a time to find out what’s going on.
Or, maybe better: try perf record and perf report to see what it is actually doing.
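One way to do the one-at-a-time comparison is to run the same workload in a fresh subprocess for each thread count, so every run starts with a clean environment. The sketch below uses only the standard library; run_with_threads and compare_thread_counts are made-up helper names, and the WORKLOAD string is a placeholder that assumes numpy is installed and should be replaced with something close to the real model.

```python
import os
import subprocess
import sys
import time

THREAD_ENV_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")

# Placeholder workload (assumes numpy is available in the image).
WORKLOAD = "import numpy as np; a = np.random.rand(2000, 2000); (a @ a).sum()"


def run_with_threads(n_threads, code):
    """Run `code` in a new Python process with the given thread limit.

    Returns (elapsed_seconds, stdout). The elapsed time includes
    interpreter startup, so use a workload that runs for a while.
    """
    env = dict(os.environ)
    for name in THREAD_ENV_VARS:
        env[name] = str(n_threads)
    start = time.perf_counter()
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return time.perf_counter() - start, result.stdout.strip()


def compare_thread_counts(code, counts=(1, 2, 4, 8)):
    """Time the same workload under several thread limits."""
    return {n: run_with_threads(n, code)[0] for n in counts}
```

Calling compare_thread_counts(WORKLOAD) while watching the CPU graphs shows whether the timings and the utilization stop improving at some thread count.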

Okay, new development. The OMP_NUM_THREADS variable does take effect, but only if I set it below 4. I limited it to 2 and it worked, but when I increased it to 8 it still only reached 4 cores.

