New machine does not use more than 1 core for linear algebra, unresponsive to changing env variables

Hi there! I am using pymc3 to fit data corresponding to the Fourier transform of a signal. I have a codebase that is rather complicated, but I can reproduce the behavior with a simple snippet, shown below, which just generates a brief signal and fits the real and imaginary parts of its Fourier transform.

The issue is that on my previous machine, running Ubuntu 18.04 with an Intel processor, pm.sample(chains=1) would use 6 cores by default (the machine has 12 counting hyperthreading). This was with no modification to any environment variables and with no .theanorc. On a new machine with a fresh install of Ubuntu 20.04 and an AMD processor with 8 cores (16 with hyperthreading), it never uses more than 1 core. Even setting OMP, MKL, GOTO, and OPENBLAS_NUM_THREADS to 16 (a scattershot approach, admittedly) does nothing to change this behavior.

However, running check_blas.py in the theano/misc/ directory shows 8 cores being used by default, dropping to any number less than 8 when MKL_NUM_THREADS is changed. More to the point, the numpy.__config__.show() output indicates MKL is being used, which I know is not great on AMD, but it should still use more than one core for large array operations (and indeed does so when check_blas.py is run).
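For reference, the kind of sanity check I ran outside of theano looks roughly like the sketch below (my own minimal version, not check_blas.py itself; the matrix size is arbitrary). While it runs I watch htop to see how many cores the dot product occupies.

import os

# Thread limits have to be set before numpy is imported to take effect.
os.environ["MKL_NUM_THREADS"] = "16"
os.environ["OMP_NUM_THREADS"] = "16"
os.environ["OPENBLAS_NUM_THREADS"] = "16"

import time
import numpy as np

# A matrix product large enough that a threaded BLAS should spread the
# work across several cores.
a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
start = time.time()
c = a @ b
print(f"4000x4000 dot took {time.time() - start:.2f} s")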

Could MKL be causing issues in the context of pymc3 in some way that does not come up in theano's check_blas.py? What setup and installation procedure would you recommend for an AMD system? I have not built numpy or theano from source before, so I would need some guidance on how to build them against non-default BLAS libraries.
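In case it helps narrow things down, here is a sketch of how I can check which BLAS the running process actually loads and what thread limit it reports (it assumes the threadpoolctl package, which is not something I have installed by default):

from threadpoolctl import threadpool_info  # assumes threadpoolctl is installed
import numpy as np  # importing numpy loads its BLAS

for pool in threadpool_info():
    # Each entry reports the shared library (e.g. libmkl_rt.so), the API it
    # implements ("blas" or "openmp"), and the current thread limit.
    print(pool["internal_api"], pool.get("filepath"), pool["num_threads"])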

Please provide a minimal, self-contained, and reproducible example.

import pymc3 as pm
import numpy as np

length = 100
x_pos = np.linspace(0,99,length)
signal = np.random.rand(length)
u_pos = np.linspace(0, 10, 10000)

fourier_matrix_real = np.exp(np.cos(2.0*np.pi*np.outer(u_pos, x_pos)))
fourier_matrix_imag = np.exp(np.sin(2.0*np.pi*np.outer(u_pos, x_pos)))
data_real = np.dot(fourier_matrix_real, signal)
data_imag = np.dot(fourier_matrix_imag, signal)
sd = 0.1*np.sqrt(data_real**2+data_imag**2)

with pm.Model() as model:
    signal_prior = pm.Uniform('signal',lower=0,upper=1,shape=length)
    prediction_real = pm.math.dot(fourier_matrix_real, signal_prior)
    prediction_imag = pm.math.dot(fourier_matrix_imag, signal_prior)
    likelihood_real = pm.Normal('real_FT', mu=prediction_real, sd=sd, observed=data_real)
    likelihood_imag = pm.Normal('imag_FT', mu=prediction_imag, sd=sd, observed=data_imag)

    posterior = pm.sample(chains=1)

Please provide the full traceback.

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Sequential sampling (1 chains in 1 job)
NUTS: [signal]
#progress bar, no errors

Please provide any additional information below.

Here is the output of numpy.__config__.show():

blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/daniel/anaconda3/envs/py38/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/daniel/anaconda3/envs/py38/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/daniel/anaconda3/envs/py38/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/daniel/anaconda3/envs/py38/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/daniel/anaconda3/envs/py38/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/daniel/anaconda3/envs/py38/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/home/daniel/anaconda3/envs/py38/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/daniel/anaconda3/envs/py38/include']

Versions and main components

  • PyMC3 Version: 3.9.3
  • Theano Version: 1.0.5
  • Python Version: 3.8.5
  • Operating system: Ubuntu 20.04
  • How did you install PyMC3: conda, no flags besides -c conda-forge

Thank you for your attention, and for developing pymc3!

That is indeed odd that threading stopped working. @aseyboldt has done a lot with this, I wonder if he has any ideas.

Were there ever any developments regarding this issue? It still persists after recent updates to pymc3 and its dependencies.

Maybe this is not the answer you are looking for. Since you are on Linux and using conda, you should first settle on the channel you want to use (main or conda-forge). After that, the crux of the issue is installing MKL or OpenBLAS (depending on your processor), and yes, that can be hard. Then, please update PyMC3, Theano, and ArviZ.

Now, you need to include some environment variables in your .bashrc:

export OPENBLAS_NUM_THREADS=1
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

You wrote GOTO; I am not familiar with that environment variable.
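If you prefer not to edit .bashrc, here is a sketch of the same idea at the top of the script (as far as I know, the variables have to be set before numpy/theano/pymc3 are imported for them to take effect):

import os

# Set the thread limits before any BLAS-backed library is imported;
# adjust the value while you search for the optimum.
for var in ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "NUMEXPR_NUM_THREADS", "OMP_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np
import pymc3 as pm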

After this, your code:

import pymc3 as pm
import numpy as np
import arviz as az

SEED = 99138 # from random.org
np.random.seed(SEED)

length = 20 # you may want to change this
x_pos = np.linspace(0,99,length)
signal = np.random.rand(length)
u_pos = np.linspace(0, 10, 10000)

fourier_matrix_real = np.exp(np.cos(2.0*np.pi*np.outer(u_pos, x_pos)))
fourier_matrix_imag = np.exp(np.sin(2.0*np.pi*np.outer(u_pos, x_pos)))
data_real = np.dot(fourier_matrix_real, signal)
data_imag = np.dot(fourier_matrix_imag, signal)
sd = 0.1*np.sqrt(data_real**2+data_imag**2)

with pm.Model() as model:
    signal_prior = pm.Uniform('signal',lower=0,upper=1,shape=length)
    prediction_real = pm.math.dot(fourier_matrix_real, signal_prior)
    prediction_imag = pm.math.dot(fourier_matrix_imag, signal_prior)
    likelihood_real = pm.Normal('real_FT', mu=prediction_real, sd=sd, observed=data_real)
    likelihood_imag = pm.Normal('imag_FT', mu=prediction_imag, sd=sd, observed=data_imag)

    posterior = pm.sample(random_seed=SEED) # pay attention to this
    
print(az.summary(posterior))

How long did it take? How many chains were sampled?

I am writing this because I don't know whether you have the best configuration for PyMC3, so the first step is to try different values for those four variables in your .bashrc, from 1 up to 16 (change all of them at once). If you make those changes in the terminal, don't forget to run source ~/.bashrc. When you have found the optimal number (for me it is 1; otherwise it won't run), change this line to include the cores option:

posterior = pm.sample(random_seed=SEED, cores=1) # pay attention to this, change cores from 1 to 16

With that, I'm sure you will find that cores=16 uses every core of your processor. In fact, if you don't include the cores option, the number of chains will be eight (am I wrong?). Why? You said you have a new processor with eight cores, so by default PyMC3 uses that number for the number of chains to be sampled; in other words, all the cores are used while the model is being sampled. If you want to change the number of chains, use the cores option.

If I am not mistaken, PyMC3 uses one thread (although it might be one core) for one chain. Remember that you care about the chains, not the cores.
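As a rough way to compare settings, here is a sketch (the numbers are just examples) that times the sampling for a few values of cores while keeping the number of chains fixed, so only the degree of parallelism changes:

import time

for n_cores in (1, 4, 8):
    with model:
        start = time.time()
        # chains is pinned so only the number of worker processes varies
        trace = pm.sample(random_seed=SEED, chains=4, cores=n_cores)
    print(f"cores={n_cores}: {time.time() - start:.1f} s")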

And since you keep using MKL on an AMD processor, this post could help you.
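I don't know exactly which workaround that post describes, but the one I see mentioned most often (an assumption on my part, and it reportedly only works with MKL releases up to 2020.0) is to force MKL's AVX2 code path on AMD before anything loads MKL:

import os

# Assumption: MKL_DEBUG_CPU_TYPE=5 forces MKL's AVX2 code path on AMD, but it
# is reportedly only honoured by MKL releases up to 2020.0; it must be set
# before numpy/theano load MKL.
os.environ["MKL_DEBUG_CPU_TYPE"] = "5"

import numpy as np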