Bayesian VAR example notebook: extremely low sampling rate

Hi. First of all, thank you for PyMC and the examples on the website. Great work!

I have started playing with the Bayesian Vector Autoregressive Model example at Bayesian Vector Autoregressive Models — PyMC example gallery. While executing it, I see a very large difference in sampling time compared to what is shown at that link: a few minutes in the published example versus tens of hours in my local run.
Original example:


while on my decently powerful Ubuntu machine:

What is the reason for such a dramatic difference, given that the code is identical?

Here is the watermark of my execution:

Last updated: Mon May 22 2023

Python implementation: CPython
Python version       : 3.11.3
IPython version      : 8.13.2

pytensor: 2.11.1
aeppl   : not installed
xarray  : 2023.4.2

arviz      : 0.15.1
numpy      : 1.24.3
pymc       : 5.3.0
matplotlib : 3.7.1
pandas     : 2.0.1
statsmodels: 0.14.0

Watermark: 2.3.1

When you import PyMC, do you see the warning WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions?

Nope:

Just a tqdm/notebook warning. There are no warnings at all if I import from an IPython or plain Python console.

Thanks for the reply.

I ran the notebook on my laptop and got the advertised sampling time. Assuming no copy-paste errors, I still suspect that your BLAS installation isn’t correct. Did you install PyMC the recommended way? Could you try running python -m pytensor.misc.check_blas in a terminal and check that the blas__ldflags line is not empty?

That’s a good guess indeed. But blas__ldflags is not empty in my case:

$ python -m pytensor.misc.check_blas

[...omitting the fixed text at the beginning of the output...]

Some PyTensor flags:
    blas__ldflags= -L/home/ubuntu/miniconda3/envs/pymc/lib -lcblas -lblas -lcblas -lblas
    compiledir= /home/ubuntu/.pytensor/compiledir_Linux-5.15--aws-x86_64-with-glibc2.31-x86_64-3.11.3-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= linux
    sys.version= 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]
    sys.prefix= /home/ubuntu/miniconda3/envs/pymc
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc/lib']
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc/lib']
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: /home/ubuntu/miniconda3/envs/pymc/lib/python3.11/site-packages/numpy/__init__.py
Numpy version: 1.24.3

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 1.51s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

Yes, I installed following the recommended way.

To be completely sure, I just rebuilt the installation from scratch, following the recommended installation instructions for Linux, to which I added:

conda install -c conda-forge arviz statsmodels python-graphviz ipython

I ran the example cell-by-cell within IPython until I reached the exact same issue:

Here is the BLAS check output:

$ python -m pytensor.misc.check_blas

[...cut...]

Some PyTensor flags:
    blas__ldflags= -L/home/ubuntu/miniconda3/envs/pymc_env/lib -lcblas -lblas -lcblas -lblas
    compiledir= /home/ubuntu/.pytensor/compiledir_Linux-5.15--aws-x86_64-with-glibc2.31-x86_64-3.11.3-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= linux
    sys.version= 3.11.3 | packaged by conda-forge | (main, Apr  6 2023, 08:57:19) [GCC 11.3.0]
    sys.prefix= /home/ubuntu/miniconda3/envs/pymc_env
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/lib']
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/lib']
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/home/ubuntu/miniconda3/envs/pymc_env/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX
    not found = AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: /home/ubuntu/miniconda3/envs/pymc_env/lib/python3.11/site-packages/numpy/__init__.py
Numpy version: 1.24.3

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 1.59s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.

Note 1: the machine I’m using has many CPUs, but only two are used while this code runs.

Note 2: if I run other PyMC code/models on the same machine/environment, sampling is much faster and uses multiple cores. So the problem seems specific to this example.

Additional information:

  • At the very beginning of the sampling process all cores are used (the machine has many) and this is clearly visible in htop, but after a few seconds only two cores remain at 100% while all the others stay at 0%. The sampling rate becomes extremely low.

  • I tried to control the number of cores/threads manually, both via OMP_NUM_THREADS=... ipython and by setting pm.sample(cores=...): it worked as expected for a few seconds, with the requested number of cores active, but then the same problem as above reappeared.
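A side note on the OMP_NUM_THREADS experiment above: BLAS/OpenMP thread limits are typically read only once, when the numeric libraries initialize, so they must be in the environment before NumPy/PyTensor are first imported. A minimal, dependency-free sketch (the variable names are the usual OpenMP/OpenBLAS ones, not PyMC-specific):

```python
import os

# Set thread limits BEFORE importing numpy/pytensor/pymc: most BLAS builds
# read these variables only once, when their thread pool is created.
os.environ["OMP_NUM_THREADS"] = "1"       # generic OpenMP limit
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS-specific override

# Only now would you import the numeric stack, e.g.
#   import pymc as pm
# (imports omitted here so the sketch stays dependency-free)

print(os.environ["OMP_NUM_THREADS"])  # -> 1
```

Launching the interpreter with the variable already exported (OMP_NUM_THREADS=1 ipython) achieves the same thing without relying on import order.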

What happens if you set cores=1? Does it not get stuck then?

Thank you @ricardoV94 , the problem of the extremely low sampling rate is now solved :+1: if I do any one of the following:

  • cores=1, as you suggested. Unexpectedly, setting that value unleashes all the cores of my machine, leading to a sampling time of approximately 12 minutes for 3000 draws in each of the two sequential chains, plus 3 minutes to sample another 4000 draws (4 chains, one core).
  • nuts_sampler="blackjax". It uses only one core but gets 8000 draws in 6 minutes.
  • nuts_sampler="numpyro". It uses 4 cores (one per chain) for 30 seconds (3000 draws), then a single core for 6 minutes for 8000 draws.
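For quick reference, the three workarounds map onto pm.sample() keyword arguments like the ones below. This is only a sketch: the dicts just record the arguments (so the snippet runs without PyMC, BlackJAX, or NumPyro installed); with PyMC available, each entry would be passed as pm.sample(draws=2000, **kwargs), and the alternative backends require blackjax/numpyro to be installed separately.

```python
# Sketch only: the three workarounds expressed as pm.sample() keyword arguments.
# With PyMC installed, each entry would be used as: pm.sample(draws=2000, **kwargs)
workarounds = {
    "single_core":   {"cores": 1},                  # one chain process at a time
    "blackjax_nuts": {"nuts_sampler": "blackjax"},  # JAX-based NUTS backend
    "numpyro_nuts":  {"nuts_sampler": "numpyro"},   # NumPyro NUTS backend
}

for name, kwargs in workarounds.items():
    print(name, kwargs)
```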

With the default nuts_sampler="pymc" and without setting cores=1, it takes many hours (and uses only two cores).

What is the rationale behind these observations? Why does cores=1 save the day?

I am not sure; I faced something similar recently but couldn’t pinpoint it. Can I ask you to run a silly experiment?

Can you try running this dumb model with multiple cores before running the big one, in the same notebook/script where you were trying to run the BVAR?

import pymc as pm

with pm.Model() as m:
    pm.Normal("x")
    pm.sample()

Does it also hang? And if it doesn’t, does the big model afterwards still hang?

Hi @ricardoV94 , I ran what you proposed:

  • the simple model does not hang and samples in 1 second;
  • irrespective of where I put the simple model (before the big model’s context manager, or even inside the context manager but before the big model’s pm.sample()), the result is always the same: the big model is super slow unless I manually specify another nuts_sampler=... or set cores=1 as you suggested.

In some sense, the Auto-assigning NUTS sampler... stage selects something that is not good for this model, at least on all the machines I tried: 3 different ones, all Ubuntu 20.04, but with significantly different hardware. I’d be curious to know @jessegrabowski ’s configuration, since the notebook worked as expected in his case.

It’s an M1 macbook with a fresh environment, running in a jupyter notebook.

Thanks @emanuele it doesn’t sound like the issue I found then.

Hi @jessegrabowski , thank you for your reply. I tried the same example on an M1 MacBook (via mambaforge for arm64), as you did, and indeed the problem does not show up: the default sampling time is as advertised in the notebook. So the problem described in this thread seems specific to x86_64, or maybe just to Ubuntu?

Would you or @ricardoV94 suggest I file an issue on GitHub?

Without a way of reproducing it, or confirmation from someone else that they hit the same problem, I don’t think it would be useful as an issue there.

True, reproducibility is essential.

So, I am attaching below the minimal .py file, extracted from the first part of the notebook, that shows the problem: with the default sampler settings (“Auto-assigning NUTS sampler...”), the sampling time on some architectures explodes from minutes to tens of hours.

The environment was created with:

conda create -c conda-forge -n pymc_env "pymc>=4"
conda activate pymc_env

  • I consistently DO NOT observe the issue on a MacBook M1 (arm64), same as @jessegrabowski : the sampling time is just a few minutes, as it should be.

I invite everyone reading this post to test, especially on x86_64, and report here.

Note: by changing part of line 113 of the code attached below, from

pm.sample(draws=2000, random_seed=130)

to

pm.sample(cores=1, draws=2000, random_seed=130)

or by specifying another NUTS sampler, like blackjax or numpyro, the problem is also solved on x86_64.

bug.py (4.7 KB)

Okay that’s extremely useful. I take back my statement and welcome you to open a Github issue!


Done!


I ran bug.py on my Windows machine in a fresh environment, as well as on Ubuntu via WSL. I can confirm the slow sampling on Linux, but not on Windows.
