Diagnosing PyMC v4 slow sampling - Linux, Kubernetes notebook

Edit: sampling is blazing fast after standardizing the data.

Hey all! I’m using PyMC in a Kubernetes/Kubeflow notebook, in its own conda virtual environment (per the official instructions). However, sampling (prior predictive, posterior, and posterior predictive) is slower than expected: about as slow as my local machine on a fairly simple model. The notebook server has 64 GB of memory, 4 CPUs, and 2 GPUs.
I’m on pymc version 4.1.3 and aesara version 2.7.7.

I don’t see the “NumPy BLAS functions” warning, but I’m wondering whether my installation is using Aesara correctly. Any help would be much appreciated.

Per this thread, I ran `python -m aesara.misc.check_blas`. Here’s the output, though I’m not sure how to read it:

~$ python -m aesara.misc.check_blas

    Some results that you can compare against. They were 10 executions
    of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
    All memory layout was in C order.

    CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                Core i7 950(3.07GHz, hyper-threads enabled)
                Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

    Libraries tested:
        * numpy with ATLAS from distribution (FC9) package (1 thread)
        * manually compiled numpy and ATLAS with 2 threads
        * goto 1.26 with 1, 2, 4 and 8 threads
        * goto2 1.13 compiled with multiple threads enabled

                      Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
    lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

    numpy 1.3.0 blas                                                775.92s
    numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
    goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
    numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
    goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
    goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
    goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
    openblas/1                                        14.04s
    openblas/2                                         7.16s
    openblas/4                                         3.71s
    openblas/8                                         3.70s
    mkl 11.0.083/1            7.97s
    mkl                                         13.7s
    mkl                                          7.6s
    mkl                                          4.0s
    mkl                                          2.0s
    goto2 1.13/1                                                     14.37s
    goto2 1.13/2                                                      7.26s
    goto2 1.13/4                                                      3.70s
    goto2 1.13/8                                                      1.94s
    goto2 1.13/16                                                     3.16s

    Test time in float32. There were 10 executions of gemm in
    float32 with matrices of shape 5000x5000 (M=N=K=5000)
    All memory layout was in C order.

    cuda version      8.0    7.5    7.0
    M40               0.45s  0.47s
    k80               0.92s  0.96s
    K6000/NOECC       0.71s         0.69s
    P6000/NOECC       0.25s

    Titan X (Pascal)  0.28s
    GTX Titan X       0.45s  0.45s  0.47s
    GTX Titan Black   0.66s  0.64s  0.64s
    GTX 1080          0.35s
    GTX 980 Ti               0.41s
    GTX 970                  0.66s
    GTX 680                         1.57s
    GTX 750 Ti               2.01s  2.01s
    GTX 750                  2.46s  2.37s
    GTX 660                  2.32s  2.32s
    GTX 580                  2.42s
    GTX 480                  2.87s
    TX1                             7.6s (float32 storage and computation)
    GT 610                          33.5s

Some Aesara flags:
blas__ldflags= -L/opt/conda/envs/pymc_env/lib -lmkl_rt -lpthread -lm -lm
compiledir= /home/jovyan/.aesara/compiledir_Linux-5.4-amzn2.x86_64-x86_64-with-glibc2.31-x86_64-3.10.5-64
floatX= float64
device= cpu
Some OS information:
sys.platform= linux
sys.version= 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0]
sys.prefix= /opt/conda/envs/pymc_env
Some environment variables:

Numpy config: (used when the Aesara flag "blas__ldflags" is empty)
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/opt/conda/envs/pymc_env/lib']
include_dirs = ['/opt/conda/envs/pymc_env/include']
language = c
define_macros = [('HAVE_CBLAS', None)]
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
libraries = ['cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/opt/conda/envs/pymc_env/lib']
include_dirs = ['/opt/conda/envs/pymc_env/include']
language = c
libraries = ['lapack', 'blas', 'lapack', 'blas']
library_dirs = ['/opt/conda/envs/pymc_env/lib']
language = f77
libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
library_dirs = ['/opt/conda/envs/pymc_env/lib']
language = c
define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
include_dirs = ['/opt/conda/envs/pymc_env/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: /opt/conda/envs/pymc_env/lib/python3.10/site-packages/numpy/__init__.py
Numpy version: 1.23.1

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 6.18s on CPU (with direct Aesara binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
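For a quick machine-to-machine comparison without going through Aesara, you can replicate roughly the same benchmark in plain NumPy, since `@` on float64 arrays dispatches to the linked BLAS gemm. A minimal sketch matching the 2000x2000 float64 test in the reference table above (run it on both the pod and your local machine):

```python
import time
import numpy as np

# Same shape as the reference table: 10 float64 gemms, M = N = K = 2000
rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
b = rng.standard_normal((2000, 2000))

a @ b  # warm-up: the first call is typically a bit slower

start = time.perf_counter()
for _ in range(10):
    c = a @ b
elapsed = time.perf_counter() - start
print(f"10 gemm calls took {elapsed:.2f}s")
```

If the cloud pod and your laptop land in the same ballpark here, the bottleneck is raw CPU BLAS throughput rather than anything PyMC- or Aesara-specific.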

Why do you expect the cloud node to be significantly faster than your desktop/laptop? Because it’s a faster CPU or because of GPUs?

If the former, I doubt the cloud instance is much higher specced than your local box.

If the latter, you are most likely not using GPU sampling here.

In any case, it looks like MKL is being used correctly (a missing or broken BLAS link is the usual source of slowness), so I don’t suspect your sampling is unreasonably slow.
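One thing worth checking on a 4-CPU pod is how many threads MKL is actually allowed to use; the reference table above shows gemm time scaling nearly linearly with thread count. A hedged sketch, assuming MKL honors `MKL_NUM_THREADS` read at library load time (so it must be set before NumPy or Aesara first load the BLAS library):

```python
import os

# Assumption: MKL reads MKL_NUM_THREADS when the library is first loaded,
# so this must run before the first `import numpy` / `import aesara`.
os.environ["MKL_NUM_THREADS"] = "4"  # match the pod's 4 CPUs

import numpy as np  # noqa: E402  (MKL loads here and picks up the setting)
```

Kubernetes CPU limits can also throttle a pod below its advertised core count, so it is worth confirming the pod really has 4 usable CPUs.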


Makes sense. I had assumed the cloud would be faster; this was my first time trying PyMC on it. I haven’t attempted to use JAX on GPU yet. I didn’t realize just how much standardizing the data would speed up sampling, but that’s all I needed. Thanks so much for the quick reply! :pray:
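For anyone landing here later: “standardizing” just means rescaling each predictor (and often the response) to zero mean and unit standard deviation before building the model, which usually makes the posterior geometry much friendlier to NUTS. A minimal NumPy sketch with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=12.0, size=(100, 3))  # raw predictors

# Standardize each column: zero mean, unit standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 per column
print(X_std.std(axis=0))   # approximately 1 per column
```

Remember to apply the same transformation (using the training means and standard deviations) to any new data before prediction, and to rescale coefficients back if you need them on the original scale.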