Edit: sampling is blazing fast after standardizing the data.
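(For anyone who lands here later: the fix was just z-scoring the data before building the model. A minimal sketch of what I mean, with made-up data and variable names, nothing specific to my actual model:)

    import numpy as np
    import pymc as pm

    # Toy data on a wildly non-unit scale (illustrative only)
    rng = np.random.default_rng(0)
    x = rng.normal(50_000, 10_000, size=500)
    y = 3.0 + 0.0002 * x + rng.normal(0.0, 1.0, size=500)

    # Z-score predictor and outcome so the sampler works on unit-scale quantities
    x_std = (x - x.mean()) / x.std()
    y_std = (y - y.mean()) / y.std()

    with pm.Model():
        alpha = pm.Normal("alpha", 0.0, 1.0)
        beta = pm.Normal("beta", 0.0, 1.0)
        sigma = pm.HalfNormal("sigma", 1.0)
        pm.Normal("obs", mu=alpha + beta * x_std, sigma=sigma, observed=y_std)
        idata = pm.sample()  # samples quickly once everything is standardized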
Hey all! I’m using PyMC in a Kubernetes/Kubeflow notebook, in its own conda virtual environment (per the official instructions). However, sampling (prior predictive, posterior, and posterior predictive) is slower than expected on a fairly simple model, about as slow as on my local machine. The notebook server has 64 GB of memory, 4 CPUs, and 2 GPUs.
I’m on pymc 4.1.3 and aesara 2.7.7.
I don’t see the “Numpy BLAS functions” warning, but I’m wondering whether my installation is set up so that Aesara actually uses a fast BLAS. Any help would be much appreciated.
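For reference, here’s roughly how I was poking at the configuration from inside the notebook; as far as I know these config attributes exist in aesara 2.7 and numpy 1.23, but correct me if there’s a better way:

    import aesara
    import numpy as np

    # BLAS libraries Aesara links against when compiling gemm/gemv ops
    print(aesara.config.blas__ldflags)

    # Target device and default float dtype
    print(aesara.config.device, aesara.config.floatX)

    # BLAS/LAPACK that NumPy itself was built against
    # (only used by Aesara when blas__ldflags is empty)
    np.show_config()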
Per this thread, I ran python -m aesara.misc.check_blas
and here’s the output, though I’m not sure how to interpret it:
~$ python -m aesara.misc.check_blas
Some results that you can compare against. They were 10 executions of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000). All memory layout was in C order.

CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
            Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
            Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
            Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?),
            Core 2 E8500,
            Core i7 930(2.8Ghz, hyper-threads enabled),
            Core i7 950(3.07GHz, hyper-threads enabled),
            Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)

Libraries tested:
* numpy with ATLAS from distribution (FC9) package (1 thread)
* manually compiled numpy and ATLAS with 2 threads
* goto 1.26 with 1, 2, 4 and 8 threads
* goto2 1.13 compiled with multiple threads enabled

                   Xeon   Xeon   Xeon   Core2  i7     i7     Xeon   Xeon
lib/nb threads     E5345  E5430  E5450  E8500  930    950    X5560  X5550
numpy 1.3.0 blas   775.92s
numpy_FC9_atlas/1  39.2s  35.0s  30.7s  29.6s  21.5s  19.60s
goto/1             18.7s  16.1s  14.2s  13.7s  16.1s  14.67s
numpy_MAN_atlas/2  12.0s  11.6s  10.2s   9.2s   9.0s
goto/2              9.5s   8.1s   7.1s   7.3s   8.1s   7.4s
goto/4              4.9s   4.4s   3.7s    -     4.1s   3.8s
goto/8              2.7s   2.4s   2.0s    -     4.1s   3.8s
openblas/1         14.04s
openblas/2          7.16s
openblas/4          3.71s
openblas/8          3.70s
mkl 11.0.083/1      7.97s
mkl 10.2.2.025/1   13.7s
mkl 10.2.2.025/2    7.6s
mkl 10.2.2.025/4    4.0s
mkl 10.2.2.025/8    2.0s
goto2 1.13/1       14.37s
goto2 1.13/2        7.26s
goto2 1.13/4        3.70s
goto2 1.13/8        1.94s
goto2 1.13/16       3.16s

Test time in float32. There were 10 executions of gemm in float32 with matrices of shape 5000x5000 (M=N=K=5000). All memory layout was in C order.

cuda version       8.0    7.5    7.0
gpu
M40                0.45s  0.47s
k80                0.92s  0.96s
K6000/NOECC        0.71s  0.69s
P6000/NOECC        0.25s
Titan X (Pascal)   0.28s
GTX Titan X        0.45s  0.45s  0.47s
GTX Titan Black    0.66s  0.64s  0.64s
GTX 1080           0.35s
GTX 980 Ti         0.41s
GTX 970            0.66s
GTX 680            1.57s
GTX 750 Ti         2.01s  2.01s
GTX 750            2.46s  2.37s
GTX 660            2.32s  2.32s
GTX 580            2.42s
GTX 480            2.87s
TX1                7.6s  (float32 storage and computation)
GT 610             33.5s
Some Aesara flags:
blas__ldflags= -L/opt/conda/envs/pymc_env/lib -lmkl_rt -lpthread -lm -lm
compiledir= /home/jovyan/.aesara/compiledir_Linux-5.4-amzn2.x86_64-x86_64-with-glibc2.31-x86_64-3.10.5-64
floatX= float64
device= cpu
Some OS information:
sys.platform= linux
sys.version= 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:06:46) [GCC 10.3.0]
sys.prefix= /opt/conda/envs/pymc_env
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the Aesara flag "blas__ldflags" is empty)
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/conda/envs/pymc_env/lib']
    include_dirs = ['/opt/conda/envs/pymc_env/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/conda/envs/pymc_env/lib']
    include_dirs = ['/opt/conda/envs/pymc_env/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['/opt/conda/envs/pymc_env/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['/opt/conda/envs/pymc_env/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['/opt/conda/envs/pymc_env/include']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: /opt/conda/envs/pymc_env/lib/python3.10/site-packages/numpy/__init__.py
Numpy version: 1.23.1

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 6.18s on CPU (with direct Aesara binding to blas).
Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
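One more thing I’m not sure about: MKL_NUM_THREADS and OMP_NUM_THREADS show up as None above. In case it’s relevant, this is roughly how one would pin them; the values are just an example, and they have to be set before aesara/numpy are imported:

    import os

    # Pin MKL/OpenMP thread counts; must happen before the first import of
    # numpy/aesara/pymc, e.g. in the first notebook cell
    os.environ["MKL_NUM_THREADS"] = "4"
    os.environ["OMP_NUM_THREADS"] = "4"

    import pymc as pm  # imported after the environment variables on purpose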