Thank you, @jessegrabowski, for the suggestion. I ran the command and got:
print(os.path.dirname(pytensor.__file__))"`/misc/check_blas.py
Some results that you can compare against. They were 10 executions
of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
All memory layout was in C order.
CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
Core i7 950(3.07GHz, hyper-threads enabled)
Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)
Libraries tested:
* numpy with ATLAS from distribution (FC9) package (1 thread)
* manually compiled numpy and ATLAS with 2 threads
* goto 1.26 with 1, 2, 4 and 8 threads
* goto2 1.13 compiled with multiple threads enabled
Xeon Xeon Xeon Core2 i7 i7 Xeon Xeon
lib/nb threads E5345 E5430 E5450 E8500 930 950 X5560 X5550
numpy 1.3.0 blas 775.92s
numpy_FC9_atlas/1 39.2s 35.0s 30.7s 29.6s 21.5s 19.60s
goto/1 18.7s 16.1s 14.2s 13.7s 16.1s 14.67s
numpy_MAN_atlas/2 12.0s 11.6s 10.2s 9.2s 9.0s
goto/2 9.5s 8.1s 7.1s 7.3s 8.1s 7.4s
goto/4 4.9s 4.4s 3.7s - 4.1s 3.8s
goto/8 2.7s 2.4s 2.0s - 4.1s 3.8s
openblas/1 14.04s
openblas/2 7.16s
openblas/4 3.71s
openblas/8 3.70s
mkl 11.0.083/1 7.97s
mkl 10.2.2.025/1 13.7s
mkl 10.2.2.025/2 7.6s
mkl 10.2.2.025/4 4.0s
mkl 10.2.2.025/8 2.0s
goto2 1.13/1 14.37s
goto2 1.13/2 7.26s
goto2 1.13/4 3.70s
goto2 1.13/8 1.94s
goto2 1.13/16 3.16s
Test time in float32. There were 10 executions of gemm in
float32 with matrices of shape 5000x5000 (M=N=K=5000)
All memory layout was in C order.
cuda version 8.0 7.5 7.0
gpu
M40 0.45s 0.47s
k80 0.92s 0.96s
K6000/NOECC 0.71s 0.69s
P6000/NOECC 0.25s
Titan X (Pascal) 0.28s
GTX Titan X 0.45s 0.45s 0.47s
GTX Titan Black 0.66s 0.64s 0.64s
GTX 1080 0.35s
GTX 980 Ti 0.41s
GTX 970 0.66s
GTX 680 1.57s
GTX 750 Ti 2.01s 2.01s
GTX 750 2.46s 2.37s
GTX 660 2.32s 2.32s
GTX 580 2.42s
GTX 480 2.87s
TX1 7.6s (float32 storage and computation)
GT 610 33.5s
Some PyTensor flags:
blas__ldflags= -L/opt/miniconda3/envs/pymc_env/lib -llapack -lblas -lcblas -lm -Wl,-rpath,/opt/miniconda3/envs/pymc_env/lib
compiledir= /Users/dekermanjian/.pytensor/compiledir_macOS-14.5-arm64-arm-64bit-arm-3.12.3-64
floatX= float64
device= cpu
Some OS information:
sys.platform= darwin
sys.version= 3.12.3 | packaged by conda-forge | (main, Apr 15 2024, 18:35:20) [Clang 16.0.6 ]
sys.prefix= /opt/miniconda3/envs/pymc_env
Some environment variables:
MKL_NUM_THREADS= None
OMP_NUM_THREADS= None
GOTO_NUM_THREADS= None
Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
Build Dependencies:
blas:
detection method: pkgconfig
found: true
include directory: /opt/miniconda3/envs/pymc_env/include
lib directory: /opt/miniconda3/envs/pymc_env/lib
name: blas
openblas configuration: unknown
pc file directory: /opt/miniconda3/envs/pymc_env/lib/pkgconfig
version: 3.9.0
lapack:
detection method: internal
found: true
include directory: unknown
lib directory: unknown
name: dep4569863840
openblas configuration: unknown
pc file directory: unknown
version: 1.26.4
Compilers:
c:
args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -isystem,
/opt/miniconda3/envs/pymc_env/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
-fdebug-prefix-map=/opt/miniconda3/envs/pymc_env=/usr/local/src/conda-prefix,
-D_FORTIFY_SOURCE=2, -isystem, /opt/miniconda3/envs/pymc_env/include, -mmacosx-version-min=11.0
commands: arm64-apple-darwin20.0.0-clang
linker: ld64
linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/opt/miniconda3/envs/pymc_env/lib,
-L/opt/miniconda3/envs/pymc_env/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
-O2, -pipe, -isystem, /opt/miniconda3/envs/pymc_env/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
-fdebug-prefix-map=/opt/miniconda3/envs/pymc_env=/usr/local/src/conda-prefix,
-D_FORTIFY_SOURCE=2, -isystem, /opt/miniconda3/envs/pymc_env/include, -mmacosx-version-min=11.0
name: clang
version: 16.0.6
c++:
args: -ftree-vectorize, -fPIC, -fstack-protector-strong, -O2, -pipe, -stdlib=libc++,
-fvisibility-inlines-hidden, -fmessage-length=0, -isystem, /opt/miniconda3/envs/pymc_env/include,
-fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
-fdebug-prefix-map=/opt/miniconda3/envs/pymc_env=/usr/local/src/conda-prefix,
-D_FORTIFY_SOURCE=2, -isystem, /opt/miniconda3/envs/pymc_env/include, -mmacosx-version-min=11.0
commands: arm64-apple-darwin20.0.0-clang++
linker: ld64
linker args: -Wl,-headerpad_max_install_names, -Wl,-dead_strip_dylibs, -Wl,-rpath,/opt/miniconda3/envs/pymc_env/lib,
-L/opt/miniconda3/envs/pymc_env/lib, -ftree-vectorize, -fPIC, -fstack-protector-strong,
-O2, -pipe, -stdlib=libc++, -fvisibility-inlines-hidden, -fmessage-length=0,
-isystem, /opt/miniconda3/envs/pymc_env/include, -fdebug-prefix-map=/Users/runner/miniforge3/conda-bld/numpy_1707225421156/work=/usr/local/src/conda/numpy-1.26.4,
-fdebug-prefix-map=/opt/miniconda3/envs/pymc_env=/usr/local/src/conda-prefix,
-D_FORTIFY_SOURCE=2, -isystem, /opt/miniconda3/envs/pymc_env/include, -mmacosx-version-min=11.0
name: clang
version: 16.0.6
cython:
commands: cython
linker: cython
name: cython
version: 3.0.8
Machine Information:
build:
cpu: aarch64
endian: little
family: aarch64
system: darwin
cross-compiled: true
host:
cpu: arm64
endian: little
family: aarch64
system: darwin
Python Information:
path: /opt/miniconda3/envs/pymc_env/bin/python
version: '3.12'
SIMD Extensions:
baseline:
- NEON
- NEON_FP16
- NEON_VFPV4
- ASIMD
found:
- ASIMDHP
not found:
- ASIMDFHM
Numpy dot module: numpy
Numpy location: /opt/miniconda3/envs/pymc_env/lib/python3.12/site-packages/numpy/__init__.py
Numpy version: 1.26.4
We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).
Total execution time: 7.34s on CPU (with direct PyTensor binding to blas).
Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
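To put the 7.34s figure in perspective, here is a rough conversion to throughput. This is only a sketch: it assumes the usual estimate of 2·n³ floating-point operations per square matrix multiplication, and `gemm_gflops` is a name I made up for illustration, not something from PyTensor.

```python
def gemm_gflops(n, calls, seconds):
    """Approximate GFLOP/s for `calls` multiplications of two n x n matrices."""
    # Common estimate (an assumption here): n^3 multiplies plus n^3 adds.
    flops_per_call = 2 * n**3
    return calls * flops_per_call / seconds / 1e9


# Values taken from the run above: 10 calls, 5000 x 5000 matrices, 7.34s total.
mine = gemm_gflops(n=5000, calls=10, seconds=7.34)
print(f"{mine:.1f} GFLOP/s")
```

Because the reference CPU table uses 2000x2000 matrices while this run used 5000x5000, the raw seconds are not directly comparable; a per-second rate like this is one way to normalize them.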
The output mentions blas__ldflags in two places: one shows it set to some path, and the other says the Numpy config is used when the flag is empty. I am not sure how to interpret that.
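My tentative reading, which may well be wrong, is that because blas__ldflags is not empty here, the Numpy config section is informational only and PyTensor links against the libraries named in the flag (the "direct PyTensor binding to blas" in the timing line seems to agree). To see what the flag actually names, I split the string with a small script. The helper `split_ldflags` is my own hypothetical name, not part of PyTensor, and the flag string is copied from the output above.

```python
import shlex

# Flag string copied from the "Some PyTensor flags" section of the output.
ldflags = (
    "-L/opt/miniconda3/envs/pymc_env/lib -llapack -lblas -lcblas -lm "
    "-Wl,-rpath,/opt/miniconda3/envs/pymc_env/lib"
)


def split_ldflags(flags):
    """Group linker flags into library directories, libraries, and the rest."""
    lib_dirs, libs, other = [], [], []
    for token in shlex.split(flags):
        if token.startswith("-L"):
            lib_dirs.append(token[2:])  # directory searched for libraries
        elif token.startswith("-l"):
            libs.append(token[2:])  # library name, e.g. "blas" for libblas
        else:
            other.append(token)  # anything else, e.g. rpath options
    return lib_dirs, libs, other


lib_dirs, libs, other = split_ldflags(ldflags)
print("library directories:", lib_dirs)
print("libraries linked:", libs)
print("other linker options:", other)
# Assumption: a non-empty string means the Numpy fallback is not used.
print("flag is empty:", not ldflags.strip())
```

So the "path" in the flag is just the conda environment's lib directory, and the libraries requested from it are lapack, blas, cblas and m. Which BLAS implementation actually sits behind libblas in that directory is not something this output tells me.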