Parallelization at the BLAS Level when Sampling

During sampling, not all CPU cores are utilized. The number of CPU threads used by sample seems to be only the parameter cores. There seems to be no BLAS parallelization. My PyTensor works fine. It correctly takes appropriate environment variables (MKL_NUM_THREADS or OPENBLAS_NUM_THREADS) to adjust the thread count. I suspect the possible causes are

  1. my model being too simple
  2. my data size being too small
  3. using AMD with MKL
  4. other problems

Is there any way to check my PyMC setup to eliminate some causes?

The code I used to verify the sampling behavior is

import pymc as pm
import numpy as np

with pm.Model() as model:
    x = pm.Normal("x", mu=0, sigma=3, shape=1000)
    obs = np.random.normal(size=(1000, 1000)) # (10_000, 1_000) behaves the same
    pm.Normal("y", mu=x, observed=obs)
    pm.sample(cores=1, chains=1, draws=1000)

The NumPy C-API also does not parallelize, but it may be caused by another reason, e.g., by design, and not important.

Config

OS: Windows 10
CPU: Ryzen 3700x, 8 cores, 16 threads
Python Env: Conda

PyMC Python NumPy BLAS
5.6.1 3.10.12 1.25.2 MKL 2023.1.0
5.6.1 3.11.5 1.25.2 OpenBLAS 0.3.24

MKL

check_blas.py completed with 9.11s and a 50% peak CPU usage. With MKL_NUM_THREADS=1, it completed with 42.88s and an 8.4% peak CPU usage.

check_blas.py Output

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -L"C:\Users\user\.conda\envs\pm5\Library\bin" -lmkl_core -lmkl_intel_thread -lmkl_rt
    compiledir= C:\Users\user\AppData\Local\PyTensor\compiledir_Windows-10-10.0.19045-SP0-AMD64_Family_23_Model_113_Stepping_0_AuthenticAMD-3.10.12-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= win32
    sys.version= 3.10.12 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 19:09:20) [MSC v.1916 64 bit (AMD64)]
    sys.prefix= C:\Users\user\.conda\envs\pm5
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_armpl_info:
  NOT AVAILABLE
blas_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\include']
blas_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\include']
lapack_armpl_info:
  NOT AVAILABLE
lapack_mkl_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\include']
lapack_opt_info:
    libraries = ['mkl_rt']
    library_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/user/.conda/envs/pm5\\Library\\include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: C:\Users\user\.conda\envs\pm5\lib\site-packages\numpy\__init__.py
Numpy version: 1.25.2

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 9.11s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
pytensor.config Output
floatX ({'float32', 'float64', 'float16'}) 
    Doc:  Default floating-point precision for python casts.

Note: float16 support is experimental, use at your own risk.
    Value:  float64

warn_float64 ({'warn', 'raise', 'ignore', 'pdb'}) 
    Doc:  Do an action when a tensor variable with float64 dtype is created.
    Value:  ignore

pickle_test_value (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FADD80>>) 
    Doc:  Dump test values while pickling model. If True, test values will be dumped with model.
    Value:  True

cast_policy ({'custom', 'numpy+floatX'}) 
    Doc:  Rules for implicit type casting
    Value:  custom

deterministic ({'default', 'more'}) 
    Doc:  If `more`, sometimes we will select some implementation that are more deterministic, but slower.  Also see the dnn.conv.algo* flags to cover more cases.
    Value:  default

device (cpu)
    Doc:  Default device for computations. only cpu is supported for now
    Value:  cpu

force_device (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FADED0>>) 
    Doc:  Raise an error if we can't use the specified device
    Value:  False

conv__assert_shape (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FADF30>>) 
    Doc:  If True, AbstractConv* ops will verify that user-provided shapes match the runtime shapes (debugging option, may slow down compilation)
    Value:  False

print_global_stats (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FADF60>>) 
    Doc:  Print some global statistics (time spent) at the end
    Value:  False

assert_no_cpu_op ({'warn', 'raise', 'ignore', 'pdb'}) 
    Doc:  Raise an error/warning if there is a CPU op in the computational graph.
    Value:  ignore

unpickle_function (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FADFF0>>) 
    Doc:  Replace unpickled PyTensor functions with None. This is useful to unpickle old graphs that pickled them when it shouldn't
    Value:  True

<pytensor.configparser.ConfigParam object at 0x0000010782FAE020>
    Doc:  Default compilation mode
    Value:  Mode

cxx (<class 'str'>) 
    Doc:  The C++ compiler to use. Currently only g++ is supported, but supporting additional compilers should not be too difficult. If it is empty, no C++ code is compiled.
    Value:  "C:\Users\user\.conda\envs\pm5\Library\mingw-w64\bin\g++.exe"

linker ({'c|py', 'cvm_nogc', 'vm', 'py', 'vm_nogc', 'cvm', 'c', 'c|py_nogc'}) 
    Doc:  Default linker used if the pytensor flags mode is Mode
    Value:  cvm

allow_gc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE290>>) 
    Doc:  Do we default to delete intermediate results during PyTensor function calls? Doing so lowers the memory requirement, but asks that we reallocate memory at the next function call. This is implemented for the default linker, but may not work for all linkers.
    Value:  True

optimizer ({'o1', 'fast_compile', 'o3', 'merge', 'o4', 'o2', 'fast_run', 'None', 'unsafe'}) 
    Doc:  Default optimizer. If not None, will use this optimizer with the Mode
    Value:  o4

optimizer_verbose (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE050>>) 
    Doc:  If True, we print all optimization being applied
    Value:  False

on_opt_error ({'ignore', 'warn', 'raise', 'pdb'}) 
    Doc:  What to do when an optimization crashes: warn and skip it, raise the exception, or fall into the pdb debugger.
    Value:  warn

nocleanup (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE0E0>>) 
    Doc:  Suppress the deletion of code files that did not compile cleanly
    Value:  False

on_unused_input ({'ignore', 'warn', 'raise'}) 
    Doc:  What to do if a variable in the 'inputs' list of  pytensor.function() is not used in the graph.
    Value:  raise

gcc__cxxflags (<class 'str'>) 
    Doc:  Extra compiler flags for gcc
    Value:  

cmodule__warn_no_version (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE320>>) 
    Doc:  If True, will print a warning when compiling one or more Op with C code that can't be cached because there is no c_code_cache_version() function associated to at least one of those Ops.
    Value:  False

cmodule__remove_gxx_opt (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE170>>) 
    Doc:  If True, will remove the -O* parameter passed to g++.This is useful to debug in gdb modules compiled by PyTensor.The parameter -g is passed by default to g++
    Value:  False

cmodule__compilation_warning (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE1A0>>) 
    Doc:  If True, will print compilation warnings.
    Value:  False

cmodule__preload_cache (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE1D0>>) 
    Doc:  If set to True, will preload the C module cache at import time
    Value:  False

cmodule__age_thresh_use (<class 'int'>) 
    Doc:  In seconds. The time after which PyTensor won't reuse a compile c module.
    Value:  2073600

cmodule__debug (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE260>>) 
    Doc:  If True, define a DEBUG macro (if not exists) for any compiled C code.
    Value:  False

compile__wait (<class 'int'>) 
    Doc:  Time to wait before retrying to acquire the compile lock.
    Value:  5

compile__timeout (<class 'int'>) 
    Doc:  In seconds, time that a process will wait before deciding to
    override an existing lock. An override only happens when the existing
    lock is held by the same owner *and* has not been 'refreshed' by this
    owner for more than this period. Refreshes are done every half timeout
    period for running processes.
    Value:  120

ctc__root (<class 'str'>) 
    Doc:  Directory which contains the root of Baidu CTC library. It is assumed         that the compiled library is either inside the build, lib or lib64         subdirectory, and the header inside the include directory.
    Value:  

tensor__cmp_sloppy (<class 'int'>) 
    Doc:  Relax pytensor.tensor.math._allclose (0) not at all, (1) a bit, (2) more
    Value:  0

tensor__local_elemwise_fusion (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE3E0>>) 
    Doc:  Enable or not in fast_run mode(fast_run optimization) the elemwise fusion optimization
    Value:  True

lib__amblibm (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE410>>) 
    Doc:  Use amd's amdlibm numerical library
    Value:  False

tensor__insert_inplace_optimizer_validate_nb (<class 'int'>) 
    Doc:  -1: auto, if graph have less then 500 nodes 1, else 10
    Value:  -1

traceback__limit (<class 'int'>) 
    Doc:  The number of stack to trace. -1 mean all.
    Value:  8

traceback__compile_limit (<class 'int'>) 
    Doc:  The number of stack to trace to keep during compilation. -1 mean all. If greater then 0, will also make us save PyTensor internal stack trace.
    Value:  0

experimental__local_alloc_elemwise (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE4D0>>) 
    Doc:  DEPRECATED: If True, enable the experimental optimization local_alloc_elemwise. Generates error if not True. Use optimizer_excluding=local_alloc_elemwise to disable.
    Value:  True

experimental__local_alloc_elemwise_assert (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE500>>) 
    Doc:  When the local_alloc_elemwise is applied, add an assert to highlight shape errors.
    Value:  True

warn__ignore_bug_before ({'0.3', '1.0.3', '0.7', '0.9', '1.0.4', '0.6', '0.8.2', '1.0.5', '1.0', '0.8.1', '1.0.1', 'None', '1.0.2', '0.8', '0.4', '0.10', '0.4.1', '0.5', 'all'}) 
    Doc:  If 'None', we warn about all PyTensor bugs found by default. If 'all', we don't warn about PyTensor bugs found by default. If a version, we print only the warnings relative to PyTensor bugs found after that version. Warning for specific bugs can be configured with specific [warn] flags.
    Value:  0.9

exception_verbosity ({'low', 'high'}) 
    Doc:  If 'low', the text of exceptions will generally refer to apply nodes with short names such as Elemwise{add_no_inplace}. If 'high', some exceptions will also refer to apply nodes with long descriptions  like:
        A. Elemwise{add_no_inplace}
                B. log_likelihood_v_given_h
                C. log_likelihood_h
    Value:  low

print_test_value (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE590>>) 
    Doc:  If 'True', the __eval__ of an PyTensor variable will return its test_value when this is available. This has the practical consequence that, e.g., in debugging `my_var` will print the same as `my_var.tag.test_value` when a test value is defined.
    Value:  False

compute_test_value ({'raise', 'ignore', 'off', 'warn', 'pdb'}) 
    Doc:  If 'True', PyTensor will run each op at graph build time, using Constants, SharedVariables and the tag 'test_value' as inputs to the function. This helps the user track down problems in the graph before it gets optimized.
    Value:  off

compute_test_value_opt ({'raise', 'ignore', 'off', 'warn', 'pdb'}) 
    Doc:  For debugging PyTensor optimization only. Same as compute_test_value, but is used during PyTensor optimization
    Value:  off

check_input (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE620>>) 
    Doc:  Specify if types should check their input in their C code. It can be used to speed up compilation, reduce overhead (particularly for scalars) and reduce the number of generated C files.
    Value:  True

NanGuardMode__nan_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE650>>) 
    Doc:  Default value for nan_is_error
    Value:  True

NanGuardMode__inf_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE680>>) 
    Doc:  Default value for inf_is_error
    Value:  True

NanGuardMode__big_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE6B0>>) 
    Doc:  Default value for big_is_error
    Value:  True

NanGuardMode__action ({'warn', 'raise', 'pdb'}) 
    Doc:  What NanGuardMode does when it finds a problem
    Value:  raise

DebugMode__patience (<class 'int'>) 
    Doc:  Optimize graph this many times to detect inconsistency
    Value:  10

DebugMode__check_c (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE740>>) 
    Doc:  Run C implementations where possible
    Value:  True

DebugMode__check_py (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE770>>) 
    Doc:  Run Python implementations where possible
    Value:  True

DebugMode__check_finite (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE7A0>>) 
    Doc:  True -> complain about NaN/Inf results
    Value:  True

DebugMode__check_strides (<class 'int'>) 
    Doc:  Check that Python- and C-produced ndarrays have same strides. On difference: (0) - ignore, (1) warn, or (2) raise error
    Value:  0

DebugMode__warn_input_not_reused (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE800>>) 
    Doc:  Generate a warning when destroy_map or view_map says that an op works inplace, but the op did not reuse the input for its output.
    Value:  True

DebugMode__check_preallocated_output (<class 'str'>) 
    Doc:  Test thunks with pre-allocated memory as output storage. This is a list of strings separated by ":". Valid values are: "initial" (initial storage in storage map, happens with Scan),"previous" (previously-returned memory), "c_contiguous", "f_contiguous", "strided" (positive and negative strides), "wrong_size" (larger and smaller dimensions), and "ALL" (all of the above).
    Value:  

DebugMode__check_preallocated_output_ndim (<class 'int'>) 
    Doc:  When testing with "strided" preallocated output memory, test all combinations of strides over that number of (inner-most) dimensions. You may want to reduce that number to reduce memory or time usage, but it is advised to keep a minimum of 2.
    Value:  4

profiling__time_thunks (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE890>>) 
    Doc:  Time individual thunks when profiling
    Value:  True

profiling__n_apply (<class 'int'>) 
    Doc:  Number of Apply instances to print by default
    Value:  20

profiling__n_ops (<class 'int'>) 
    Doc:  Number of Ops to print by default
    Value:  20

profiling__output_line_width (<class 'int'>) 
    Doc:  Max line width for the profiling output
    Value:  512

profiling__min_memory_size (<class 'int'>) 
    Doc:  For the memory profile, do not print Apply nodes if the size
                 of their outputs (in bytes) is lower than this threshold
    Value:  1024

profiling__min_peak_memory (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE980>>) 
    Doc:  The min peak memory usage of the order
    Value:  False

profiling__destination (<class 'str'>) 
    Doc:  File destination of the profiling output
    Value:  stderr

profiling__debugprint (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAE9E0>>) 
    Doc:  Do a debugprint of the profiled functions
    Value:  False

profiling__ignore_first_call (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAEA10>>) 
    Doc:  Do we ignore the first call of an PyTensor function.
    Value:  False

on_shape_error ({'warn', 'raise'}) 
    Doc:  warn: print a warning and use the default value. raise: raise an error
    Value:  warn

openmp (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAEA70>>) 
    Doc:  Allow (or not) parallel computation on the CPU with OpenMP. This is the default value used when creating an Op that supports OpenMP parallelization. It is preferable to define it via the PyTensor configuration file ~/.pytensorrc or with the environment variable PYTENSOR_FLAGS. Parallelization is only done for some operations that implement it, and even for operations that implement parallelism, each operation is free to respect this flag or not. You can control the number of threads used with the environment variable OMP_NUM_THREADS. If it is set to 1, we disable openmp in PyTensor by default.
    Value:  False

openmp_elemwise_minsize (<class 'int'>) 
    Doc:  If OpenMP is enabled, this is the minimum size of vectors for which the openmp parallelization is enabled in element wise ops.
    Value:  200000

optimizer_excluding (<class 'str'>) 
    Doc:  When using the default mode, we will remove optimizer with these tags. Separate tags with ':'.
    Value:  

optimizer_including (<class 'str'>) 
    Doc:  When using the default mode, we will add optimizer with these tags. Separate tags with ':'.
    Value:  

optimizer_requiring (<class 'str'>) 
    Doc:  When using the default mode, we will require optimizer with these tags. Separate tags with ':'.
    Value:  

optdb__position_cutoff (<class 'float'>) 
    Doc:  Where to stop earlier during optimization. It represent the position of the optimizer where to stop.
    Value:  inf

optdb__max_use_ratio (<class 'float'>) 
    Doc:  A ratio that prevent infinite loop in EquilibriumGraphRewriter.
    Value:  8.0

cycle_detection ({'fast', 'regular'}) 
    Doc:  If cycle_detection is set to regular, most inplaces are allowed,but it is slower. If cycle_detection is set to faster, less inplacesare allowed, but it makes the compilation faster.The interaction of which one give the lower peak memory usage iscomplicated and not predictable, so if you are close to the peakmemory usage, triyng both could give you a small gain.
    Value:  regular

check_stack_trace ({'warn', 'raise', 'log', 'off'}) 
    Doc:  A flag for checking the stack trace during the optimization process. default (off): does not check the stack trace of any optimization log: inserts a dummy stack trace that identifies the optimizationthat inserted the variable that had an empty stack trace.warn: prints a warning if a stack trace is missing and also a dummystack trace is inserted that indicates which optimization insertedthe variable that had an empty stack trace.raise: raises an exception if a stack trace is missing
    Value:  off

metaopt__verbose (<class 'int'>) 
    Doc:  0 for silent, 1 for only warnings, 2 for full output withtimings and selected implementation
    Value:  0

metaopt__optimizer_excluding (<class 'str'>) 
    Doc:  exclude optimizers with these tags. Separate tags with ':'.
    Value:  

metaopt__optimizer_including (<class 'str'>) 
    Doc:  include optimizers with these tags. Separate tags with ':'.
    Value:  

unittests__rseed (<class 'str'>) 
    Doc:  Seed to use for randomized unit tests. Special value 'random' means using a seed of None.
    Value:  666

warn__round (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAECE0>>) 
    Doc:  Warn when using `tensor.round` with the default mode. Round changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy.
    Value:  False

profile (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAED10>>) 
    Doc:  If VM should collect profile information
    Value:  False

profile_optimizer (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAED40>>) 
    Doc:  If VM should collect optimizer profile information
    Value:  False

profile_memory (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAED70>>) 
    Doc:  If VM should collect memory profile information and print it
    Value:  False

<pytensor.configparser.ConfigParam object at 0x0000010782FAEDA0>
    Doc:  Useful only for the VM Linkers. When lazy is None, auto detect if lazy evaluation is needed and use the appropriate version. If the C loop isn't being used and lazy is True, use the Stack VM; otherwise, use the Loop VM.
    Value:  None

numba__vectorize_target ({'parallel', 'cpu', 'cuda'}) 
    Doc:  Default target for numba.vectorize.
    Value:  cpu

numba__fastmath (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAEE00>>) 
    Doc:  If True, use Numba's fastmath mode.
    Value:  True

numba__cache (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010782FAEE30>>) 
    Doc:  If True, use Numba's file based caching.
    Value:  True

compiledir_format (<class 'str'>) 
    Doc:  Format string for platform-dependent compiled module subdirectory
(relative to base_compiledir). Available keys: device, gxx_version,
hostname, numpy_version, platform, processor, pytensor_version,
python_bitwidth, python_int_bitwidth, python_version, short_platform.
Defaults to compiledir_%(short_platform)s-%(processor)s-
%(python_version)s-%(python_bitwidth)s.
    Value:  compiledir_%(short_platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s

<pytensor.configparser.ConfigParam object at 0x0000010782FAF130>
    Doc:  platform-independent root directory for compiled modules
    Value:  C:\Users\user\AppData\Local\PyTensor

<pytensor.configparser.ConfigParam object at 0x0000010782FAEFE0>
    Doc:  platform-dependent cache directory for compiled modules
    Value:  C:\Users\user\AppData\Local\PyTensor\compiledir_Windows-10-10.0.19045-SP0-AMD64_Family_23_Model_113_Stepping_0_AuthenticAMD-3.10.12-64

blas__ldflags (<class 'str'>) 
    Doc:  lib[s] to include for [Fortran] level-3 blas implementation
    Value:  -L"C:\Users\user\.conda\envs\pm5\Library\bin" -lmkl_core -lmkl_intel_thread -lmkl_rt

blas__check_openmp (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x00000107831F7C40>>) 
    Doc:  Check for openmp library conflict.
WARNING: Setting this to False leaves you open to wrong results in blas-related operations.
    Value:  True

scan__allow_gc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x0000010785958880>>) 
    Doc:  Allow/disallow gc inside of Scan (default: False)
    Value:  False

scan__allow_output_prealloc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x00000107853F5000>>) 
    Doc:  Allow/disallow memory preallocation for outputs inside of scan (default: True)
    Value:  True



OpenBLAS

In the next post ↓

OpenBLAS

check_blas.py completed with 10.58s and a 100% peak CPU usage. With OPENBLAS_NUM_THREADS=1, it completed with 43.29s and a 9% peak CPU usage.

check_blas.py Output

        Some results that you can compare against. They were 10 executions
        of gemm in float64 with matrices of shape 2000x2000 (M=N=K=2000).
        All memory layout was in C order.

        CPU tested: Xeon E5345(2.33Ghz, 8M L2 cache, 1333Mhz FSB),
                    Xeon E5430(2.66Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon E5450(3Ghz, 12M L2 cache, 1333Mhz FSB),
                    Xeon X5560(2.8Ghz, 12M L2 cache, hyper-threads?)
                    Core 2 E8500, Core i7 930(2.8Ghz, hyper-threads enabled),
                    Core i7 950(3.07GHz, hyper-threads enabled)
                    Xeon X5550(2.67GHz, 8M l2 cache?, hyper-threads enabled)


        Libraries tested:
            * numpy with ATLAS from distribution (FC9) package (1 thread)
            * manually compiled numpy and ATLAS with 2 threads
            * goto 1.26 with 1, 2, 4 and 8 threads
            * goto2 1.13 compiled with multiple threads enabled

                          Xeon   Xeon   Xeon  Core2 i7    i7     Xeon   Xeon
        lib/nb threads    E5345  E5430  E5450 E8500 930   950    X5560  X5550

        numpy 1.3.0 blas                                                775.92s
        numpy_FC9_atlas/1 39.2s  35.0s  30.7s 29.6s 21.5s 19.60s
        goto/1            18.7s  16.1s  14.2s 13.7s 16.1s 14.67s
        numpy_MAN_atlas/2 12.0s  11.6s  10.2s  9.2s  9.0s
        goto/2             9.5s   8.1s   7.1s  7.3s  8.1s  7.4s
        goto/4             4.9s   4.4s   3.7s  -     4.1s  3.8s
        goto/8             2.7s   2.4s   2.0s  -     4.1s  3.8s
        openblas/1                                        14.04s
        openblas/2                                         7.16s
        openblas/4                                         3.71s
        openblas/8                                         3.70s
        mkl 11.0.083/1            7.97s
        mkl 10.2.2.025/1                                         13.7s
        mkl 10.2.2.025/2                                          7.6s
        mkl 10.2.2.025/4                                          4.0s
        mkl 10.2.2.025/8                                          2.0s
        goto2 1.13/1                                                     14.37s
        goto2 1.13/2                                                      7.26s
        goto2 1.13/4                                                      3.70s
        goto2 1.13/8                                                      1.94s
        goto2 1.13/16                                                     3.16s

        Test time in float32. There were 10 executions of gemm in
        float32 with matrices of shape 5000x5000 (M=N=K=5000)
        All memory layout was in C order.


        cuda version      8.0    7.5    7.0
        gpu
        M40               0.45s  0.47s
        k80               0.92s  0.96s
        K6000/NOECC       0.71s         0.69s
        P6000/NOECC       0.25s

        Titan X (Pascal)  0.28s
        GTX Titan X       0.45s  0.45s  0.47s
        GTX Titan Black   0.66s  0.64s  0.64s
        GTX 1080          0.35s
        GTX 980 Ti               0.41s
        GTX 970                  0.66s
        GTX 680                         1.57s
        GTX 750 Ti               2.01s  2.01s
        GTX 750                  2.46s  2.37s
        GTX 660                  2.32s  2.32s
        GTX 580                  2.42s
        GTX 480                  2.87s
        TX1                             7.6s (float32 storage and computation)
        GT 610                          33.5s

Some PyTensor flags:
    blas__ldflags= -LC:\Users\user\miniconda3\envs\pm5_ob\Library\bin -lopenblas
    compiledir= C:\Users\user\AppData\Local\PyTensor\compiledir_Windows-10-10.0.19045-SP0-AMD64_Family_23_Model_113_Stepping_0_AuthenticAMD-3.11.5-64
    floatX= float64
    device= cpu
Some OS information:
    sys.platform= win32
    sys.version= 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]
    sys.prefix= C:\Users\user\miniconda3\envs\pm5_ob
Some environment variables:
    MKL_NUM_THREADS= None
    OMP_NUM_THREADS= None
    GOTO_NUM_THREADS= None

Numpy config: (used when the PyTensor flag "blas__ldflags" is empty)
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\lib']
    include_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\include']
    language = f77
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\lib']
    include_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\include']
    language = f77
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['C:/Users/user/miniconda3/envs/pm5_ob\\Library\\include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
Numpy dot module: numpy
Numpy location: C:\Users\user\miniconda3\envs\pm5_ob\Lib\site-packages\numpy\__init__.py
Numpy version: 1.25.2

We executed 10 calls to gemm with a and b matrices of shapes (5000, 5000) and (5000, 5000).

Total execution time: 10.58s on CPU (with direct PyTensor binding to blas).

Try to run this script a few times. Experience shows that the first time is not as fast as following calls. The difference is not big, but consistent.
pytensor.config Output
floatX ({'float64', 'float32', 'float16'}) 
    Doc:  Default floating-point precision for python casts.

Note: float16 support is experimental, use at your own risk.
    Value:  float64

warn_float64 ({'pdb', 'ignore', 'warn', 'raise'}) 
    Doc:  Do an action when a tensor variable with float64 dtype is created.
    Value:  ignore

pickle_test_value (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779545B10>>) 
    Doc:  Dump test values while pickling model. If True, test values will be dumped with model.
    Value:  True

cast_policy ({'numpy+floatX', 'custom'}) 
    Doc:  Rules for implicit type casting
    Value:  custom

deterministic ({'default', 'more'}) 
    Doc:  If `more`, sometimes we will select some implementation that are more deterministic, but slower.  Also see the dnn.conv.algo* flags to cover more cases.
    Value:  default

device (cpu)
    Doc:  Default device for computations. only cpu is supported for now
    Value:  cpu

force_device (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779B5A090>>) 
    Doc:  Raise an error if we can't use the specified device
    Value:  False

conv__assert_shape (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779A8DAD0>>) 
    Doc:  If True, AbstractConv* ops will verify that user-provided shapes match the runtime shapes (debugging option, may slow down compilation)
    Value:  False

print_global_stats (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69A4D0>>) 
    Doc:  Print some global statistics (time spent) at the end
    Value:  False

assert_no_cpu_op ({'pdb', 'ignore', 'warn', 'raise'}) 
    Doc:  Raise an error/warning if there is a CPU op in the computational graph.
    Value:  ignore

unpickle_function (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86CFBB090>>) 
    Doc:  Replace unpickled PyTensor functions with None. This is useful to unpickle old graphs that pickled them when it shouldn't
    Value:  True

<pytensor.configparser.ConfigParam object at 0x000001D86CFB2F10>
    Doc:  Default compilation mode
    Value:  Mode

cxx (<class 'str'>) 
    Doc:  The C++ compiler to use. Currently only g++ is supported, but supporting additional compilers should not be too difficult. If it is empty, no C++ code is compiled.
    Value:  "C:\Users\user\miniconda3\envs\pm5_ob\Library\mingw-w64\bin\g++.exe"

linker ({'cvm_nogc', 'vm_nogc', 'vm', 'c|py_nogc', 'py', 'c|py', 'c', 'cvm'}) 
    Doc:  Default linker used if the pytensor flags mode is Mode
    Value:  cvm

allow_gc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69A610>>) 
    Doc:  Do we default to delete intermediate results during PyTensor function calls? Doing so lowers the memory requirement, but asks that we reallocate memory at the next function call. This is implemented for the default linker, but may not work for all linkers.
    Value:  True

optimizer ({'o1', 'fast_run', 'merge', 'o2', 'fast_compile', 'o3', 'None', 'o4', 'unsafe'}) 
    Doc:  Default optimizer. If not None, will use this optimizer with the Mode
    Value:  o4

optimizer_verbose (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BDEDBD0>>) 
    Doc:  If True, we print all optimization being applied
    Value:  False

on_opt_error ({'pdb', 'ignore', 'warn', 'raise'}) 
    Doc:  What to do when an optimization crashes: warn and skip it, raise the exception, or fall into the pdb debugger.
    Value:  warn

nocleanup (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69A850>>) 
    Doc:  Suppress the deletion of code files that did not compile cleanly
    Value:  False

on_unused_input ({'ignore', 'warn', 'raise'}) 
    Doc:  What to do if a variable in the 'inputs' list of  pytensor.function() is not used in the graph.
    Value:  raise

gcc__cxxflags (<class 'str'>) 
    Doc:  Extra compiler flags for gcc
    Value:  

cmodule__warn_no_version (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69AB10>>) 
    Doc:  If True, will print a warning when compiling one or more Op with C code that can't be cached because there is no c_code_cache_version() function associated to at least one of those Ops.
    Value:  False

cmodule__remove_gxx_opt (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779A55890>>) 
    Doc:  If True, will remove the -O* parameter passed to g++.This is useful to debug in gdb modules compiled by PyTensor.The parameter -g is passed by default to g++
    Value:  False

cmodule__compilation_warning (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BDD5D10>>) 
    Doc:  If True, will print compilation warnings.
    Value:  False

cmodule__preload_cache (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BFCAC90>>) 
    Doc:  If set to True, will preload the C module cache at import time
    Value:  False

cmodule__age_thresh_use (<class 'int'>) 
    Doc:  In seconds. The time after which PyTensor won't reuse a compile c module.
    Value:  2073600

cmodule__debug (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779545990>>) 
    Doc:  If True, define a DEBUG macro (if not exists) for any compiled C code.
    Value:  False

compile__wait (<class 'int'>) 
    Doc:  Time to wait before retrying to acquire the compile lock.
    Value:  5

compile__timeout (<class 'int'>) 
    Doc:  In seconds, time that a process will wait before deciding to
    override an existing lock. An override only happens when the existing
    lock is held by the same owner *and* has not been 'refreshed' by this
    owner for more than this period. Refreshes are done every half timeout
    period for running processes.
    Value:  120

ctc__root (<class 'str'>) 
    Doc:  Directory which contains the root of Baidu CTC library. It is assumed         that the compiled library is either inside the build, lib or lib64         subdirectory, and the header inside the include directory.
    Value:  

tensor__cmp_sloppy (<class 'int'>) 
    Doc:  Relax pytensor.tensor.math._allclose (0) not at all, (1) a bit, (2) more
    Value:  0

tensor__local_elemwise_fusion (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B090>>) 
    Doc:  Enable or not in fast_run mode(fast_run optimization) the elemwise fusion optimization
    Value:  True

lib__amblibm (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B110>>) 
    Doc:  Use amd's amdlibm numerical library
    Value:  False

tensor__insert_inplace_optimizer_validate_nb (<class 'int'>) 
    Doc:  -1: auto, if graph have less then 500 nodes 1, else 10
    Value:  -1

traceback__limit (<class 'int'>) 
    Doc:  The number of stack to trace. -1 mean all.
    Value:  8

traceback__compile_limit (<class 'int'>) 
    Doc:  The number of stack to trace to keep during compilation. -1 mean all. If greater then 0, will also make us save PyTensor internal stack trace.
    Value:  0

experimental__local_alloc_elemwise (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69A910>>) 
    Doc:  DEPRECATED: If True, enable the experimental optimization local_alloc_elemwise. Generates error if not True. Use optimizer_excluding=local_alloc_elemwise to disable.
    Value:  True

experimental__local_alloc_elemwise_assert (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D71140F290>>) 
    Doc:  When the local_alloc_elemwise is applied, add an assert to highlight shape errors.
    Value:  True

warn__ignore_bug_before ({'0.4', '1.0', '0.7', '0.10', '0.3', 'None', '1.0.3', '0.8.1', '1.0.2', '0.8', '0.5', 'all', '1.0.1', '1.0.4', '0.9', '1.0.5', '0.6', '0.8.2', '0.4.1'}) 
    Doc:  If 'None', we warn about all PyTensor bugs found by default. If 'all', we don't warn about PyTensor bugs found by default. If a version, we print only the warnings relative to PyTensor bugs found after that version. Warning for specific bugs can be configured with specific [warn] flags.
    Value:  0.9

exception_verbosity ({'high', 'low'}) 
    Doc:  If 'low', the text of exceptions will generally refer to apply nodes with short names such as Elemwise{add_no_inplace}. If 'high', some exceptions will also refer to apply nodes with long descriptions  like:
        A. Elemwise{add_no_inplace}
                B. log_likelihood_v_given_h
                C. log_likelihood_h
    Value:  low

print_test_value (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B490>>) 
    Doc:  If 'True', the __eval__ of an PyTensor variable will return its test_value when this is available. This has the practical consequence that, e.g., in debugging `my_var` will print the same as `my_var.tag.test_value` when a test value is defined.
    Value:  False

compute_test_value ({'ignore', 'raise', 'off', 'pdb', 'warn'}) 
    Doc:  If 'True', PyTensor will run each op at graph build time, using Constants, SharedVariables and the tag 'test_value' as inputs to the function. This helps the user track down problems in the graph before it gets optimized.
    Value:  off

compute_test_value_opt ({'ignore', 'raise', 'off', 'pdb', 'warn'}) 
    Doc:  For debugging PyTensor optimization only. Same as compute_test_value, but is used during PyTensor optimization
    Value:  off

check_input (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B650>>) 
    Doc:  Specify if types should check their input in their C code. It can be used to speed up compilation, reduce overhead (particularly for scalars) and reduce the number of generated C files.
    Value:  True

NanGuardMode__nan_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B590>>) 
    Doc:  Default value for nan_is_error
    Value:  True

NanGuardMode__inf_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BE0A4D0>>) 
    Doc:  Default value for inf_is_error
    Value:  True

NanGuardMode__big_is_error (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B850>>) 
    Doc:  Default value for big_is_error
    Value:  True

NanGuardMode__action ({'pdb', 'warn', 'raise'}) 
    Doc:  What NanGuardMode does when it finds a problem
    Value:  raise

DebugMode__patience (<class 'int'>) 
    Doc:  Optimize graph this many times to detect inconsistency
    Value:  10

DebugMode__check_c (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69B9D0>>) 
    Doc:  Run C implementations where possible
    Value:  True

DebugMode__check_py (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69BA10>>) 
    Doc:  Run Python implementations where possible
    Value:  True

DebugMode__check_finite (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69BB50>>) 
    Doc:  True -> complain about NaN/Inf results
    Value:  True

DebugMode__check_strides (<class 'int'>) 
    Doc:  Check that Python- and C-produced ndarrays have same strides. On difference: (0) - ignore, (1) warn, or (2) raise error
    Value:  0

DebugMode__warn_input_not_reused (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69BC10>>) 
    Doc:  Generate a warning when destroy_map or view_map says that an op works inplace, but the op did not reuse the input for its output.
    Value:  True

DebugMode__check_preallocated_output (<class 'str'>) 
    Doc:  Test thunks with pre-allocated memory as output storage. This is a list of strings separated by ":". Valid values are: "initial" (initial storage in storage map, happens with Scan),"previous" (previously-returned memory), "c_contiguous", "f_contiguous", "strided" (positive and negative strides), "wrong_size" (larger and smaller dimensions), and "ALL" (all of the above).
    Value:  

DebugMode__check_preallocated_output_ndim (<class 'int'>) 
    Doc:  When testing with "strided" preallocated output memory, test all combinations of strides over that number of (inner-most) dimensions. You may want to reduce that number to reduce memory or time usage, but it is advised to keep a minimum of 2.
    Value:  4

profiling__time_thunks (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69BE10>>) 
    Doc:  Time individual thunks when profiling
    Value:  True

profiling__n_apply (<class 'int'>) 
    Doc:  Number of Apply instances to print by default
    Value:  20

profiling__n_ops (<class 'int'>) 
    Doc:  Number of Ops to print by default
    Value:  20

profiling__output_line_width (<class 'int'>) 
    Doc:  Max line width for the profiling output
    Value:  512

profiling__min_memory_size (<class 'int'>) 
    Doc:  For the memory profile, do not print Apply nodes if the size
                 of their outputs (in bytes) is lower than this threshold
    Value:  1024

profiling__min_peak_memory (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779A72550>>) 
    Doc:  The min peak memory usage of the order
    Value:  False

profiling__destination (<class 'str'>) 
    Doc:  File destination of the profiling output
    Value:  stderr

profiling__debugprint (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D69BFD0>>) 
    Doc:  Do a debugprint of the profiled functions
    Value:  False

profiling__ignore_first_call (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D6A8090>>) 
    Doc:  Do we ignore the first call of an PyTensor function.
    Value:  False

on_shape_error ({'warn', 'raise'}) 
    Doc:  warn: print a warning and use the default value. raise: raise an error
    Value:  warn

openmp (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D779A73310>>) 
    Doc:  Allow (or not) parallel computation on the CPU with OpenMP. This is the default value used when creating an Op that supports OpenMP parallelization. It is preferable to define it via the PyTensor configuration file ~/.pytensorrc or with the environment variable PYTENSOR_FLAGS. Parallelization is only done for some operations that implement it, and even for operations that implement parallelism, each operation is free to respect this flag or not. You can control the number of threads used with the environment variable OMP_NUM_THREADS. If it is set to 1, we disable openmp in PyTensor by default.
    Value:  False

openmp_elemwise_minsize (<class 'int'>) 
    Doc:  If OpenMP is enabled, this is the minimum size of vectors for which the openmp parallelization is enabled in element wise ops.
    Value:  200000

optimizer_excluding (<class 'str'>) 
    Doc:  When using the default mode, we will remove optimizer with these tags. Separate tags with ':'.
    Value:  

optimizer_including (<class 'str'>) 
    Doc:  When using the default mode, we will add optimizer with these tags. Separate tags with ':'.
    Value:  

optimizer_requiring (<class 'str'>) 
    Doc:  When using the default mode, we will require optimizer with these tags. Separate tags with ':'.
    Value:  

optdb__position_cutoff (<class 'float'>) 
    Doc:  Where to stop earlier during optimization. It represent the position of the optimizer where to stop.
    Value:  inf

optdb__max_use_ratio (<class 'float'>) 
    Doc:  A ratio that prevent infinite loop in EquilibriumGraphRewriter.
    Value:  8.0

cycle_detection ({'fast', 'regular'}) 
    Doc:  If cycle_detection is set to regular, most inplaces are allowed,but it is slower. If cycle_detection is set to faster, less inplacesare allowed, but it makes the compilation faster.The interaction of which one give the lower peak memory usage iscomplicated and not predictable, so if you are close to the peakmemory usage, triyng both could give you a small gain.
    Value:  regular

check_stack_trace ({'off', 'warn', 'log', 'raise'}) 
    Doc:  A flag for checking the stack trace during the optimization process. default (off): does not check the stack trace of any optimization log: inserts a dummy stack trace that identifies the optimizationthat inserted the variable that had an empty stack trace.warn: prints a warning if a stack trace is missing and also a dummystack trace is inserted that indicates which optimization insertedthe variable that had an empty stack trace.raise: raises an exception if a stack trace is missing
    Value:  off

metaopt__verbose (<class 'int'>) 
    Doc:  0 for silent, 1 for only warnings, 2 for full output withtimings and selected implementation
    Value:  0

metaopt__optimizer_excluding (<class 'str'>) 
    Doc:  exclude optimizers with these tags. Separate tags with ':'.
    Value:  

metaopt__optimizer_including (<class 'str'>) 
    Doc:  include optimizers with these tags. Separate tags with ':'.
    Value:  

unittests__rseed (<class 'str'>) 
    Doc:  Seed to use for randomized unit tests. Special value 'random' means using a seed of None.
    Value:  666

warn__round (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D6A88D0>>) 
    Doc:  Warn when using `tensor.round` with the default mode. Round changed its default from `half_away_from_zero` to `half_to_even` to have the same default as NumPy.
    Value:  False

profile (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D6A87D0>>) 
    Doc:  If VM should collect profile information
    Value:  False

profile_optimizer (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BE1D710>>) 
    Doc:  If VM should collect optimizer profile information
    Value:  False

profile_memory (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D54B350>>) 
    Doc:  If VM should collect memory profile information and print it
    Value:  False

<pytensor.configparser.ConfigParam object at 0x000001D779A73390>
    Doc:  Useful only for the VM Linkers. When lazy is None, auto detect if lazy evaluation is needed and use the appropriate version. If the C loop isn't being used and lazy is True, use the Stack VM; otherwise, use the Loop VM.
    Value:  None

numba__vectorize_target ({'cpu', 'cuda', 'parallel'}) 
    Doc:  Default target for numba.vectorize.
    Value:  cpu

numba__fastmath (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BDD5F10>>) 
    Doc:  If True, use Numba's fastmath mode.
    Value:  True

numba__cache (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86BDFDAD0>>) 
    Doc:  If True, use Numba's file based caching.
    Value:  True

compiledir_format (<class 'str'>) 
    Doc:  Format string for platform-dependent compiled module subdirectory
(relative to base_compiledir). Available keys: device, gxx_version,
hostname, numpy_version, platform, processor, pytensor_version,
python_bitwidth, python_int_bitwidth, python_version, short_platform.
Defaults to compiledir_%(short_platform)s-%(processor)s-
%(python_version)s-%(python_bitwidth)s.
    Value:  compiledir_%(short_platform)s-%(processor)s-%(python_version)s-%(python_bitwidth)s

<pytensor.configparser.ConfigParam object at 0x000001D86CFB9A50>
    Doc:  platform-independent root directory for compiled modules
    Value:  C:\Users\user\AppData\Local\PyTensor

<pytensor.configparser.ConfigParam object at 0x000001D86BDD5250>
    Doc:  platform-dependent cache directory for compiled modules
    Value:  C:\Users\user\AppData\Local\PyTensor\compiledir_Windows-10-10.0.19045-SP0-AMD64_Family_23_Model_113_Stepping_0_AuthenticAMD-3.11.5-64

blas__ldflags (<class 'str'>) 
    Doc:  lib[s] to include for [Fortran] level-3 blas implementation
    Value:  -LC:\Users\user\miniconda3\envs\pm5_ob\Library\bin -lopenblas

blas__check_openmp (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86D926E10>>) 
    Doc:  Check for openmp library conflict.
WARNING: Setting this to False leaves you open to wrong results in blas-related operations.
    Value:  True

scan__allow_gc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86F9E7DD0>>) 
    Doc:  Allow/disallow gc inside of Scan (default: False)
    Value:  False

scan__allow_output_prealloc (<bound method BoolParam._apply of <pytensor.configparser.BoolParam object at 0x000001D86E4DB890>>) 
    Doc:  Allow/disallow memory preallocation for outputs inside of scan (default: True)
    Value:  True



Your model probably involves no Operation that triggers BLAS multithreading. Usually you need a dot in it.

You probably cab trigger openmp but you have to enable it manually and play with the minsize: Multi cores support in PyTensor — PyTensor dev documentation

Here is a tweet that shows ir as well: https://x.com/pymc_devs/status/1661277738437603329?s=20

Should the following code trigger BLAS multithreading? It is single-threaded for me.

import pymc as pm
import pytensor.tensor as pt
import numpy as np

with pm.Model() as model:
    x = pm.Normal("x", mu=0, shape=1000)
    y = pm.Normal("y", mu=0, shape=1000)
    pm.Normal("obs_x", mu=x, observed=np.random.normal(size=(1000, 1000)))
    pm.Normal("obs_y", mu=y, observed=np.random.normal(size=(1000, 1000)))
    pm.Normal("obs_z", mu=pt.dot(x, y), observed=np.random.normal(size=1000))
    pm.sample(cores=1, chains=1, draws=1000)

OpenMP works correctly for my setup.

Try to increase x,y size. Also you can do:

import pytensor

with pm.Model() as m:
  ...

f = m.compile_logp().f
pytensor.dprint(f)

If you see a Dot22 or GEMM it’s using Blas Ops.

The sizes being 1000 are not enough to trigger the multi-threading for MKL. It needs at least 10,000. Is it expected?

OpenBLAS virtually does not budge regardless of the sizes. There may be other problems.

The CPU Usages are

Size 1,000 10,000 100,000 1,000,000
MKL 8% 56% 56% 56%
OpenBLAS 8% 9% 9% 12%
import pymc as pm
import pytensor
import pytensor.tensor as pt
import numpy as np

size = 1000000
obs_size = 10
np.random.seed(0)

with pm.Model() as model:
    x = pm.Normal("x", mu=0, shape=size)
    y = pm.Normal("y", mu=0, shape=size)
    pm.Normal("obs_x", mu=x, observed=np.random.normal(size=(obs_size, size)))
    pm.Normal("obs_y", mu=y, observed=np.random.normal(size=(obs_size, size)))
    pm.Normal("obs_z", mu=pt.sum(pt.dot(x, y)), observed=np.random.normal(size=size))
    f = model.compile_logp().f
    pytensor.dprint(f)
    m = pm.sample(cores=1, chains=1, draws=1000)

There are no Dot22 or GEMM but CGemv.

pytensor.dprint(f)
Sum{axes=None} [id A] '__logp' 16
 └─ MakeVector{dtype='float64'} [id B] 15
    ├─ Sum{axes=None} [id C] 14
    │  └─ Composite{((i1 * sqr(i0)) - i2)} [id D] 'sigma > 0' 13
    │     ├─ x [id E]
    │     ├─ [-0.5] [id F]
    │     └─ [0.91893853] [id G]
    ├─ Sum{axes=None} [id H] 12
    │  └─ Composite{((i1 * sqr(i0)) - i2)} [id I] 'sigma > 0' 11
    │     ├─ y [id J]
    │     ├─ [-0.5] [id F]
    │     └─ [0.91893853] [id G]
    ├─ Sum{axes=None} [id K] 10
    │  └─ Composite{((i2 * sqr((i0 - i1))) - i3)} [id L] 'sigma > 0' 9
    │     ├─ obs_x{[[ 1.76405 ... 29811143]]} [id M]
    │     ├─ ExpandDims{axis=0} [id N] 8
    │     │  └─ x [id E]
    │     ├─ [[-0.5]] [id O]
    │     └─ [[0.91893853]] [id P]
    ├─ Sum{axes=None} [id Q] 7
    │  └─ Composite{((i2 * sqr((i0 - i1))) - i3)} [id R] 'sigma > 0' 6
    │     ├─ obs_y{[[-0.20211 ... 32652844]]} [id S]
    │     ├─ ExpandDims{axis=0} [id T] 5
    │     │  └─ y [id J]
    │     ├─ [[-0.5]] [id O]
    │     └─ [[0.91893853]] [id P]
    └─ Sum{axes=None} [id U] 4
       └─ Composite{((i2 * sqr((i0 - i1))) - i3)} [id V] 'sigma > 0' 3
          ├─ obs_z{[ 3.300458 ... 41499e+00]} [id W]
          ├─ CGemv{inplace} [id X] 2
          │  ├─ AllocEmpty{dtype='float64'} [id Y] 1
          │  │  └─ 1 [id Z]
          │  ├─ 1.0 [id BA]
          │  ├─ ExpandDims{axis=0} [id BB] 0
          │  │  └─ y [id J]
          │  ├─ x [id E]
          │  └─ 0.0 [id BC]
          ├─ [-0.5] [id F]
          └─ [0.91893853] [id G]

Inner graphs:

Composite{((i1 * sqr(i0)) - i2)} [id D]
 ← sub [id BD] 'o0'
    ├─ mul [id BE]
    │  ├─ i1 [id BF]
    │  └─ sqr [id BG]
    │     └─ i0 [id BH]
    └─ i2 [id BI]

Composite{((i1 * sqr(i0)) - i2)} [id I]
 ← sub [id BJ] 'o0'
    ├─ mul [id BK]
    │  ├─ i1 [id BL]
    │  └─ sqr [id BM]
    │     └─ i0 [id BN]
    └─ i2 [id BO]

Composite{((i2 * sqr((i0 - i1))) - i3)} [id L]
 ← sub [id BP] 'o0'
    ├─ mul [id BQ]
    │  ├─ i2 [id BR]
    │  └─ sqr [id BS]
    │     └─ sub [id BT]
    │        ├─ i0 [id BU]
    │        └─ i1 [id BV]
    └─ i3 [id BW]

Composite{((i2 * sqr((i0 - i1))) - i3)} [id R]
 ← sub [id BX] 'o0'
    ├─ mul [id BY]
    │  ├─ i2 [id BZ]
    │  └─ sqr [id CA]
    │     └─ sub [id CB]
    │        ├─ i0 [id CC]
    │        └─ i1 [id CD]
    └─ i3 [id CE]

Composite{((i2 * sqr((i0 - i1))) - i3)} [id V]
 ← sub [id CF] 'o0'
    ├─ mul [id CG]
    │  ├─ i2 [id CH]
    │  └─ sqr [id CI]
    │     └─ sub [id CJ]
    │        ├─ i0 [id CK]
    │        └─ i1 [id CL]
    └─ i3 [id CM]

Thats doesn’t sound too unreasonable. CPUs are pretty fast these days and multithreading only gives improvement for rather large arrays, so I think they are conservative about when to trigger it.

You may see more clear usage with a matrix to matrix dot instead of the vector to vector one you’re using.

Anyway PyTensor tests showed it was well linked so there shouldn’t be anything wrong.

Indeed, the number of multiplications in the naive algorithm of multiplication of two 100 by 100 matrices is 1,000,000. The code below triggers parallelization in MKL with a matrix size of 100.

import pymc as pm
import pytensor
import pytensor.tensor as pt
import numpy as np

size = 100
obs_size = 10
np.random.seed(0)

with pm.Model() as model:
    x = pm.Normal("x", mu=0, shape=(size, size))
    y = pm.Normal("y", mu=0, shape=(size, size))
    pm.Normal("obs_x", mu=x, observed=np.random.normal(size=(obs_size, size, size)))
    pm.Normal("obs_y", mu=y, observed=np.random.normal(size=(obs_size, size, size)))
    pm.Normal("obs_z", mu=pt.matmul(x, y), observed=np.random.normal(size=(obs_size, size, size)))
    f = model.compile_logp().f
    pytensor.dprint(f)
    m = pm.sample(cores=1, chains=1, draws=1000)

However, it with OpenBLAS still does not fully load my CPU, only 20% max. I do not use OpenBLAS as MKL works acceptable on AMD nowadays, so I will leave this problem to whoever uses OpenBLAS in the future.