Slow sampling speed with newer versions of PyMC

I can’t directly benchmark this myself, because I don’t have a mac. But I set up a small pixi repo that should hopefully make experimentation with this a bit easier and more reproducable?

On linux I get an ETA of ~15min early in tuning (I wasn’t patient enough to wait) with both openblas and mkl. To test this on mac, you can change the platform entry in pixi.toml to either "osx-64" or "osx-arm64" on a ARM machine, and change all mkl mentions to accellerate. You can then run different configurations with

# Run with an old pymc
pixi run --environment openblas-old run-benchmarks
# Run with a new pymc
pixi run --environment openblas-old run-benchmarks
# Run with mkl and old pymc (or change to accellerate on mac)
pixi run --environment mkl-old run-benchmarks

In the pixi.toml file this also sets the number of threads for openblas and mkl to 1, different values might make it faster or slower, also depending a lot on how many cores you have.
Keep in mind that it will use 4 times the number of threads, because of the 4 chains that are run in parallel, which each use blas independently.

About the nutpie problems: This is gotta be slow because of the dammed AdvancedSetSubtensor op in that graph. We really need to fix this, or at least come up with rewrites that solve most of those cases…

Edit

Link to the repo: GitHub - aseyboldt/tmp-benchmark-rethink

1 Like

That is interesting :slight_smile:
Why would sqlite have anything to do with this?
And does it also work if you specify the libblas build as openblas (ie conda install libblas=*=*openblas*)?

I got the same slow speed on linux (via WLS2), fwiw

Edit: but i didn’t think to check performance with cores=1, this gets me to the ~15 minutes you reported early in tuning, and it goes down from there to around 5. So it might be something to do with a resource bottleneck due to BLAS?

Could it simply be the lack of setting the number of threads in blas?
I also get pretty bad performance if I don’t set those, because blas is trying to use way too many cores.

import os
os.environ['MKL_NUM_THREADS'] = '1'

At the top of a script doesn’t seem to do anything for me, am I missing something?

Also if it is just a BLAS resource problem, I’m still curious why the specific bundle of packages in that rethinking repo results in good sampling, while the fresh install doesn’t

How many threads is it using? You can at least sanity check it with htop or the system monitor or so. If there are more worker threads in total than you have cores something is doing something stupid…

Depending on the blas you are using, you might also have to set for instance OPENBLAS_NUM_THREADS. Which blas recognizes which environment variable can be a bit messy…

Ok it was os.environ['OMP_NUM_THREADS'] = '1' in my case, and it fixed the problem.

So I’m back to wondering what about this environment.yaml fixes this. I assume some random package is somehow setting this variable?

name: statistical-rethinking-2023
channels:
  - conda-forge
dependencies:
  - anyio=4.0.0=pyhd8ed1ab_0
  - appnope=0.1.3=pyhd8ed1ab_0
  - argon2-cffi=23.1.0=pyhd8ed1ab_0
  - argon2-cffi-bindings=21.2.0=py310h8e9501a_3
  - arrow=1.2.3=pyhd8ed1ab_0
  - arviz=0.16.1=pyhd8ed1ab_0
  - asttokens=2.4.0=pyhd8ed1ab_0
  - async-lru=2.0.4=pyhd8ed1ab_0
  - atk-1.0=2.38.0=hcb7b3dd_1
  - attrs=23.1.0=pyh71513ae_1
  - babel=2.12.1=pyhd8ed1ab_1
  - backcall=0.2.0=pyh9f0ad1d_0
  - backports=1.0=pyhd8ed1ab_3
  - backports.functools_lru_cache=1.6.5=pyhd8ed1ab_0
  - beautifulsoup4=4.12.2=pyha770c72_0
  - blas=2.117=accelerate
  - blas-devel=3.9.0=17_osxarm64_accelerate
  - bleach=6.0.0=pyhd8ed1ab_0
  - brotli=1.1.0=hb547adb_0
  - brotli-bin=1.1.0=hb547adb_0
  - brotli-python=1.1.0=py310h1253130_0
  - bzip2=1.0.8=h3422bc3_4
  - c-ares=1.19.1=hb547adb_0
  - ca-certificates=2023.7.22=hf0a4a13_0
  - cached-property=1.5.2=hd8ed1ab_1
  - cached_property=1.5.2=pyha770c72_1
  - cachetools=5.3.1=pyhd8ed1ab_0
  - cairo=1.16.0=hc5b65c1_1017
  - cctools_osx-arm64=973.0.1=h2a25c60_14
  - certifi=2023.7.22=pyhd8ed1ab_0
  - cffi=1.15.1=py310h2399d43_3
  - charset-normalizer=3.2.0=pyhd8ed1ab_0
  - clang=15.0.7=hce30654_3
  - clang-15=15.0.7=default_h5dc8d65_3
  - clang_osx-arm64=15.0.7=h77e971b_3
  - clangxx=15.0.7=default_h610c423_3
  - clangxx_osx-arm64=15.0.7=h768a7fd_3
  - cloudpickle=2.2.1=pyhd8ed1ab_0
  - colorama=0.4.6=pyhd8ed1ab_0
  - comm=0.1.4=pyhd8ed1ab_0
  - compiler-rt=15.0.7=hf8d1dfb_1
  - compiler-rt_osx-arm64=15.0.7=hf8d1dfb_1
  - cons=0.4.6=pyhd8ed1ab_0
  - contourpy=1.1.0=py310h38f39d4_0
  - cycler=0.11.0=pyhd8ed1ab_0
  - debugpy=1.7.0=py310h1253130_0
  - decorator=5.1.1=pyhd8ed1ab_0
  - defusedxml=0.7.1=pyhd8ed1ab_0
  - entrypoints=0.4=pyhd8ed1ab_0
  - etuples=0.3.9=pyhd8ed1ab_0
  - exceptiongroup=1.1.3=pyhd8ed1ab_0
  - executing=1.2.0=pyhd8ed1ab_0
  - expat=2.5.0=hb7217d7_1
  - fastprogress=1.0.3=pyhd8ed1ab_0
  - filelock=3.12.3=pyhd8ed1ab_0
  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
  - font-ttf-inconsolata=3.000=h77eed37_0
  - font-ttf-source-code-pro=2.038=h77eed37_0
  - font-ttf-ubuntu=0.83=hab24e00_0
  - fontconfig=2.14.2=h82840c6_0
  - fonts-conda-ecosystem=1=0
  - fonts-conda-forge=1=0
  - fonttools=4.42.1=py310h2aa6e3c_0
  - fqdn=1.5.1=pyhd8ed1ab_0
  - freetype=2.12.1=hd633e50_1
  - fribidi=1.0.10=h27ca646_0
  - gdk-pixbuf=2.42.10=h1ac0d0d_2
  - gettext=0.21.1=h0186832_0
  - giflib=5.2.1=h1a8c8d9_3
  - graphite2=1.3.13=h9f76cd9_1001
  - graphviz=8.1.0=h10878c0_0
  - gtk2=2.24.33=h57013de_2
  - gts=0.7.6=he42f4ea_4
  - h5netcdf=1.2.0=pyhd8ed1ab_0
  - h5py=3.9.0=nompi_py310hba14233_102
  - harfbuzz=8.2.0=hf1a6348_0
  - hdf5=1.14.2=nompi_h3aba7b3_100
  - icu=73.2=hc8870d7_0
  - idna=3.4=pyhd8ed1ab_0
  - importlib-metadata=6.8.0=pyha770c72_0
  - importlib_metadata=6.8.0=hd8ed1ab_0
  - importlib_resources=6.0.1=pyhd8ed1ab_0
  - ipykernel=6.25.2=pyh1050b4e_0
  - ipython=8.15.0=pyh31c8845_0
  - isoduration=20.11.0=pyhd8ed1ab_0
  - jax=0.4.14=pyhd8ed1ab_1
  - jaxlib=0.4.14=cpu_py310hec32d7d_1
  - jedi=0.19.0=pyhd8ed1ab_0
  - jinja2=3.1.2=pyhd8ed1ab_1
  - json5=0.9.14=pyhd8ed1ab_0
  - jsonpointer=2.0=py_0
  - jsonschema=4.19.0=pyhd8ed1ab_1
  - jsonschema-specifications=2023.7.1=pyhd8ed1ab_0
  - jsonschema-with-format-nongpl=4.19.0=pyhd8ed1ab_1
  - jupyter-lsp=2.2.0=pyhd8ed1ab_0
  - jupyter_client=8.3.1=pyhd8ed1ab_0
  - jupyter_core=5.3.1=py310hbe9552e_0
  - jupyter_events=0.7.0=pyhd8ed1ab_2
  - jupyter_server=2.7.3=pyhd8ed1ab_0
  - jupyter_server_terminals=0.4.4=pyhd8ed1ab_1
  - jupyterlab=4.0.5=pyhd8ed1ab_0
  - jupyterlab_pygments=0.2.2=pyhd8ed1ab_0
  - jupyterlab_server=2.24.0=pyhd8ed1ab_0
  - kiwisolver=1.4.5=py310h38f39d4_0
  - krb5=1.21.2=h92f50d5_0
  - lcms2=2.15=hd835a16_1
  - ld64_osx-arm64=609=hc4dc95b_14
  - lerc=4.0.0=h9a09cb3_0
  - libabseil=20230125.3=cxx17_h13dd4ca_0
  - libaec=1.0.6=hb7217d7_1
  - libblas=3.9.0=17_osxarm64_accelerate
  - libbrotlicommon=1.1.0=hb547adb_0
  - libbrotlidec=1.1.0=hb547adb_0
  - libbrotlienc=1.1.0=hb547adb_0
  - libcblas=3.9.0=17_osxarm64_accelerate
  - libclang-cpp15=15.0.7=default_h5dc8d65_3
  - libcurl=8.2.1=hc52a3a8_0
  - libcxx=16.0.6=h4653b0c_0
  - libdeflate=1.18=h1a8c8d9_0
  - libedit=3.1.20191231=hc8eb9b7_2
  - libev=4.33=h642e427_1
  - libexpat=2.5.0=hb7217d7_1
  - libffi=3.4.2=h3422bc3_5
  - libgd=2.3.3=h32cdd76_7
  - libgfortran=5.0.0=13_2_0_hd922786_1
  - libgfortran5=13.2.0=hf226fd6_1
  - libglib=2.78.0=h24e9cb9_0
  - libgrpc=1.56.2=h9075ed4_1
  - libiconv=1.17=he4db4b2_0
  - libjpeg-turbo=2.1.5.1=h1a8c8d9_0
  - liblapack=3.9.0=17_osxarm64_accelerate
  - liblapacke=3.9.0=17_osxarm64_accelerate
  - libllvm15=15.0.7=h504e6bf_3
  - libnghttp2=1.52.0=hae82a92_0
  - libopenblas=0.3.24=openmp_hd76b1f2_0
  - libpng=1.6.39=h76d750c_0
  - libprotobuf=4.23.3=hf32f9b9_1
  - librsvg=2.56.3=h0db3404_0
  - libsodium=1.0.18=h27ca646_1
  - libsqlite=3.43.0=hb31c410_0
  - libssh2=1.11.0=h7a5bd25_0
  - libtiff=4.5.1=h23a1a89_1
  - libtool=2.4.7=hb7217d7_0
  - libwebp=1.3.1=h3dd3bb6_0
  - libwebp-base=1.3.1=hb547adb_0
  - libxcb=1.15=hf346824_0
  - libxml2=2.11.5=h25269f3_1
  - libzlib=1.2.13=h53f4e23_5
  - llvm-openmp=16.0.6=h1c12783_0
  - llvm-tools=15.0.7=h504e6bf_3
  - logical-unification=0.4.6=pyhd8ed1ab_0
  - markupsafe=2.1.3=py310h2aa6e3c_0
  - matplotlib-base=3.7.2=py310h49faba3_0
  - matplotlib-inline=0.1.6=pyhd8ed1ab_0
  - minikanren=1.0.3=pyhd8ed1ab_0
  - mistune=3.0.1=pyhd8ed1ab_0
  - ml_dtypes=0.2.0=py310h1cdf563_1
  - multipledispatch=0.6.0=py_0
  - munkres=1.1.4=pyh9f0ad1d_0
  - nbclient=0.8.0=pyhd8ed1ab_0
  - nbconvert-core=7.8.0=pyhd8ed1ab_0
  - nbformat=5.9.2=pyhd8ed1ab_0
  - ncurses=6.4=h7ea286d_0
  - nest-asyncio=1.5.6=pyhd8ed1ab_0
  - networkx=3.1=pyhd8ed1ab_0
  - notebook-shim=0.2.3=pyhd8ed1ab_0
  - numpy=1.25.2=py310haa1e00c_0
  - numpyro=0.13.0=pyhd8ed1ab_0
  - openblas=0.3.24=openmp_hce3e5ba_0
  - openjpeg=2.5.0=hbc2ba62_2
  - openssl=3.1.2=h53f4e23_0
  - opt_einsum=3.3.0=pyhd8ed1ab_1
  - overrides=7.4.0=pyhd8ed1ab_0
  - packaging=23.1=pyhd8ed1ab_0
  - pandas=2.1.0=py310h5924a0a_0
  - pandocfilters=1.5.0=pyhd8ed1ab_0
  - pango=1.50.14=hcf40dda_2
  - parso=0.8.3=pyhd8ed1ab_0
  - patsy=0.5.3=pyhd8ed1ab_0
  - pcre2=10.40=hb34f9b4_0
  - pexpect=4.8.0=pyh1a96a4e_2
  - pickleshare=0.7.5=py_1003
  - pillow=10.0.0=py310h60ecbdf_0
  - pip=23.2.1=pyhd8ed1ab_0
  - pixman=0.40.0=h27ca646_0
  - pkgutil-resolve-name=1.3.10=pyhd8ed1ab_0
  - platformdirs=3.10.0=pyhd8ed1ab_0
  - pooch=1.7.0=pyha770c72_3
  - prometheus_client=0.17.1=pyhd8ed1ab_0
  - prompt-toolkit=3.0.39=pyha770c72_0
  - prompt_toolkit=3.0.39=hd8ed1ab_0
  - psutil=5.9.5=py310h8e9501a_0
  - pthread-stubs=0.4=h27ca646_1001
  - ptyprocess=0.7.0=pyhd3deb0d_0
  - pure_eval=0.2.2=pyhd8ed1ab_0
  - pycparser=2.21=pyhd8ed1ab_0
  - pygments=2.16.1=pyhd8ed1ab_0
  - pymc=5.8.0=hd8ed1ab_0
  - pymc-base=5.8.0=pyhd8ed1ab_0
  - pyobjc-core=9.2=py310hd07e440_0
  - pyobjc-framework-cocoa=9.2=py310hd07e440_0
  - pyparsing=3.0.9=pyhd8ed1ab_0
  - pysocks=1.7.1=pyha2e5f31_6
  - pytensor=2.15.0=py310h5993952_0
  - pytensor-base=2.15.0=py310h5924a0a_0
  - python=3.10.12=h01493a6_0_cpython
  - python-dateutil=2.8.2=pyhd8ed1ab_0
  - python-fastjsonschema=2.18.0=pyhd8ed1ab_0
  - python-graphviz=0.20.1=pyh22cad53_0
  - python-json-logger=2.0.7=pyhd8ed1ab_0
  - python-tzdata=2023.3=pyhd8ed1ab_0
  - python_abi=3.10=3_cp310
  - pytz=2023.3.post1=pyhd8ed1ab_0
  - pyyaml=6.0.1=py310h2aa6e3c_0
  - pyzmq=25.1.1=py310h30b7201_0
  - re2=2023.03.02=hc5e2d97_0
  - readline=8.2=h92ec313_1
  - referencing=0.30.2=pyhd8ed1ab_0
  - requests=2.31.0=pyhd8ed1ab_0
  - rfc3339-validator=0.1.4=pyhd8ed1ab_0
  - rfc3986-validator=0.1.1=pyh9f0ad1d_0
  - rpds-py=0.10.2=py310had9acf8_0
  - scipy=1.11.2=py310h0975f3d_0
  - seaborn=0.12.2=hd8ed1ab_0
  - seaborn-base=0.12.2=pyhd8ed1ab_0
  - send2trash=1.8.2=pyhd1c38e8_0
  - setuptools=68.1.2=pyhd8ed1ab_0
  - sigtool=0.1.3=h44b9a77_0
  - six=1.16.0=pyh6c4a22f_0
  - sniffio=1.3.0=pyhd8ed1ab_0
  - soupsieve=2.5=pyhd8ed1ab_1
  - stack_data=0.6.2=pyhd8ed1ab_0
  - statsmodels=0.14.0=py310ha11ecec_1
  - tapi=1100.0.11=he4954df_0
  - terminado=0.17.1=pyhd1c38e8_0
  - tinycss2=1.2.1=pyhd8ed1ab_0
  - tk=8.6.12=he1e0b03_0
  - tomli=2.0.1=pyhd8ed1ab_0
  - toolz=0.12.0=pyhd8ed1ab_0
  - tornado=6.3.3=py310h2aa6e3c_0
  - tqdm=4.66.1=pyhd8ed1ab_0
  - traitlets=5.9.0=pyhd8ed1ab_0
  - typing-extensions=4.7.1=hd8ed1ab_0
  - typing_extensions=4.7.1=pyha770c72_0
  - typing_utils=0.1.0=pyhd8ed1ab_0
  - tzdata=2023c=h71feb2d_0
  - unicodedata2=15.0.0=py310h8e9501a_0
  - uri-template=1.3.0=pyhd8ed1ab_0
  - urllib3=2.0.4=pyhd8ed1ab_0
  - watermark=2.4.3=pyhd8ed1ab_0
  - wcwidth=0.2.6=pyhd8ed1ab_0
  - webcolors=1.13=pyhd8ed1ab_0
  - webencodings=0.5.1=py_1
  - websocket-client=1.6.3=pyhd8ed1ab_0
  - wheel=0.41.2=pyhd8ed1ab_0
  - xarray=2023.8.0=pyhd8ed1ab_0
  - xarray-einstats=0.6.0=pyhd8ed1ab_0
  - xorg-libxau=1.0.11=hb547adb_0
  - xorg-libxdmcp=1.1.3=h27ca646_0
  - xz=5.2.6=h57fd34a_0
  - yaml=0.2.5=h3422bc3_2
  - zeromq=4.3.4=hbdafb3b_1
  - zipp=3.16.2=pyhd8ed1ab_0
  - zlib=1.2.13=h53f4e23_5
  - zstd=1.5.5=h4f39d0f_0
prefix: /opt/homebrew/Caskroom/miniforge/base/envs/statistical-rethinking-2023

Also, if it is possible for a package to set it, should we be doing that?

Packages could in theory set environment variables like this in the activation script, but I really hope nothing does that, and I don’t think we should either. That would mean that simply installing a package into an environment (not even using it!) would have a big impact on how other packages behave.

I think it is more likely that accelerate doesn’t behave so badly if there are too many worker threads. I’ve seen a couple of examples where I think openblas specifically behaves really strange if there are misconfigurations like a vastly too large number of worker threads. I didn’t do any proper investigation, but maybe openblas uses spinnlocks for aggressively or something like that?

I guess what we could do, would be to add a num_blas_workers or so kwarg to pm.sample, that uses for instance GitHub - joblib/threadpoolctl: Python helpers to limit the number of threads used in native libraries that handle their own internal threadpool (BLAS and OpenMP implementations) to manage the number of blas workers. We could for instance set it to the same as cores by default. That would provide more reasonable defaults, but still let users control what’s happening. If corresponding env variables are set, we could also default to those…

In the worker threads in the pymc sampler we could then work out the number of blas workers for each chain (ie num_blas_workers // cores and warn if it isn’t divisible).

I guess for nutpie we could pass num_blas_workers directly, I think that should use a common blas thread pool…

I opened this about the nutpie issue: Handle more AdvancedSetSubtensor ops in numba · Issue #772 · pymc-devs/pytensor · GitHub

@jessegrabowski I went through that env.yml and one by one matched the packages by downgrading packages in the fresh environment. That is how I found out that the open_blas and sqlite packages were the culprits. I don’t know if those specific versions of those packages set environment variables upon installation. When it comes to sampling speed, after downgrading those packages I get a wall time of around 2 mins and PyMC says that it took around 50 seconds to sample the model.

I am using a MacBook Pro with an M1 Pro cpu in it.

1 Like

@aseyboldt I was not able to install the libblas build as openblas. Conda can’t find the package in either conga-forge or defaults channels.

That’s strange. I see the package, and pixi also seems to solve with that requirement for osx just fine…

https://anaconda.org/conda-forge/libblas/files
lists the package that I think we want here: osx-arm64/libblas-3.9.0-22_osxarm64_openblas.conda

I pushed a change to the pixi repo that hopefully should work on osx-arm. Could you try if that works?
It needs pixi installed (for instance brew install pixi if you are using brew) and then

git clone https://github.com/aseyboldt/tmp-benchmark-rethink.git
cd tmp-benchmark-rethink
pixi run --environment openblas-new run-benchmarks
pixi run --environment openblas-old run-benchmarks

Should run the benchmark with an old and a new pymc, using openblas in both.

1 Like

@aseyboldt I ran the above. Here are the results for openblas-new:

here are the results for openblas-old:

I am not sure why these have divergences, when I run it after installing the old version of sqlite I get no divergences and it is still sampling much faster

1 Like

I also got the divergences when I ran the model with the BLAS flags correctly set. Are you sure the model I copied there is the same one you were running before? It’s the “naive” one.

Yeah, I am sure. That is the one I am running. Let me check the code and see if there are any transformations that we may have missed.

Thanks. That makes sense so far. I don’t know where the slight performance difference is coming from, but that might just be noise.
The logp function of this model very much bound by some dense linear algebra, so the version of blas should have a large impact. openblas also doesn’t seem to be all that well optimized for arm macs, so it might make sense if that has significantly worse performance than accelerate.

I also included a accelerate env in the repo. Could you try what this does?

git pull origin
pixi run --environment accelerate-new run-benchmarks
pixi run --environment accelerate-old run-benchmarks

I really can’t imagine what sqlite could possibly have to do with this, maybe this just somehow changes other versions of installed packages. I’m pretty sure this has to be either a compiler or a blas issue, and I’d be willing to bet by now that it is blas.

If this works, you could then try to increase the number of threads the different blas implementation use, by changing the values in the pixi.toml files:

env = { OMP_NUM_THREADS = "1", MKL_NUM_THREADS = "1", VECLIB_MAXIMUM_THREADS = "1" }

My guess as to what’s happening is that with different version configurations you end up with different blas implementations, which each have different strategies for choosing the number of threads, and might scale in very different ways if the number of threads changes.

1 Like

I think we missed something when we pulled out the model from the notebook. I ran the pasteable model from above to match what we were doing with pixi and I am now getting the same divergences using the environment that has the downgraded openblas + sqlite. It is running a little bit faster, though.

Here is the result when running accelerate-new:

and here are the results when running accelerate-old

So how do I know which one of the thread options to increase? Is there a way I can tell it to automatically use the optimal threads?

Just increase all of them. (but the last one should take precedence)
I don’t think there is any way anyone could even know what the optimal number is, other than trying it.

Okay, I will try a fresh environment and I will try to increase the threads. Thank you for your help, @aseyboldt