Using the nutpie sampler, we just found a significant performance difference between the default OpenBLAS and Apple's Accelerate library on an M1 Mac. Accelerate is basically MKL for Apple's ARM64 chips.
You can change the BLAS implementation that's installed in your env using:

```shell
micromamba install "libblas=*=*accelerate"
```
Installed it, and I definitely notice a speed-up.
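For anyone who wants to verify that the switch took effect, here is a minimal sketch: print NumPy's linked BLAS and time a large matmul (the size and any timings are illustrative, not from this thread):

```python
import time
import numpy as np

# Show which BLAS/LAPACK NumPy was linked against; after the micromamba
# command above you'd expect to see "accelerate" rather than "openblas".
np.show_config()

# Rough benchmark: a large matmul is BLAS-bound, so running this before
# and after switching libraries gives a quick measure of the speed-up.
n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)
t0 = time.perf_counter()
c = a @ b
print(f"{n}x{n} matmul took {time.perf_counter() - t0:.3f}s")
```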
Does this work on pip as well?
No, it does not work with pip. You get an error asking if you meant "==". At least that's what it did for me.
I think only conda/mamba ship their own libblas; pip would try to use the one installed on your system. So you would probably have to install it there (via brew?) and then make sure everything gets compiled against it.
We should probably add this to the Conda recipe so that it happens automatically.
I definitely have to try to make use of my M2 CPU… I'm still running MKL via Rosetta2, since I only recently exited dependency hell and can't face it again with another new stack.
In case anyone cares, this is my current minimal `condaenv.yml`. Sadly the very latest `pymc=5.7.0` leads to more dependency hell via…
```yaml
# Manually created as-at 2022-02-15
# Last updated as-at 2023-08-02
# + Creates a virtual env for project usage
# + Requires running on an Intel x86_64/AMD64 CPU (or Rosetta2 on macOS)
# + Install with mamba via Makefile, there's also a fuller set of deps to be
#   installed by pip in the pyproject.toml
# + Force MKL version: 2022 version(s) don't work on macOS
#   see https://stackoverflow.com/a/71640311/1165112
# + Force install BLAS with MKL via libblas (note not "blas")
# + Force install numpy MKL: only available in defaults (pkgs/main)
#   see https://github.com/conda-forge/numpy-feedstock/issues/84#issuecomment-385186685
# - defaults
- pkgs/main::numpy>=1.24.3  # force numpy MKL see NOTE
- conda-forge::libblas=*[build=*mkl]  # force BLAS with MKL see NOTE
- conda-forge::libcblas=*[build=*mkl]  # force BLAS with MKL see NOTE
- conda-forge::liblapack=*[build=*mkl]  # force BLAS with MKL see NOTE
- conda-forge::mkl==2021.4.*  # force MKL version see NOTE
```
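For completeness, a hypothetical way to build and sanity-check an env from a spec like this (the `CONDA_SUBDIR` trick is my assumption, not something from the Makefile mentioned above):

```shell
# Force conda/mamba to resolve Intel (osx-64) packages, which is what keeps
# an MKL-based recipe like the one above working under Rosetta2 on Apple Silicon.
CONDA_SUBDIR=osx-64 mamba env create -f condaenv.yml

# Afterwards, confirm NumPy really linked against MKL (look for "mkl" in the output):
python -c "import numpy; numpy.show_config()"
```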
Installation on native macOS ARM64 is quite trivial for me. Give it a shot; I'd give it a 95% probability that it will just work.
Hmm, 95% seems catastrophically low to me. Let's shoot for 99.9% instead!
Thanks @jonsedar for sharing your magic. Let’s add it to the PyTensor conda-forge feedstock. I’m going to have to read through all your notes first.
I'm all ears to any improvements or fat-cutting you can suggest. This is just the result of trial and error and probably contains unnecessary stuff.
I'm installing into the latest macOS (Ventura) via mambaforge.
Any luck with improving performance on your M2? I just got one and would love to learn what worked for you.
If that's aimed at me, I'm still using the same recipe I noted above, with no issues so far. I'll worry about it if/when I need to speed up sampling.
Note that there really is no point in emulating x86 on ARM64 chips; it's just slower for no benefit.
Sounds totally reasonable, and I would love to move to native arm64 processing, but I almost always need to deploy my code to a non-ARM CPU for production usage. Are there likely to be environment issues if I have to port code from ARM to Intel?
No, I don’t see how that could happen.