Slow sampling speed with newer versions of PyMC

I can't benchmark this directly myself because I don't have a Mac, but I set up a small pixi repo that should make experimenting with this a bit easier and more reproducible.

On Linux I get an ETA of ~15 min early in tuning (I wasn't patient enough to wait for it to finish) with both openblas and mkl. To test this on a Mac, change the platform entry in pixi.toml to "osx-64", or to "osx-arm64" on an ARM machine, and change all mkl mentions to accelerate. You can then run the different configurations with:

# Run with an old pymc
pixi run --environment openblas-old run-benchmarks
# Run with a new pymc
pixi run --environment openblas-new run-benchmarks
# Run with mkl and old pymc (or change to accelerate on mac)
pixi run --environment mkl-old run-benchmarks

The pixi.toml file also sets the number of threads for openblas and mkl to 1. Different values might make it faster or slower, depending a lot on how many cores you have.
Keep in mind that the total thread count is 4 times that number, because the 4 chains run in parallel and each one uses BLAS independently.
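
To make the thread arithmetic concrete, here is a rough Python sketch. It is not taken from the repo: the helper names `blas_thread_env` and `total_blas_threads` are made up for illustration, and I'm assuming the thread limits are controlled through the usual environment variables (OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, and VECLIB_MAXIMUM_THREADS for Accelerate), which have to be set before the BLAS library is loaded.

```python
import os


def blas_thread_env(threads_per_chain: int) -> dict[str, str]:
    """Hypothetical helper: env vars that cap BLAS threads per process."""
    value = str(threads_per_chain)
    return {
        "OPENBLAS_NUM_THREADS": value,    # read by OpenBLAS
        "MKL_NUM_THREADS": value,         # read by Intel MKL
        "VECLIB_MAXIMUM_THREADS": value,  # read by Accelerate on macOS
    }


def total_blas_threads(chains: int, threads_per_chain: int) -> int:
    """Each chain uses BLAS independently, so the counts multiply."""
    return chains * threads_per_chain


# Set the limits before importing numpy / pytensor / pymc, because the
# BLAS library reads them when it is first loaded.
os.environ.update(blas_thread_env(1))

for n in (1, 2, 4):
    total = total_blas_threads(4, n)
    print(f"{n} BLAS thread(s) per chain x 4 chains = {total} threads")
```

If the total ends up well above the number of physical cores, the chains will probably compete with each other, so it seems worth trying a few values rather than assuming more threads is faster.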

About the nutpie problems: this has got to be slow because of the damned AdvancedSetSubtensor op in that graph. We really need to fix this, or at least come up with rewrites that solve most of those cases…

Edit

Link to the repo: GitHub - aseyboldt/tmp-benchmark-rethink
