I can’t benchmark this directly myself because I don’t have a Mac, but I set up a small pixi repo that should hopefully make experimenting with this a bit easier and more reproducible.
On Linux I get an ETA of ~15 min early in tuning (I wasn’t patient enough to wait for it to finish) with both openblas and mkl. To test this on a Mac, change the platform entry in pixi.toml to "osx-64", or to "osx-arm64" on an ARM machine, and change all mentions of mkl to accelerate. You can then run the different configurations with:
# Run with an old pymc
pixi run --environment openblas-old run-benchmarks
# Run with a new pymc
pixi run --environment openblas-new run-benchmarks
# Run with mkl and old pymc (or change to accelerate on a Mac)
pixi run --environment mkl-old run-benchmarks
The pixi.toml file also sets the number of threads for openblas and mkl to 1; different values might make things faster or slower, depending a lot on how many cores you have.
Keep in mind that it will use four times that number of threads in total, because the four chains run in parallel and each uses BLAS independently.
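If you want to sanity-check what each process actually ends up with, threadpoolctl (an extra dependency, not something the repo installs) can report which BLAS is loaded and how many threads it will use; multiply that by the number of parallel chains for the total. Something like:

from threadpoolctl import threadpool_info

# Show the loaded BLAS implementation(s) and their configured thread counts.
for pool in threadpool_info():
    print(pool["internal_api"], pool["num_threads"])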
About the nutpie problems: this has got to be slow because of the damned AdvancedSetSubtensor op in that graph. We really need to fix this, or at least come up with rewrites that handle most of those cases…
That is interesting.
Why would sqlite have anything to do with this?
And does it also work if you specify the libblas build as openblas (i.e. conda install libblas=*=*openblas*)?
I got the same slow speed on Linux (via WSL2), fwiw.
Edit: I didn’t think to check performance with cores=1 at first. That gets me to the ~15 minutes you reported early in tuning, and it goes down from there to around 5. So it might be something to do with a resource bottleneck due to BLAS?
Could it simply be that the number of BLAS threads isn’t being set?
I also get pretty bad performance if I don’t set those, because BLAS tries to use way too many cores.
Setting those at the top of a script doesn’t seem to do anything for me; am I missing something?
Also, if it is just a BLAS resource problem, I’m still curious why the specific bundle of packages in that rethinking repo results in good sampling while the fresh install doesn’t.
How many threads is it using? You can at least sanity-check that with htop or the system monitor. If there are more worker threads in total than you have cores, something is doing something stupid…
Depending on the BLAS you are using, you might also have to set, for instance, OPENBLAS_NUM_THREADS. Which BLAS recognizes which environment variable can be a bit messy…
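If you set them from Python, I think they also have to be set before numpy is first imported anywhere in the process, because the thread pool is configured when the BLAS library is loaded; that might be why setting them at the top of a script appeared to do nothing (for example in a notebook where numpy was already imported). A rough sketch, setting several variables just to cover the common implementations:

import os

# These must run before numpy (or anything that imports numpy) is loaded,
# since BLAS reads them when its thread pool is initialized.
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
import pymc as pm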
Packages could in theory set environment variables like this in the activation script, but I really hope nothing does that, and I don’t think we should either. That would mean that simply installing a package into an environment (not even using it!) would have a big impact on how other packages behave.
I think it is more likely that accelerate doesn’t behave as badly when there are too many worker threads. I’ve seen a couple of examples where openblas specifically seems to behave really strangely under misconfigurations like a vastly too large number of worker threads. I didn’t do any proper investigation, but maybe openblas uses spinlocks too aggressively or something like that?
In the pymc sampler’s workers we could then work out the number of BLAS workers for each chain (i.e. num_blas_workers // cores, warning if it isn’t divisible).
For nutpie I guess we could pass num_blas_workers directly; I think it uses a common BLAS thread pool…
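Something like this is the logic I have in mind (purely a sketch; none of these names exist in pymc or nutpie):

import warnings

def blas_threads_per_chain(num_blas_workers: int, cores: int) -> int:
    """Split a total BLAS thread budget evenly across parallel chains."""
    if num_blas_workers % cores != 0:
        warnings.warn(
            f"num_blas_workers={num_blas_workers} is not divisible by "
            f"cores={cores}; rounding down."
        )
    return max(1, num_blas_workers // cores)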
@jessegrabowski I went through that env.yml and matched the packages one by one by downgrading them in the fresh environment. That is how I found out that the openblas and sqlite packages were the culprits. I don’t know whether those specific versions of those packages set environment variables upon installation. As for sampling speed: after downgrading those packages I get a wall time of around 2 min, and PyMC says it took around 50 seconds to sample the model.
I pushed a change to the pixi repo that hopefully should work on osx-arm. Could you check whether it works?
It needs pixi installed (for instance brew install pixi if you are using brew) and then:
git clone https://github.com/aseyboldt/tmp-benchmark-rethink.git
cd tmp-benchmark-rethink
pixi run --environment openblas-new run-benchmarks
pixi run --environment openblas-old run-benchmarks
This should run the benchmarks with a new and an old pymc, using openblas in both.
I am not sure why these runs have divergences; when I run it after installing the old version of sqlite I get no divergences, and it still samples much faster.
I also got the divergences when I ran the model with the BLAS flags correctly set. Are you sure the model I copied there is the same one you were running before? It’s the “naive” one.
Thanks. That makes sense so far. I don’t know where the slight performance difference is coming from, but that might just be noise.
The logp function of this model is very much bound by dense linear algebra, so the version of BLAS should have a large impact. openblas also doesn’t seem to be all that well optimized for ARM Macs, so it would make sense if it performs significantly worse than accelerate.
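If someone wants to verify that directly, timing the compiled logp of a linear-algebra-heavy model shows the BLAS effect without running the sampler at all. A minimal sketch (a stand-in model, not the one from the repo):

import time
import numpy as np
import pymc as pm

# A small model whose logp is dominated by dense linear algebra.
with pm.Model() as model:
    chol, _, _ = pm.LKJCholeskyCov(
        "chol", n=50, eta=2.0, sd_dist=pm.Exponential.dist(1.0)
    )
    pm.MvNormal("y", mu=np.zeros(50), chol=chol, observed=np.random.randn(100, 50))

logp_fn = model.compile_logp()
point = model.initial_point()

start = time.perf_counter()
for _ in range(200):
    logp_fn(point)
print("seconds per logp eval:", (time.perf_counter() - start) / 200)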
I also included an accelerate env in the repo. Could you try what this does?
git pull origin
pixi run --environment accelerate-new run-benchmarks
pixi run --environment accelerate-old run-benchmarks
I really can’t imagine what sqlite could possibly have to do with this; maybe downgrading it just somehow changes which versions of other packages get installed. I’m pretty sure this has to be either a compiler or a BLAS issue, and I’d be willing to bet by now that it is BLAS.
If this works, you could then try to increase the number of threads the different BLAS implementations use by changing the thread-count values (OPENBLAS_NUM_THREADS and friends) in the pixi.toml files.
My guess as to what’s happening is that different version configurations give you different BLAS implementations, each of which has its own strategy for choosing the number of threads, and they might scale in very different ways as the number of threads changes.
I think we missed something when we pulled the model out of the notebook. I ran the pasteable model from above to match what we were doing with pixi, and I am now getting the same divergences using the environment that has the downgraded openblas + sqlite. It is running a little bit faster, though.
Just increase all of them (though the last one should take precedence).
I don’t think there is any way anyone could know the optimal number other than by trying.
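If you’d rather not edit pixi.toml for every attempt, threadpoolctl can also override the pool size at runtime (assuming it recognizes the loaded BLAS), so you can just sweep a few values:

import time
import numpy as np
from threadpoolctl import threadpool_limits

# Time a dense matmul (a rough stand-in for the model's logp) under
# different BLAS thread counts to see which one is fastest here.
a = np.random.rand(2000, 2000)
for n_threads in [1, 2, 4, 8]:
    with threadpool_limits(limits=n_threads, user_api="blas"):
        start = time.perf_counter()
        for _ in range(10):
            a @ a
        print(n_threads, "threads:", (time.perf_counter() - start) / 10, "s")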