Slow sampling speed with newer versions of PyMC

Packages could in theory set environment variables like this in the activation script, but I really hope nothing does that, and I don’t think we should either. That would mean that simply installing a package into an environment (not even using it!) would have a big impact on how other packages behave.

I think it is more likely that accelerate just doesn’t behave as badly when there are too many worker threads. I’ve seen a couple of examples where I think openblas specifically behaves really strangely if there are misconfigurations like a vastly too-large number of worker threads. I didn’t do any proper investigation, but maybe openblas uses spinlocks too aggressively or something like that?
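For anyone who wants to check which BLAS implementation they are actually running and how many threads it is configured to use, something like this (using threadpoolctl, which I mention below) should show it; just a diagnostic sketch:

```python
# Inspect the native thread pools active in the current process.
# Requires: pip install threadpoolctl
from threadpoolctl import threadpool_info

for pool in threadpool_info():
    # Each entry describes one native library (e.g. openblas, mkl, openmp)
    print(pool["user_api"], pool.get("internal_api"), pool["num_threads"])
```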

I guess what we could do would be to add a num_blas_workers (or similar) kwarg to pm.sample that uses for instance [threadpoolctl](https://github.com/joblib/threadpoolctl) to manage the number of blas workers. We could for instance set it to the same as cores by default. That would provide more reasonable defaults, but still let users control what’s happening. If corresponding env variables are set, we could also default to those…
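Roughly what I have in mind (this is not actual PyMC API, just a sketch; the `_resolve_num_blas_workers` helper and the env var precedence are made up for illustration):

```python
import os
from threadpoolctl import threadpool_limits

def _resolve_num_blas_workers(num_blas_workers, cores):
    # Hypothetical resolution order: explicit kwarg, then common
    # BLAS-related env variables, then the `cores` default.
    if num_blas_workers is not None:
        return num_blas_workers
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        if var in os.environ:
            return int(os.environ[var])
    return cores

def sample(*, cores=4, num_blas_workers=None, **kwargs):
    limit = _resolve_num_blas_workers(num_blas_workers, cores)
    # Cap the BLAS thread pools for the duration of sampling only,
    # instead of changing process-wide environment variables.
    with threadpool_limits(limits=limit, user_api="blas"):
        ...  # run the actual sampler here
```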

In the worker threads in the pymc sampler we could then work out the number of blas workers for each chain (i.e. num_blas_workers // cores, and warn if it isn’t divisible).
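Per chain that could look something like this (again just a sketch, `_blas_workers_per_chain` is a made-up helper name):

```python
import warnings

def _blas_workers_per_chain(num_blas_workers, cores):
    # Split the total BLAS thread budget evenly across chains.
    per_chain, remainder = divmod(num_blas_workers, cores)
    if remainder:
        warnings.warn(
            f"num_blas_workers={num_blas_workers} is not divisible by "
            f"cores={cores}; each chain will use {per_chain} blas workers."
        )
    # Never hand a chain zero threads.
    return max(per_chain, 1)
```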

I guess for nutpie we could pass num_blas_workers directly; I think that uses a common blas thread pool…