Thank you! I ran into the same problem when trying to estimate resource requirements on the login node of our HPC cluster before submitting jobs. pm.fit() would use lots of threads there, which vastly increased its running time in my case. Setting import os; os.environ['OPENBLAS_NUM_THREADS'] = '4' at the beginning of my scripts seems to work well to limit this.
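For anyone copying this, here is a minimal sketch of what I mean. It assumes NumPy is linked against OpenBLAS; the MKL and OpenMP variables are my additions for other builds and are not something I tested here. The value 4 is just what worked for me. As far as I know, the variable has to be set before NumPy (and therefore PyMC) is imported for the first time, since the BLAS library reads it when it is loaded.

```python
import os

# Assumption: 4 threads, the value that worked in my case; tune per machine.
N_THREADS = "4"

# Set the limit before NumPy / PyMC are imported for the first time.
os.environ["OPENBLAS_NUM_THREADS"] = N_THREADS

# Assumption: builds using MKL or OpenMP read these instead of the OpenBLAS
# variable. Setting them is harmless if they are not used.
os.environ["MKL_NUM_THREADS"] = N_THREADS
os.environ["OMP_NUM_THREADS"] = N_THREADS

# Only import the numerical libraries after the variables are set, e.g.:
# import pymc as pm
```

The same thing can be done outside the script by exporting the variable in the shell or job script before starting Python.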
Interestingly, limiting the threads on my laptop also makes pm.fit() run a bit faster than letting it use all cores (which it does by default). I don’t understand enough about what’s going on behind the scenes, but it seems that the sweet spot is around 4 threads?