Also with regards to
Hopefully the linear algebra you used gives you performance gains, too.
I’ve noticed that the linear algebra libraries installed on most cloud instances I spin up are parallelized by default, e.g., PARPACK. I wonder i) whether any of your observed speedup comes from “hidden” library-level parallelism, and ii) whether pymc4 (or even pymc3) turns off fine-grained (library-level) parallelism when multiple chains are being run in parallel. I’ve seen cores=4 seemingly eat up all 8 cores, so maybe not?
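One way to test (i) would be to pin the library thread pools to a single thread and re-run the timing. Here is a rough sketch, assuming the installed backend honours the usual thread-count environment variables (OpenMP, OpenBLAS, and MKL builds generally do; I have not checked PARPACK). The helper name `limit_library_threads` is made up for illustration, and the variables have to be set before numpy or pymc is imported to take effect.

```python
import os

# Assumption: the linear algebra backend reads one of these common variables.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def limit_library_threads(n_threads=1):
    """Cap library-level threads (hypothetical helper, not part of pymc).

    Call this before importing numpy, theano, or pymc; most backends
    read these variables only once, when they are first loaded.
    """
    for var in THREAD_VARS:
        os.environ[var] = str(n_threads)
    # Return what was set so it can be logged alongside the timings.
    return {var: os.environ[var] for var in THREAD_VARS}


limits = limit_library_threads(1)

# Only now import the numerical stack and sample, for example:
# import pymc3 as pm
# with model:
#     trace = pm.sample(chains=4, cores=4)
```

If the run with one library thread per process is noticeably slower on a single chain, some of the speedup was probably coming from the library. If cores=4 then stays at roughly four busy cores instead of eight, that would suggest the chains were oversubscribing through library threads.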