We’ve found with Stan that the ARM chips in Apple Silicon Macs, with their unified memory, are very fast for sampling, on the order of 2–4 times faster than equivalently priced Intel/AMD-based Windows notebooks. And that’s not using the GPU at all, just CPU-based computation.
I’m using an AMD Ryzen 5600X with 32 GB of RAM and a Radeon 5700 XT (although I’m told that GPU doesn’t provide any significant boost for state-space models).
If you have larger models that hit memory in a random-access pattern, like large mixed-effects models, and especially spatio-temporal models, and you’re running four or more parallel chains, the bottleneck is often memory bandwidth and cache rather than raw CPU speed. In those situations, ARM can be three or four times faster even with slower CPUs.
Yes, it can be optimized, but that can be hard with multiple dimensions. For example, in the model we fit for UK Covid prevalence, we were modeling local authorities of roughly 200K people each, so there were about 400 of them of varying sizes. Then there are daily observations for a year, so you’re looking at 365 × 400 random effects. For autodiff, that’s pretty big.
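To make the scale concrete, here is a minimal Stan sketch of that kind of structure. The data names, priors, and binomial-logit likelihood are illustrative assumptions rather than the actual UK prevalence model; the point is the roughly 400 × 365 block of area-by-day random effects.

```stan
// Minimal sketch of the random-effects structure described above.
// Data names, priors, and likelihood are illustrative assumptions,
// not the actual UK prevalence model.
data {
  int<lower=1> N_area;                      // ~400 local authorities
  int<lower=1> N_day;                       // ~365 days
  array[N_area, N_day] int<lower=0> tests;  // tests per area-day
  array[N_area, N_day] int<lower=0> pos;    // positives per area-day
}
parameters {
  real mu;
  real<lower=0> sigma;
  matrix[N_area, N_day] alpha;              // ~400 x 365 = 146,000 random effects
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 1);
  // Exchangeable prior for illustration; the real model is spatio-temporal.
  to_vector(alpha) ~ normal(mu, sigma);
  for (a in 1:N_area)
    pos[a] ~ binomial_logit(tests[a], alpha[a]');
}
```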
There’s also the issue of the autodiff expression graph for reverse mode. That is very hard to keep memory-local because of the way distributions index their arguments. What we’ve found is that what the autodiff folks call “checkpointing” and the programming-language folks call “partial evaluation” can be super useful for memory locality in autodiff. It basically finds a complete subgraph of the full graph and evaluates autodiff locally for that, which you can do in Stan’s autodiff by stacking; you might be able to do that in PyTensor, too. We did this for parallelism in a map operation (like vmap in JAX, but not nearly as convenient), then were surprised to see that it sped up models even when using just one CPU. The only explanation we have for that behavior is memory locality.
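For a sense of what that looks like at the Stan language level, here is a hedged sketch of a user-level program using map_rect, Stan’s map operation (not the autodiff internals); the shard packing and the area_lp function are my own assumptions.

```stan
functions {
  // Per-area log density; phi = (mu, sigma) shared across shards.
  vector area_lp(vector phi, vector alpha_a,
                 array[] real x_r, array[] int x_i) {
    int N_day = num_elements(alpha_a);
    array[N_day] int pos_a = x_i[1:N_day];
    array[N_day] int tests_a = x_i[(N_day + 1):(2 * N_day)];
    real lp = normal_lpdf(alpha_a | phi[1], phi[2])
              + binomial_logit_lpmf(pos_a | tests_a, alpha_a);
    return [lp]';
  }
}
data {
  int<lower=1> N_area;
  int<lower=1> N_day;
  array[N_area, N_day] int<lower=0> tests;
  array[N_area, N_day] int<lower=0> pos;
}
transformed data {
  // Pack each area's integer data into one shard: positives, then tests.
  array[N_area, 2 * N_day] int x_i;
  array[N_area, 0] real x_r;                // no real-valued shard data
  for (a in 1:N_area) {
    x_i[a, 1:N_day] = pos[a];
    x_i[a, (N_day + 1):(2 * N_day)] = tests[a];
  }
}
parameters {
  real mu;
  real<lower=0> sigma;
  array[N_area] vector[N_day] alpha;        // one shard of effects per area
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 1);
  // One shard per area: each shard's gradient is computed on its own
  // small expression graph rather than one monolithic graph.
  target += sum(map_rect(area_lp, [mu, sigma]', alpha, x_r, x_i));
}
```

Each shard’s log density is evaluated with its own nested autodiff pass, so the reverse sweep for one area only touches that area’s small subgraph instead of the full expression graph, which is where the memory-locality win, even on a single CPU, seems to come from.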