State Space: Exogenous Vars - Conflicting dimensions for time

What kind of CPU/GPU/memory config do you have?

We’ve found with Stan that the ARM chips in Apple machines, with their unified memory, are very fast for sampling: on the order of 2–4 times faster than equivalently priced Intel/AMD-based Windows notebooks. That’s CPU-based computation only, not using the GPU.

Hi @bob-carpenter ,

I’m using an AMD Ryzen 5 5600X with 32 GB of RAM, and a Radeon RX 5700 XT (although I’m told that GPU doesn’t provide any significant boost for state-space models).

Right now I’m clocking about 4.5 GHz on the CPU.

Thanks,
Roy

If you have larger models that hit memory in a random-access pattern, as in large mixed-effects models (especially spatio-temporal ones), and you’re running four or more parallel chains, the bottleneck is often memory bandwidth and cache. In these situations, ARM can be three or four times faster even with slower CPUs.

Had my buddy run my model on his M2 MacBook. He ran it in 41 minutes to my 180. Apple Beets Battlestar Galactica.

-Roy

You mean because the observations are not sorted, so the needed coefficients have to be retrieved out of order?

We could actually optimize that with constant indices / observations.

Yes, it can be optimized, but that’s hard with multiple dimensions. For example, in the model we fit for UK Covid prevalence, we were modeling local authorities of about 200K people each, so there were about 400 of them of varying sizes. Then there are daily observations for a year, so you’re looking at 365 × 400 random effects. For autodiff, that’s pretty big.
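
To make the indexing issue concrete, here’s a minimal Stan sketch of the access pattern in question (the names are made up for illustration, not taken from the Covid model): the gather `alpha[group]` is what jumps around memory when `group` is unsorted, and pre-sorting the observations by group turns those reads, and the corresponding adjoint writes in the reverse pass, into sequential scans.

```stan
data {
  int<lower=1> N;                        // observations
  int<lower=1> J;                        // groups, e.g. local authorities
  array[N] int<lower=1, upper=J> group;  // group index per observation
  vector[N] y;
}
parameters {
  vector[J] alpha;                       // per-group random effects
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 1);
  sigma ~ normal(0, 1);
  // when group is unsorted, this gather retrieves alpha out of order;
  // sorting the data so equal indices are adjacent makes it sequential
  y ~ normal(alpha[group], sigma);
}
```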

There’s also the issue of the autodiff expression graph for reverse mode, which is very hard to keep memory-local because of the way distributions index their arguments. What we’ve found is that what the autodiff folks call “checkpointing” and the programming-language folks call “partial evaluation” can be super useful for memory locality in autodiff. It basically finds a complete subgraph of the final expression graph and evaluates autodiff locally for that subgraph, which you can do in Stan’s autodiff by stacking; you might be able to do that in PyTensor, too. We did this for parallelism in a map operation (like vmap in JAX, but not nearly as convenient), then were surprised to see that it sped up models even using just one CPU. The only explanation we have for that behavior is memory locality.
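
To show the shape of that map operation, here’s a hedged sketch using Stan’s `map_rect`; this is not the actual prevalence model, and the packing scheme and names are invented for illustration. Each shard’s reverse-mode subgraph is evaluated locally with nested autodiff, and only the shard’s log density and its gradient contribution are spliced back into the main expression graph.

```stan
functions {
  // log density of one shard: autodiff for this subgraph runs locally,
  // and only the resulting value and gradient rejoin the main graph
  vector shard_lp(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    int n = x_i[1];  // shard size, packed as integer data
    real lp = normal_lpdf(to_vector(x_r[1:n]) | theta[1], exp(phi[1]));
    return [lp]';
  }
}
data {
  int<lower=1> K;        // number of shards
  int<lower=1> n;        // observations per shard (equal-sized here)
  array[K, n] real x_r;  // outcomes, one row per shard
}
transformed data {
  array[K, 1] int x_i;
  for (k in 1:K) x_i[k, 1] = n;
}
parameters {
  vector[1] phi;             // shared parameter: log sigma
  array[K] vector[1] theta;  // per-shard random effect
}
model {
  phi ~ normal(0, 1);
  for (k in 1:K) theta[k] ~ normal(0, 1);
  // sum of per-shard log densities; shards run in parallel under
  // STAN_THREADS, but the locality benefit applies even on one core
  target += sum(map_rect(shard_lp, phi, theta, x_r, x_i));
}
```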


Quick update: I rented an Apple machine with an M4 and 16 GB of RAM, and it brought the sampling time down to 25 minutes. Wild stuff.
