We’ve found with Stan that the ARM chips in Apple Silicon Macs, with their unified memory, are very fast for sampling, on the order of 2–4 times faster than equivalently priced Intel/AMD-based Windows notebooks. And that’s not using the GPU at all, just CPU-based computation.
I’m using an AMD Ryzen 5600X with 32 GB of RAM and a Radeon 5700 XT (although I’m told that GPU doesn’t provide any significant boost for state-space models).
If you have larger models that hit memory in a random-access pattern, like large mixed-effects models, and especially spatio-temporal models, and you’re running four or more parallel chains, the bottleneck is often memory bandwidth and cache rather than raw CPU speed. In those situations, ARM can be three or four times faster even with slower CPUs.
Yes, it can be optimized, but that can be hard with multiple dimensions. For example, in the model we fit for UK Covid prevalence, we were modeling local authorities of roughly 200K people each, so there were about 400 of them of varying sizes. Then there are daily observations for a year, so you’re looking at 365 × 400 random effects. For autodiff, that’s pretty big.
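To make the scale concrete, here is a minimal Stan sketch of that kind of structure. The data names, priors, and binomial-logit likelihood are illustrative assumptions rather than the actual UK prevalence model; the point is the roughly 400 × 365 block of area-by-day random effects.

```stan
// Minimal sketch of the random-effects structure described above.
// Data names, priors, and likelihood are illustrative assumptions,
// not the actual UK prevalence model.
data {
  int<lower=1> N_area;                      // ~400 local authorities
  int<lower=1> N_day;                       // ~365 days
  array[N_area, N_day] int<lower=0> tests;  // tests per area-day
  array[N_area, N_day] int<lower=0> pos;    // positives per area-day
}
parameters {
  real mu;
  real<lower=0> sigma;
  matrix[N_area, N_day] alpha;              // ~400 x 365 = 146,000 random effects
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 1);
  // Exchangeable prior for illustration; the real model is spatio-temporal.
  to_vector(alpha) ~ normal(mu, sigma);
  for (a in 1:N_area)
    pos[a] ~ binomial_logit(tests[a], alpha[a]');
}
```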
There’s also the issue of the autodiff expression graph for reverse mode. That is very hard to keep memory-local because of the way distributions index their arguments. What we’ve found is that what the autodiff folks call “checkpointing” and the programming-language folks call “partial evaluation” can be super useful for memory locality in autodiff. It basically finds a complete subgraph of the full graph and evaluates autodiff locally for that, which you can do in Stan’s autodiff by stacking; you might be able to do that in PyTensor, too. We did this for parallelism in a map operation (like vmap in JAX, but not nearly as convenient), then were surprised to see that it sped up models even when using just one CPU. The only explanation we have for that behavior is memory locality.
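For a sense of what that looks like at the Stan language level, here is a hedged sketch of a user-level program using map_rect, Stan’s map operation (not the autodiff internals); the shard packing and the area_lp function are my own assumptions.

```stan
functions {
  // Per-area log density; phi = (mu, sigma) shared across shards.
  vector area_lp(vector phi, vector alpha_a,
                 array[] real x_r, array[] int x_i) {
    int N_day = num_elements(alpha_a);
    array[N_day] int pos_a = x_i[1:N_day];
    array[N_day] int tests_a = x_i[(N_day + 1):(2 * N_day)];
    real lp = normal_lpdf(alpha_a | phi[1], phi[2])
              + binomial_logit_lpmf(pos_a | tests_a, alpha_a);
    return [lp]';
  }
}
data {
  int<lower=1> N_area;
  int<lower=1> N_day;
  array[N_area, N_day] int<lower=0> tests;
  array[N_area, N_day] int<lower=0> pos;
}
transformed data {
  // Pack each area's integer data into one shard: positives, then tests.
  array[N_area, 2 * N_day] int x_i;
  array[N_area, 0] real x_r;                // no real-valued shard data
  for (a in 1:N_area) {
    x_i[a, 1:N_day] = pos[a];
    x_i[a, (N_day + 1):(2 * N_day)] = tests[a];
  }
}
parameters {
  real mu;
  real<lower=0> sigma;
  array[N_area] vector[N_day] alpha;        // one shard of effects per area
}
model {
  mu ~ normal(0, 2);
  sigma ~ normal(0, 1);
  // One shard per area: each shard's gradient is computed on its own
  // small expression graph rather than one monolithic graph.
  target += sum(map_rect(area_lp, [mu, sigma]', alpha, x_r, x_i));
}
```

Each shard’s log density is evaluated with its own nested autodiff pass, so the reverse sweep for one area only touches that area’s small subgraph instead of the full expression graph, which is where the memory-locality win, even on a single CPU, seems to come from.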