Any suggestions on speeding up the model?

PyMC uses pytensor as a computational backend, which does lazy execution. Before you call pm.sample, nothing is computed; only a compute graph is constructed. This code will always execute quickly as a result – it’s totally independent of the complexity of the computation.

You can see here for a discussion about profiling. I’m pretty sure the bottleneck due to your scan, and I’m still curious why you’re implementing a cholesky decomposition by hand.