That’s very creative!
You’re right about where most of the speedup comes from. I suspect the effect will be even more pronounced in HMC, where the gradient is also computed in one shot (as above!). I actually wrote this code while working on unbiased MCMC with couplings, where I also got a speedup from reusing a Cholesky decomposition and a .solve.
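For anyone curious what reusing the factorization can look like, here is a minimal sketch. It assumes a Gaussian target N(mu, Sigma) whose covariance `Sigma` stays fixed and is shared by both chains of a coupled pair. `make_gaussian_target` is a hypothetical helper for illustration, not the code discussed above. The Cholesky factor and log-determinant are computed once, and every later log-density or gradient evaluation for a whole batch of chains costs only triangular solves against the cached factor:

```python
import numpy as np


def make_gaussian_target(mu, Sigma):
    """Hypothetical helper: assumes a Gaussian target N(mu, Sigma) with Sigma fixed.

    Factors Sigma once and returns log-density and gradient functions that reuse
    the factor. Both accept a batch of states of shape (n_chains, d), e.g. the two
    chains of a coupled MCMC pair, and handle all chains in one solve.
    """
    d = mu.shape[0]
    L = np.linalg.cholesky(Sigma)                # Sigma = L @ L.T, computed once
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log|Sigma| read off the factor
    const = -0.5 * (d * np.log(2.0 * np.pi) + log_det)

    def whiten(X):
        # Solve L z = (x - mu) for every chain at once: columns are chains.
        R = (X - mu).T                           # shape (d, n_chains)
        return np.linalg.solve(L, R)

    def log_density(X):
        Z = whiten(X)
        # (x - mu)^T Sigma^{-1} (x - mu) = ||L^{-1}(x - mu)||^2, one value per chain
        return const - 0.5 * np.sum(Z * Z, axis=0)

    def grad_log_density(X):
        # -Sigma^{-1}(x - mu) = -L^{-T} L^{-1}(x - mu), back to shape (n_chains, d)
        return -np.linalg.solve(L.T, whiten(X)).T

    return log_density, grad_log_density


# Usage: two coupled chains evaluated with a single factorization.
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 0.5]])
logp, grad = make_gaussian_target(mu, Sigma)
X = np.array([[0.1, -0.2, 0.3], [1.0, 0.5, -0.4]])
print(logp(X), grad(X))
```

This sketch uses `np.linalg.solve` for brevity, which does not exploit the triangular structure of `L`; `scipy.linalg.solve_triangular` or `scipy.linalg.cho_factor`/`cho_solve` would be the more efficient choice in practice.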