Implementing rounding (by manual integration) more efficiently

The truncation can be a problem but not necessarily. I see energy plots like often with models I can’t use the default sampler on. Basically the blue curve is the distribution of total Hamiltonian system energy at each sample (kinetic + potential , where kinetic is the momentum sampled from a normal and potential is the logp). Since Kinetic is a normal, the truncation can only result from the logp term. In general these should be transformed to an unconstrained space during sampling so you shouldn’t see this, but it looks like in your case no such transformation occurs. I’m not sure if you’re allowed to manually specify a transformation to a pm.CustomDist, but it might be worth a try if so. Even if you are, I’m not sure what transformation would be appropriate in your case…

Basically when the sampler spends time in those low energy states in the left tail of the distribution it will be doing inefficient exploration.

I have no idea if this is the source of your problem or not, though. All the other diagnostics look fine, so the bottleneck might really be computational. You could try timing the logp and dlogp functions (use model.compile_logp and .compile_dlogp. If dlogp is really the bottleneck you could try a gradient-free sampler like SMC?