Issues with NUTS performance for large scale problems

I think a good thing to try is to set more meaningful mass matrix for NUTS. For example, you can try the custom mass matrix adaptation in Exoplanet: a toolkit for modeling of transit and/or radial velocity observations of exoplanets using PyMC3, or use initialization like advi+adapt_diag, or run once with smaller data, and use the posterior covariance matrix as the mass matrix to initialize the NUTS sampler.