NUTS speeds issue in model?

Your input matrix has a really large condition number, which means that many of the columns are likely redundant:

np.linalg.cond(train_norm_retained) # ==> 7.768470541187648e+17

You can try some dimension reduction trick first on train_norm_retained (for example, a PCA or SVD).