Ideas for reparameterizing models/changing priors to avoid divergences

Hey everyone

I have been looking into solving problems related to divergences in my model and reading different resources like Diagnosing Biased Inference with Divergences; Why hierarchical models are awesome, tricky, and Bayesian; and Cookbook — Bayesian Modelling with PyMC3.

I wanted to ask if defining the prior as the logarithm of its possible values can help NUTS, since priors with large values would then be on the same scale, or closer to the same scale, as priors with smaller values. As an example, take the slope and intercept of a linear function (I am using uniform priors because it is easy to translate the limits to the log scale):

from pymc3 import Model, HalfCauchy, Uniform, Normal, sample
import theano.tensor as tt

with Model() as linmodel:
    sigma = HalfCauchy('sigma', beta=10, testval=1.)
    # sample the intercept on the log scale, so its range (log 100 to log 200)
    # is comparable to the slope's range (2 to 4)
    logintercept = Uniform('logintercept', lower=tt.log(100.0), upper=tt.log(200.0))
    slope = Uniform('slope', lower=2.0, upper=4.0)

    likelihood = Normal('y', mu=tt.exp(logintercept) + slope * x,
                        sigma=sigma, observed=y)

    trace = sample(1000)

Does such a change actually help the sampler? Even though the parameters are closer to the same scale, isn't the sampler then more sensitive to small changes in the log-intercept?
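For some intuition on the scale argument, here is a quick NumPy check (numbers taken from the priors above) comparing how wide each prior is on its own scale:

```python
import numpy as np

# Width of the intercept prior on the raw scale vs. the log scale,
# compared with the width of the slope prior (values from the model above).
raw_intercept_width = 200.0 - 100.0                  # 100.0
log_intercept_width = np.log(200.0) - np.log(100.0)  # log(2) ~ 0.693
slope_width = 4.0 - 2.0                              # 2.0

# Raw scale: the intercept range is 50x wider than the slope range.
# Log scale: the two ranges are within an order of magnitude of each other.
print(raw_intercept_width / slope_width)  # 50.0
print(log_intercept_width / slope_width)  # ~0.35
```

So the log transform does shrink the scale mismatch considerably, at least for these particular limits.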

Another question I have is related to hierarchical models. In the examples I have seen, like the ones linked above, the parameters are centered on 0, but what if they are not? Can we just specify a non-centered model around another mu value? As an example, if mu=5, could we write the Eight Schools model as

import pymc3 as pm

with pm.Model() as NonCentered_eight:
    mu = pm.Normal('mu', mu=5, sigma=5)
    tau = pm.HalfCauchy('tau', beta=5)
    theta_tilde = pm.Normal('theta_t', mu=5, sigma=1, shape=J)
    theta = pm.Deterministic('theta', mu + tau * theta_tilde)
    obs = pm.Normal('obs', mu=theta, sigma=sigma, observed=y)

If anyone has any tips or sources for examples like these, it would be greatly appreciated.

Hi Bob!
I can’t answer your first question, but it’s interesting; I’m curious about the answer too!

As for your second question, I’m not sure, but I don’t think you can do that: the point of the non-centered parametrization is to get the average effect (mu) out of the prior. tau * theta_tilde must then just be a deviation from the baseline.
But if tau * theta_tilde also contains a baseline value, then the model will be overparametrized in my opinion – an infinite number of different values of mu and tau * theta_tilde will produce the same sum mu + tau * theta_tilde.
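To make the overparameterization concrete, here is a small NumPy sketch (illustrative numbers only) showing two different (mu, theta_tilde) pairs that produce exactly the same theta:

```python
import numpy as np

tau = 2.0

# Two different decompositions of the same effects:
mu_a, theta_tilde_a = 5.0, np.array([1.0, -1.0])  # baseline 5, deviations +-2
mu_b, theta_tilde_b = 7.0, np.array([0.0, -2.0])  # baseline 7, shifted deviations

theta_a = mu_a + tau * theta_tilde_a
theta_b = mu_b + tau * theta_tilde_b

print(theta_a)  # [7. 3.]
print(theta_b)  # [7. 3.]
```

The likelihood only sees theta, so it cannot distinguish between these decompositions; that is the non-identifiability.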

I see what you are saying: so if we wanted to use a hierarchical model, we would have to rewrite it so that the baseline value is removed beforehand.

Yeah, I think so, because the baseline is actually the average effect across clusters. Then you add the deviations, which are the cluster-specific effects.