Expected Results with 90% Divergences - Hierarchical Bayesian Issue

Hi There!

I am trying to build a hierarchical Bayesian model to infer the distribution shape of e_model from a number of observed data arrays [observed_r, std_r, observed_c, std_c, observed_ars, N], where N is the length of the other arrays. My code works completely fine and as expected (both with simulated data and real data) when the Potential function “weight” is excluded, but produces upwards of 90% divergences when it is included, despite still outputting the desired result (just with an extremely low effective sample size). It is very similar to this thread, which ended with an actual change to the code, so I am not sure whether the same is needed here (Good results but many divergent samples: what, me worry?)

Here is the full model (note I am using pymc v5.10.3):

def calc_implied_c(e_model, w_model, r_model):
    g = (1 + e_model * pt.sin(w_model)) / pt.sqrt(1 - e_model**2)
    return r_model * g**3

def prob_weighting(e_model, w_model, ars_model):
    # penalize samples that violate the geometric constraint
    constraint = e_model < (1 - 1 / ars_model)
    weighting = pm.math.switch(
        constraint,
        1 / ars_model * ((1 + e_model * pt.sin(w_model)) / (1 - e_model**2)),
        -np.inf,
    )
    return weighting

def log_likelihood(observed_r, std_r, observed_c, std_c, observed_ars, N):
    with pm.Model() as model:

        alpha_e = pm.Uniform('alpha', 0, 10, initval=2)
        beta_e = pm.Uniform('beta', 0,10, initval=5)

        e_model = pm.Beta('e_model', alpha=alpha_e, beta=beta_e, shape=N)
        w_model = pm.Uniform('w_model', lower=0, upper=2*np.pi, shape=N)
        weight = pm.Potential('weight', prob_weighting(e_model, w_model, observed_ars))

        log_r = pm.Normal('log_r', np.log(observed_r), std_r/observed_r, shape=N)
        log_c = pm.Deterministic('log_c', pt.log(calc_implied_c(e_model, w_model, pt.exp(log_r))))

        log_obs_c = pm.Normal('log_obs_c', mu=log_c, sigma=std_c/observed_c, observed=np.log(observed_c), shape=N)
        idata = pm.sample(10000, tune=40000, chains=4, cores=4, target_accept=0.99, return_inferencedata=True)

    return idata

I also experimented with the penalty applied when the constraint is not met, and found that my code produces no divergences when the fail value is above -1e3 (i.e. -9.999e2 and higher are fine, though obviously not what I want mathematically), while any penalty of -1e3 or below results in mass divergences.

I have debugged this for countless hours: the parallel plots gave me nothing useful, I have switched the implementation between pytensor ops, pymc.math, and np.inf, and I have reparameterised the variables (including moving the Potential function into log space), but nothing works. Interestingly, even a very simple setup such as the following:

with pm.Model() as model:
    x = pm.Normal('x', mu=0., sigma=1.)

    pm.Potential('obs_x', pt.switch(x < 0., -pt.inf, 0.))
    idata = pm.sample()

results in near 100% divergences, regardless of whether the penalty is -pt.inf, -np.inf, or any finite value down to -1e3.

I am not extremely familiar with pymc, so if anyone knows why this is occurring, or what the special boundary between -9.9999…e2 and -1e3 is for the Potential function, please let me know!


Hi Tyfair, that is the expected behavior when using boundary conditions and potentials.

Using a switch condition to trigger a penalty introduces a discontinuity into the posterior surface. It’s like a steep, abrupt drop off a cliff. We can visualize the drop-off with a simple setup.

with pm.Model() as divergence_demo_1:
    a = pm.Beta('a',1,1)

    # if a is less than 0.5, no problem
    # if a is greater than 0.5, medium sized problem

    pm.Potential('discontinuity',pm.math.log(pm.math.switch(pm.math.lt(a,0.5), 1, 0.1)))

    trace = pm.sample(chains=1)

Discontinuities can be tough on gradient-based samplers like NUTS. The sampler evaluates the posterior at intermediate locations between each sample, and if the value at a new location departs widely from what you would expect from following the gradients, that triggers a divergence. So if the value of the posterior is -np.inf, the jump is always too surprising and always triggers a divergence.
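To make that concrete, here is a toy sketch (plain Python, not PyMC’s actual integrator) of a single leapfrog step crossing a -1e3 cliff: because the switch contributes no gradient, the integrator gets no warning, and the step picks up an energy error of roughly 1e3, the full size of the penalty.

```python
def logp(x, penalty=-1e3):
    # standard-normal log-density plus a hard cliff at x < 0,
    # mimicking pm.Potential('obs_x', pt.switch(x < 0., penalty, 0.))
    return -0.5 * x**2 + (penalty if x < 0 else 0.0)

def grad_logp(x):
    # the switch contributes zero gradient almost everywhere,
    # so the integrator only ever sees the smooth Gaussian part
    return -x

def leapfrog(x, p, eps):
    # one leapfrog step of Hamiltonian dynamics
    p = p + 0.5 * eps * grad_logp(x)
    x = x + eps * p
    p = p + 0.5 * eps * grad_logp(x)
    return x, p

def energy(x, p):
    # Hamiltonian: potential energy (-logp) plus kinetic energy
    return -logp(x) + 0.5 * p**2

# start just to the right of the cliff, moving left across it
x0, p0 = 0.05, -1.0
x1, p1 = leapfrog(x0, p0, eps=0.2)
delta_energy = energy(x1, p1) - energy(x0, p0)
# delta_energy is ~1e3: the energy error is the size of the penalty
```

That ~1e3 energy error is what gets compared against the sampler’s divergence threshold on every step that crosses the cliff.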

The reason you find such unusual behaviour, in particular the boundary between -9.99e2 and -1e3, is that NUTS comes with a hard-coded threshold for how surprising a jump has to be before it counts as a divergence (in PyMC’s pm.NUTS this is the Emax argument, which defaults to 1000). It just so happens that the energy jump produced by a -9.99e2 penalty falls under that threshold while a -1e3 penalty does not.


Hi Daniel, Thank you for the clarification! That all makes sense.

I had a feeling the steep gradient might be the cause. I have tried the model with a penalty slightly above the boundary and it has performed well.

