Just to copy the language of the paper here, the typical Bayesian model
x_i | \beta \mathop{\sim}_{\mathrm{i.i.d.}} p(x_i|\beta) \;\;\; \beta \sim p(\beta | \alpha)
is replaced with a hierarchical model
x_i | \beta_i \mathop{\sim} p(x_i | \beta_i) \;\;\; \beta_i \mathop{\sim}_{\mathrm{i.i.d.}} p(\beta_i | \alpha)
In effect this assigns one parameter per data point, which I would expect to generate a lot of divergences. Ideally you would choose p(\beta_i|\alpha) so that the \beta_i can be marginalized analytically, i.e. so that
p(x_i|\alpha) = \int p(x_i|\beta_i)p(\beta_i|\alpha)d\beta_i
is closed-form. Given that section 3.1 is a brief explanation of conjugate exponential families, it seems the authors expect practitioners to either (i) judiciously choose priors to make the above integral have an analytic solution; or (ii) apply variational inference [sections 4+5].
Based on my reading of the above, it appears that you have \beta_i parametrizing the scale \sigma_i of a truncated normal, and p(\beta_i | \alpha) as \mathrm{HalfCauchy}(\tau) for some fixed \tau. Given that you only have a single point from which to estimate each scale, the model would appear to be barely determined, and the HalfCauchy tail is not helping.
I suspect that HalfNormal variables, particularly those realized near 0, aren't particularly informative about the underlying \sigma, since the density at 0 satisfies p(0|\sigma) \propto \frac{1}{\sigma}. Even though \sigma=10 and \sigma=20 differ substantially, the density of a point near 0 isn't all that different (only about 2:1), so most points won't be particularly informative about the value of \sigma_i.
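To make that concrete, here's a quick check with scipy (my own snippet, not part of your model):

from scipy import stats

# Half-normal density of a draw near 0 under two quite different scales:
# the ratio is only about 2:1, so a single small draw barely constrains sigma_i.
for sigma in (10.0, 20.0):
    print(sigma, stats.halfnorm(scale=sigma).pdf(0.5))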
I suspect the most effective way to sample from a local version of this truncated normal model (which may not match the prior you'd most like to use!) would be to parametrize the \sigma_i on the log scale, and (since we are sampling anyway) to put a prior on \alpha. This lets the sampler work with a constraining, light-tailed normal on the log scale, while the implied prior on \sigma_i is the heavy-tailed lognormal.
Notably, while I get divergences with HalfCauchy and HalfStudentT(3) priors on the \sigma_i, this parameterization worked well for me:
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as local_reparam:
    # Population log-scale, held fixed at log(alpha) here; you could also give it a prior.
    scale_log = np.log(alpha)
    scale_offset_sigma = pm.HalfNormal('scale_offset_sigma', 1)
    # Per-observation offsets on the log scale (light-tailed for the sampler).
    scale_offset_log = pm.Normal('scale_offset_log', mu=0, sigma=scale_offset_sigma, shape=N)
    # Implied per-observation scales: lognormal around alpha.
    scale = pm.Deterministic('sigma_i', tt.exp(scale_log + scale_offset_log))
    values = pm.HalfNormal('x_i', scale, observed=dataset_realizations['x_i'][0, :])
    local_repar_tr = pm.sample(800, tune=1500, chains=8, cores=2,
                               nuts_kwargs=dict(target_accept=0.9))

pm.traceplot(local_repar_tr, ['scale_offset_sigma', 'sigma_i'],
             coords={'sigma_i_dim_0': range(5)});
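In case it's useful, this is how I confirmed it beyond eyeballing the trace (standard pymc3 sampler stats, nothing model-specific):

# Count divergent transitions across all chains; with this parameterization
# I'd expect this to be zero or very close to it.
print('divergences:', local_repar_tr.get_sampler_stats('diverging').sum())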
And I just used pymc3 to generate the data:
alpha = 0.5
N = 80
N_data = 10

# generate N_data synthetic datasets from the heavy-tailed hierarchical model
with pm.Model() as mod:
    # one scale per observation, drawn from a heavy-tailed prior around alpha
    scales = pm.HalfStudentT('sigma_i', nu=3, sigma=alpha, shape=(N,))
    values = pm.HalfNormal('x_i', scales, shape=(N,))
    dataset_realizations = pm.sample_prior_predictive(N_data)
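For reference, the direct (non-reparameterized) version that gave me the divergences mentioned above looked roughly like this (a sketch; reusing alpha as the HalfCauchy scale is my choice for this experiment, not necessarily yours):

# HalfCauchy prior placed straight on sigma_i, one observation per scale.
# This is the variant that produced divergences for me.
with pm.Model() as local_direct:
    scales = pm.HalfCauchy('sigma_i', beta=alpha, shape=N)
    values = pm.HalfNormal('x_i', scales,
                           observed=dataset_realizations['x_i'][0, :])
    local_direct_tr = pm.sample(800, tune=1500, chains=8, cores=2,
                                nuts_kwargs=dict(target_accept=0.9))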