Problems in defining weakly informative priors for a categorical multilevel mixture model

mercer · January 30, 2025, 10:10pm

I am trying to write the neurobiological model of this study with PyMC. In the original formulation some priors and hyperpriors of the model were defined as Half-Cauchy(0,2), but the model was then developed by the researchers using rjags (here is the JAGS model), and since in JAGS there are no Half-Cauchy these priors and hyperpriors were defined as dscaled.gamma(2,1) T(1E-9,). In my PyMC implementation for these priors i tried using pymc.HalfCauchy(beta=2) and pymc.Truncated(dist=pymc.HalfCauchy.dist(beta=2), lower=1e-9) since there is the possibility, but in both cases when i try to use pymc.HalfCauchy in the model the sampler get to 75%-80% in a few hours and then gets stuck and freezes showing no estimated remaining time and showing no progress even after a day.
Online i couldnt find much about the dscaled.gamma, the only thing i found is a brief explanation in the jags manual pg. 61 which states:
“The scaled gamma distribution with scale s and degrees of freedom df is defined by
\tau \sim dscaled.gamma(s, df)
where \tau (tau) is a positive scalar-valued random variable.
If \tau is the precision parameter of the normal distribution, then dscaled.gamma implicitly
defines a prior on the standard deviation \sigma = \frac{1}{ \sqrt{\tau}) }: \sigma \sim st^{+}_{df}
where t^{+}_{df} represents the t-distribution with df degrees of freedom restricted to positive values. Figure 10.1 illustrates the half-t distribution with 2 degrees of freedom. The density is flat for very small values (\sigma << S) and has a long flat tail for large values (\sigma >> S). These features protect against mis-specification of the prior scale S. Choosing a large S is harmless as it simply makes the prior flatter, whereas if S is too small then the long flat tail ensures that sufficient data will dominate the prior. Gelman [2006] calls this a “weakly informative” prior. For comparison, figure 10.1 also shows the density on \sigma in the limit as the degrees of freedom df increases to infinity. The distribution then becomes a half-normal with standard deviation S. The effect of increasing the degrees of freedom is to diminish the weight of the tail.”
Given the problems i am having with the Half-Cauchy, i was thinking about defining the priors by replicating the dscaled.gamma(2,1) T(1E-9,) behaviour in PyMC, and considering the above documentation i was thinking about doing something like this:

        tdist = pm.HalfStudentT("tdist", nu=1)
        tau = pm.Deterministic("tau", 1 / ((2 * tdist)**2))

still i dont know how i could truncate since the pymc.HalfStudentT doesnt have a logcdf and i cant truncate a deterministic node.
Any help or suggestion would be really appreciated, and if needed i can share the full code of the model.

jessegrabowski · January 31, 2025, 3:49am

Are you are wanting to truncate to prevent the Gamma RV from being precisely zero? In that case, would it not suffice to simply add 1e-9 to it? But Gamma(2 ,1) already rules out zero, so I’m confused about what your ultimate objective is.

mercer · January 31, 2025, 12:05pm

Sorry i know the post is confusing, what i am trying to do is finding a way to define in PyMC the weakly informative priors for the model explained in this study, and my ultimate goal would be to achieve similar results to the study’s JAGS implementation of the model.
I’m not using gamma(2,1) myself, the dscaled.gamma(2,1) T(1E-9,) was used by the researchers to define the forementioned weakly informative priors in the JAGS implementation (that you can see here) of the study’s model im trying to implement in PyMC.
Also, i’m not familiar with JAGS at all, but based on what i found in the JAGS manual (see original post) the dscaled.gamma doesnt behave like a Gamma, (Gamma in JAGS would be: dgamma not dscaled.gamma).

jessegrabowski · January 31, 2025, 12:26pm

You can create new random variables from manipulations of existing random variables using pm.CustomDist. For example:

import pymc as pm
import arviz as az

def my_dist(nu, shape=None):
    t = pm.HalfStudentT.dist(nu=nu, shape=shape)
    return pm.math.clip(1 / (2 * t) ** 2, 1e-9, np.inf)

with pm.Model() as m:
    sigma = pm.CustomDist('sigma', 2, dist=my_dist)
    prior = pm.sample_prior_predictive()
    
az.plot_posterior(prior.prior)

Generates this absolute beauty:

mercer · January 31, 2025, 2:57pm

This looks way cleaner than my code in the op, but, since the logcdf for HalfStudentT is not implemented, as soon as i try to sample the posterior still raises a NotImplementedError in measurable_transform_logcdf probably caused by the pm.math.clip.

jessegrabowski · January 31, 2025, 3:19pm

Just add 1e-9 instead of clip

Mauro · February 4, 2025, 12:59pm

if the problem is only to avoid the lowest value being 0, you can also use epsilon:

>>> import sys
>>> epsilon = sys.float_info.epsilon
>>> print(0 + epsilon)
2.220446049250313e-16

bob-carpenter · February 6, 2025, 6:49pm

If you have probability mass piling up at zero, it’s saying something about your data’s compatibility with your model (prior and likelihood). You can sweep it under the rug by adding 1e-9 or clipping (which aren’t quite the same density, but very close, and the addition is much more numerically stable). Or you could try to get to the bottom of why the data’s consistent with a value of zero for that parameter. Usually it’s because you have a partial pooling model and the data’s compatible with complete pooling. Depending on how much probability mass is piled up near zero, you might want to just switch to a complete pooling model in those situations.

The choice of a half normal or half Cauchy is intentionally allowing positive density at a value of zero in order to allow for complete pooling. But it can be a computational nightmare due to underflow in floating point caclulations and because of the unconstrained geometry in HMC after you transform the positive-constrained paraemter. A lognormal goes to zero at zero, as does an inverse gamma. The gamma can go either way depending on the paramers chosen.

A Cauchy(0, 2) can be deceptive because of the wide tails. The central 99% interval is roughly (0.03, 127). Unless you think values of 0.01 and 100 are plausible here, I’d recommend a weakly informative prior (i.e., one determining the rough scale of the answer). Otherwise, unless you have a lot of data, the prior will push the posterior into the tails. Tightening the prior can also help avoid posterior mass piling up near zero and at implausibly high values.

jessegrabowski · February 6, 2025, 6:53pm

I think the OP was hell-bent on implementing exactly the JAGS model, so I avoided suggesting he do something else. But in general I 100% agree a different prior would be better. My impression is that many papers choose certain priors because they are convenient for the sampler the authors were working with (often Gibbs). The same considerations that lead them to choose these priors are often either not releveant or actively harmful when switching to HMC.

Topic		Replies	Views
Selecting prior distributions for Cauchy as likelihood distibution Questions prior	5	1021	January 11, 2022
How to chose beta for Half Cauchy distribution? Questions	4	2901	November 8, 2019
Hierarchical logistic regression giving non-sensible results? Do I formulate it correctly? Questions	15	844	August 4, 2020
Error with sample_posterior_predictive: "Cannot sample from half_flat variable" v5	3	29	June 3, 2025
GLM with a truncated Gamma distribution v5 linear_model , modeling	9	756	September 22, 2023

Problems in defining weakly informative priors for a categorical multilevel mixture model

Related topics