Gelman's (alpha + beta)^-5/2 parametrization for beta prior in AB test

Dear experts,

I built a hierarchical model for my AB test of click rates, using Gelman’s parametrization (alpha + beta)^-5/2 (see e.q. 5.9 in BDA) for my beta prior. In the book this parametrization seems to be justified with this statement

"In a problem such as this with a reasonably large amount of data, it is possible to set up a
‘noninformative’ hyperprior density that is dominated by the likelihood and yields a proper
posterior distribution."

where he refers to the rat tumor example.
In my case I only have two observations, one for option A and one for option B, so I am aware of this difference wrt the BDA example. Based on this I built the following model:

def logp_ab(value):
    return tt.log(tt.pow(tt.sum(value), -5/2))

with pm.Model() as model:
    ab = pm.HalfFlat('ab', shape=2, testval=np.asarray([1., 1.]))
    pm.Potential('p(a, b)', logp_ab(ab))
    alpha = pm.Deterministic('alpha', ab[0])
    beta = pm.Deterministic('beta', ab[1])
    theta = pm.Beta('theta', alpha=ab[0], beta=ab[1], shape=2) 
    p = pm.Binomial('y', p=theta, observed=[232, 194], n=[7890, 7795]) 

running this with

pm.sample(draws=8000, tune=5000, chains=4, return_inferencedata=True, cores=4)

yields 8777 divergences, which probably means reparametrizing my model (I tried increasing the target_accept rate and also increase the burn-ins but still gives divergences.
I looked at the distributions of my parameters w/o divergences and w/ divergences, see below, but this doesn’t tell me much to my beginner/untrained eye. Except maybe that the divergences are not clustered around some particular values, which I’d think means that a re-parametrization is required.
What options do I have for ‘weakly informative priors’? E.g. a gamma function for the variance of my beta prior? I am trying to find the ‘weakliest informative prior’ :slight_smile: for this case.

That’s somewhat of an ugly parametrization; using

x = \frac{a}{a+b}
and
y = \sqrt{a+b}

you can sample x and 1/y uniformly, and use the transformation

a = xy^2
b = y^2 - xy^2 = (1-x)y^2

to get an equivalent prior from uniform samples. See Hierarchical AB testing in PyMC3 - #2 by chartl

thank you @chartl ! That’s a great idea and helps, but I’m still getting divergences, also after restricting the range of x and y.

Hmm. Can you do a pairplot of x and 1/y (or however you drew the prior)?

Here it is, with the range [0, 0.1]
pairplot

and here when using [0, 1]
pairplot2

That looks about right. Have you tried slightly increasing target accept? I don’t get any divergences:


with pm.Model() as model:
    x = pm.Uniform('x', 0., 1.)
    yi = pm.Uniform('yi', 0., 1.)
    y = pm.Deterministic('y', 1./yi)
    a = pm.Deterministic('a', x/yi**2)
    b = pm.Deterministic('b', (1-x)/yi**2)
    theta = pm.Beta('theta', alpha=a, beta=a, shape=2) 
    p = pm.Binomial('yobs', p=theta, observed=[232, 194], n=[7890, 7795]) 
    
    tr = pm.sample(500, tune=1000, chains=6, cores=3, nuts_kwargs=dict(target_accept=0.9))

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (6 chains in 3 jobs)
NUTS: [theta, yi, x]
Sampling 6 chains, 0 divergences: 100%|██████████| 9000/9000 [00:04<00:00, 1848.89draws/s]

image

image

thank you for your quick reply and cross checks @chartl! I forgot this in my model:

🤦
Now it runs w/o divergences! However, I’m still not sure why this works: your parametrization has certainly better interpretability, but at the end shouldn’t they both give the same results?
Perhaps it’s related to the obfuscating (at least to me) pm.Potential?
Either way, thanks a ton for your help!

Yes, the likelihoods in terms of (\alpha, \beta) are proportional, so the posteriors should match. The issue is that P(x) \propto x^{-\alpha} is Pareto, and doesn’t have support for values of x < 1 (or some other positive constant), yet it is totally plausible for \alpha + \beta < 1 in a Beta distribution. You can use pm.Bound as in:

ab = pm.Bound(pm.HalfFlat, lower=1)('ab', shape=2, testval=np.asarray([2., 2.]))
pm.Potential('p(a, b)', logp_ab(ab))

and that will help somewhat (and it also makes sense to want \alpha + \beta > 1).

I wound up getting some divergences myself even with the uniform parameterization, which can be dealt with in a similar manner by taking y = pm.Deterministic('y', 1./(yi + eps)). Ultimately you have to do something to make this prior integrable.