Beta Regression in PyMC3

I am trying to fit a regression model in PyMC3 to estimate a percentage, 'sleepEf' (scaled to the interval (0, 1)), from a single predictor variable, 'tst'. Since a percentage is bounded, I believe I should use a beta-distributed outcome variable with a logit transform. Here is my model:

import pymc3 as pm

with pm.Model() as pooled_model:
    b_intr = pm.Normal('b_intr', mu=0.8, sd=100 ** 2)
    b_tst = pm.Normal('b_tst', mu=0.0, sd=100 ** 2)

    model_err = pm.HalfNormal('model_err', sd=50)

    # Expected value (inverse logit keeps the mean in (0, 1))
    y_est = pm.math.invlogit(b_intr + b_tst * data['tst'])

    # Data likelihood
    y_like = pm.Beta('y_like', mu=y_est, sd=model_err, observed=data['sleepEf'])

I have made sure that the sleepEf variable is never exactly 0 or 1. Here is a histogram of it:
[Histogram of sleepEf]

When I try to find the MAP estimate and then sample, I get errors:

import scipy as sp
import scipy.optimize  # makes sp.optimize available

with pooled_model:
    start = pm.find_MAP(fmin=sp.optimize.fmin_powell)
    hierarchical_trace = pm.sample(2000, step=pm.Metropolis(), start=start, tune=1000)

ValueError: Optimization error: max, logp or dlogp at max have
non-finite values. Some values may be outside of distribution support.
max: {'b_intr': array(3.3879289615458417), 'b_tst':
array(2.587928961545842)} logp: array(-inf) dlogp: array([
-2.58792896e-08, -2.58792896e-08]) Check that 1) you don't have
hierarchical parameters, these will lead to points with infinite
density. 2) your distribution logp's are properly specified. Specific
issues:

Have I incorrectly specified the model?

Also note that if I do not run find_MAP before sampling, my parameter values never change from their starting values.

Thanks!

I have also asked this question here:

The problem is that model_err is too large: after the internal transform, the alpha and beta of the pm.Beta become negative, which is outside the distribution's support.

You can check the values of mu and sd by doing:

y_est.tag.test_value
model_err.tag.test_value

And transform the mu and sd to alpha and beta by doing:

kappa = mu * (1 - mu) / sd**2 - 1
alpha = mu * kappa
beta = (1 - mu) * kappa

This is what PyMC3 does internally.
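To see how that can fail, here is a plain NumPy-free sketch of the same mu/sd-to-alpha/beta conversion (the function name and the example values are made up for illustration, not PyMC3 internals):

```python
def beta_mu_sd_to_alpha_beta(mu, sd):
    # Convert a (mu, sd) parameterization of the beta distribution
    # into the standard (alpha, beta) shape parameters.
    kappa = mu * (1 - mu) / sd**2 - 1
    return mu * kappa, (1 - mu) * kappa

# With a large sd (the kind a HalfNormal(sd=50) prior easily produces),
# kappa goes negative, so alpha and beta fall outside the beta's support:
alpha, beta = beta_mu_sd_to_alpha_beta(mu=0.8, sd=5.0)
print(alpha, beta)  # both negative -> logp is -inf

# With a small sd, the shape parameters are valid:
alpha, beta = beta_mu_sd_to_alpha_beta(mu=0.8, sd=0.1)
print(alpha, beta)  # both positive
```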


Parameterizing the beta distribution via mu and sd is always a bit of a mess. It might work much better if you logit-transform your dataset instead, and then use a normal likelihood.
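For what it's worth, here is a minimal sketch of that idea outside PyMC3, using an ordinary least-squares fit on the logit scale as a stand-in for the normal likelihood (the data here are made up):

```python
import numpy as np
from scipy.special import logit, expit

# Hypothetical data: a predictor tst and a bounded outcome in (0, 1),
# generated from a linear model on the logit scale.
tst = np.array([5.0, 6.0, 6.5, 7.0, 7.5, 8.0])
sleep_ef = expit(-1.0 + 0.3 * tst)

# Logit-transform the bounded outcome, then fit a plain linear model
# (the analogue of a normal likelihood on the transformed scale).
y = logit(sleep_ef)
X = np.column_stack([np.ones_like(tst), tst])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # recovers roughly [-1.0, 0.3]

# Predictions come back to the (0, 1) scale via the inverse logit.
pred = expit(X @ coef)
```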


Thanks guys! I got it to work with a smaller sd.

@junpenglao, maybe there should be something in the documentation on the limits of sd (currently it only states sd > 0). Is this the kind of thing I would open a GitHub issue for?

@aseyboldt, I think I'm going to stick with the beta regression, because the units of my data are readily interpretable and it's working now. However, I'm curious: I believe log-transforming the data would solve the bounds issue on the mean, but would the model error be accurate with Normal(mu > 0.8 or mu < 0.2, sd=model_err), i.e. when mu gets close to the bounds?


Yeah: for a beta distribution with mean mu, the variance must be smaller than mu * (1 - mu). A PR to edit the docstring, and also a better check for this, would be great.
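A quick numeric check of that constraint (the values here are illustrative, not from the model above):

```python
# For a beta distribution with mean mu, the variance sd**2 must
# satisfy sd**2 < mu * (1 - mu), otherwise alpha/beta go negative.
mu, sd = 0.8, 5.0  # the kind of sd a HalfNormal(sd=50) prior can produce
print(sd**2 < mu * (1 - mu))  # False: invalid parameterization

mu, sd = 0.8, 0.1
print(sd**2 < mu * (1 - mu))  # True: valid
```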


PR is here: https://github.com/pymc-devs/pymc3/pull/2534
I just edited the docstring.


@bdyetton Thanks for the PR!

Just for clarification: I didn’t mean a log transform, but a logit transform.
I'm not sure what "accurate" means in this context; a logit-normal model is a bit different from the beta distribution, but unless you have a good reason why the beta would somehow be the right one, I don't see a clear advantage to using it. They are in fact quite similar: a normal distribution and the logit of a beta aren't that different, though the beta has somewhat longer tails. But the interpretation of the mean and standard deviation is, I think, a bit more natural in the logit-normal case.


Just chipping in with a “hooray!!!” for a PR to PyMC3, @bdyetton!
