# Beta or sigmoid regression that can handle 0 and 1

Hi folks,
If this isn’t pymc3-focused enough, let me know and I’ll remove it.

I’d like to fit a sigmoid-shaped curve to a set of values in the range [0,1]. Most beta regression or sigmoid fitting techniques for data between 0 and 1 rely on linearizing followed by linear regression, for example converting probabilities in the range (0,1) to log odds using `logodds = log10(prob / (1-prob))`. That requires a ‘fudge factor’ to deal with values that sit on exactly 0 or 1, since the conversion is undefined there. The same problem arises with beta regression, since the beta distribution is not defined at 0 or 1.
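To make the problem concrete, here is a minimal NumPy sketch of the log-odds conversion. The function name `log_odds` is just illustrative, and it uses the natural log rather than `log10`, which only changes the result by a constant factor. The boundary values come out as infinities, which is why they cannot be fed to a linear regression as-is.

```python
import numpy as np

def log_odds(prob):
    """Natural-log odds, log(p / (1 - p)); infinite at exactly 0 or 1."""
    prob = np.asarray(prob, dtype=float)
    # log(0) and log1p(-1) are -inf; silence the divide-by-zero warnings
    with np.errstate(divide="ignore"):
        return np.log(prob) - np.log1p(-prob)

values = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
transformed = log_odds(values)
print(transformed)  # the first and last entries are -inf and +inf
```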

It seems that PPLs should be ideally placed to handle these boundary values, which usually arise from finite sampling rather than from some error in the model. In addition, in my case I’m measuring ‘average precision’, which as far as I can see is fine being exactly 0 or 1. Is anyone aware of some sort of reparameterization that will allow me to fit a sigmoid curve to these values?

Thanks!
PS the x axis is a ‘difficulty’ score, with the y axis being average precision that becomes very close to 1 at low difficulty and very close to 0 at high difficulty.

I don’t understand many things in your question. What is PPL? What is average precision? What is a fudge factor?

You can definitely run logistic regression with 0s and 1s in your data. I do this all the time!

Sorry if I got a bit carried away there with terminology. PPL refers to probabilistic programming languages like PyMC3, but really I’m referring to Bayesian approaches that shine in cases of finite sampling.

Average precision is a scoring metric that approximates the area under the precision recall curve, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html for a primer. It ranges from 0 to 1, inclusive.

If you’re interested in why we use a fudge factor when doing regression in the range (0,1) check the paper ‘A Better Lemon Squeezer? Maximum-Likelihood Regression With Beta-Distributed Dependent Variables’ - basically it is to avoid cases where the link function is undefined.
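For reference, the adjustment I associate with that paper is, if I have the formula right, `y' = (y * (N - 1) + 0.5) / N`, where `N` is the number of observations. A minimal sketch, with the function name `squeeze_unit_interval` and the `shrink` argument being my own labels rather than anything from the paper:

```python
import numpy as np

def squeeze_unit_interval(y, shrink=0.5):
    """Map values in [0, 1] into the open interval (0, 1).

    Assumed form of the 'lemon squeezer' adjustment:
    y' = (y * (N - 1) + shrink) / N, with N the sample size.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    return (y * (n - 1) + shrink) / n

raw = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
squeezed = squeeze_unit_interval(raw)
print(squeezed)  # endpoints move to 0.5 / N and 1 - 0.5 / N
```

The size of the shift depends on `N`, so the same raw value is adjusted differently in data sets of different sizes, which is part of why it feels like a fudge.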

I’m not interested in logistic regression here, since there are no binary outputs. The dependent variable ranges continuously between 0 and 1, inclusive.

OK, thanks for the extra detail.

I think that you still require a “fudge factor” as you call it, even in Bayesian models. If you look at the code for such transformations in pymc3, for example `invlogit` or `sigmoid`, they do contain fudge factors (specifically `eps=2.220446049250313e-16`).
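As a rough illustration of what such an epsilon does, here is a NumPy sketch of an inverse logit that is squeezed away from the boundaries. This is my own approximation of the idea, not a copy of the pymc3 source, and the name `invlogit_with_eps` is made up for this example.

```python
import sys
import numpy as np

def invlogit_with_eps(x, eps=sys.float_info.epsilon):
    """Inverse logit rescaled so the output stays inside [eps, 1 - eps]."""
    x = np.asarray(x, dtype=float)
    # exp(-x) overflows to inf for very negative x; that is harmless here
    with np.errstate(over="ignore"):
        return (1.0 - 2.0 * eps) / (1.0 + np.exp(-x)) + eps

bounded = invlogit_with_eps(np.array([-1000.0, 0.0, 1000.0]))
print(bounded)  # never exactly 0 or 1, even for extreme inputs
```

Without the `eps` terms, the first and last outputs would be exactly 0 and 1 in floating point, and taking a log of them downstream would give infinities.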

I am not aware of any reparametrization (other than adding a fudge factor). But why is this important? If your model concludes that the average is .9999 for the largest predictor, isn’t that good enough? I would imagine you have many more sources of noise in your data generating/collecting process that are larger than that .0001 you are focusing on.