What is the suitable regression model for proportions data with 0 and 1 values?

Hello PyMC Community,

I am analyzing a dataset for a Regression Discontinuity Design where the dependent variable, y, represents the fraction of students at each unique score who rated their experience of taking an exam remotely above 4, i.e., above ‘somewhat satisfactory’. The independent variable, x, corresponds to these unique scores. However, my data include exact 0s and 1s, which occur at scores where none or all of the students rated above 4. I have 2 questions:

  1. Given that y can be exactly 0 or 1 as well as any fractional value in between, is it suitable to treat this as a straightforward probability distribution? If so, are there specific transformations you would recommend for y so that I can model it with a Beta regression? Currently, my Beta regression model does not converge due to the presence of these boundary values.

  2. If you think it is a straightforward probability distribution, what regression model would best suit this data, especially given the presence of 0 and 1 values? Perhaps a Zero-and-One inflated Beta distribution? If so, is there an existing PyMC tutorial on such a model that might help?

I appreciate any insights or suggestions.

I have attached a simulated dataset below that illustrates the issue. The independent variable, test_performance, has been centered at 0 based on an observed discontinuity at 4. The threshold variable represents treatment assignment. Fractions is the dependent variable.
simulated.csv (1.2 KB)

You could try a logit-normal distribution (pm.LogitNormal in PyMC), which is expit(Normal). Its support is the open interval (0, 1), so exact 0s and 1s would still need to be handled separately, but it is parameterized by mu and sigma, which might be more familiar to work with than the beta distribution parameters (sigma does not shift the median, which stays at expit(mu)).
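
To make the expit(Normal) construction concrete, here is a minimal sketch in plain Python (standard library only, not PyMC). The helper names `expit` and `draw_logit_normal`, and the values of mu and sigma, are made up for illustration. It draws from the distribution at two different scales and checks that the median stays near expit(mu) and that the draws fall strictly inside the unit interval.

```python
import math
import random
import statistics


def expit(z: float) -> float:
    """Inverse logit: maps a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def draw_logit_normal(mu: float, sigma: float, size: int, seed: int = 0) -> list[float]:
    """Draw `size` values of expit(Normal(mu, sigma)) with a fixed seed."""
    rng = random.Random(seed)
    return [expit(rng.gauss(mu, sigma)) for _ in range(size)]


# Illustrative values only; they are not taken from the attached data.
mu = 0.5
narrow = draw_logit_normal(mu, sigma=0.3, size=20_000)
wide = draw_logit_normal(mu, sigma=2.0, size=20_000)

# The median is close to expit(mu) for both scales.
print(expit(mu), statistics.median(narrow), statistics.median(wide))

# Even with the wide scale, the draws stay strictly between 0 and 1.
print(min(wide), max(wide))
```

On the logit scale an observed 0 or 1 would correspond to minus or plus infinity, which is why the boundary observations need their own treatment.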

For the zeros and ones, you can either model them separately or create a Mixture, similar to how Hurdle Mixtures are implemented in PyMC, with components = [pm.DiracDelta.dist(0), pm.Truncated.dist(..., lower=0 + eps, upper=1 - eps), pm.DiracDelta.dist(1)], where 0 + eps and 1 - eps are the smallest value above 0 and the largest value below 1 that you can register.

Either way, you will have to decide how to parametrize the weights, which inform the probability of observing 0, 1 or something in between.
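
As a rough illustration of what the weights do, the sketch below writes out the log-likelihood that such a three-part mixture implies, in plain Python rather than PyMC. It assumes a Beta distribution for the in-between component (the suggestion above leaves that component open), and the names `beta_logpdf`, `zoib_loglik`, `w_zero`, and `w_one` are hypothetical. In a regression, the weights and the Beta parameters would presumably be linked to x, which is not shown here.

```python
import math


def beta_logpdf(y: float, a: float, b: float) -> float:
    """Log density of Beta(a, b) at a point y strictly inside (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y)


def zoib_loglik(y: float, w_zero: float, w_one: float, a: float, b: float) -> float:
    """Log-likelihood of one observation under a zero-and-one inflated Beta.

    w_zero and w_one are the probabilities of an exact 0 and an exact 1;
    the remaining mass, 1 - w_zero - w_one, goes to the Beta component.
    """
    w_mid = 1.0 - w_zero - w_one
    if min(w_zero, w_one, w_mid) <= 0.0:
        raise ValueError("all three weights must be strictly positive")
    if y == 0.0:
        return math.log(w_zero)
    if y == 1.0:
        return math.log(w_one)
    if 0.0 < y < 1.0:
        return math.log(w_mid) + beta_logpdf(y, a, b)
    return -math.inf  # values outside [0, 1] are impossible


# Made-up fractions, including exact 0s and 1s, for illustration only.
fractions = [0.0, 0.25, 0.5, 0.8, 1.0, 1.0]
total = sum(zoib_loglik(y, w_zero=0.2, w_one=0.3, a=1.0, b=1.0) for y in fractions)
print(total)
```

With fixed weights, the boundary observations only contribute through log(w_zero) and log(w_one), so the Beta parameters are informed by the interior fractions alone.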

Thanks @jessegrabowski and @ricardoV94 for your suggestions. I will try them and report back.