What is the suitable regression model for proportions data with 0 and 1 values?

Mustapha_Momoh · June 13, 2024, 5:37am

Hello PyMC Community,

I am analyzing a dataset for a Regression Discontinuity Design where the dependent variable, y, represents the fraction of students at each unique score who rated their experience of taking an exam remotely above 4, i.e., above ‘somewhat satisfactory’. The independent variable, x, corresponds to these unique scores. However, my data includes exact 0s and 1s, for cases where the students at a performance level all rated either 4 or below (y = 0) or above 4 (y = 1). I have 2 questions:

1. Given that y can be exactly 0 or 1 as well as take fractional values in between, is it suitable to consider this as a straightforward probability distribution? If so, are there specific transformations you would recommend for y so that I can model this using a Beta regression model? Currently, my Beta regression model does not converge due to the presence of these boundary values.
2. If you think it is a straightforward probability distribution, what regression model would best suit this data, especially given the presence of 0 and 1 values? Perhaps a Zero-and-One inflated Beta distribution? If so, is there an existing PyMC tutorial on such a model that might help?

I appreciate any insights or suggestions.

I have attached simulated data below that illustrates the issue. The independent variable, test_performance, has been centered at 0 based on an observed discontinuity at 4. The threshold variable represents treatment assignment. Fractions is the dependent variable.
simulated.csv (1.2 KB)

jessegrabowski · June 13, 2024, 7:27am

You could try a LogitNormal (pm.LogitNormal in PyMC), which is expit(Normal). It lies between zero and one, and admits boundary values. It’s also parameterized by mu and sigma, which might be more familiar to work with than the beta distribution parameters (the scale doesn’t influence the location).
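To make the expit(Normal) idea concrete, here is a minimal sketch in plain Python (standard library only, not PyMC). The helper names `expit` and `draw_logit_normal` and the values of `mu` and `sigma` are made up for illustration. It draws from a Normal, pushes the draws through expit, and compares the medians for a small and a large sigma.

```python
import math
import random
import statistics


def expit(x):
    """Logistic sigmoid, mapping the real line into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_logit_normal(mu, sigma, n, seed=0):
    """Draw n values of expit(Normal(mu, sigma)) (hypothetical helper)."""
    rng = random.Random(seed)
    return [expit(rng.gauss(mu, sigma)) for _ in range(n)]


mu = 0.5  # illustrative location on the logit scale
narrow = draw_logit_normal(mu, sigma=0.3, n=20001)
wide = draw_logit_normal(mu, sigma=2.0, n=20001)

# The median on the (0, 1) scale stays close to expit(mu) for both sigmas,
# while the spread of the draws grows with sigma.
print("expit(mu):    ", expit(mu))
print("median narrow:", statistics.median(narrow))
print("median wide:  ", statistics.median(wide))
print("sd narrow/wide:", statistics.pstdev(narrow), statistics.pstdev(wide))
```

In this sketch sigma changes the spread but leaves the median at roughly expit(mu). The simulated draws stay strictly inside (0, 1), so it is worth checking how exact 0s and 1s in the data are evaluated before relying on this likelihood.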

ricardoV94 · June 13, 2024, 10:05am

For the zeros and ones you can either model them separately, or create a Mixture similar to how Hurdle Mixtures are implemented in PyMC, with components = [pm.DiracDelta.dist(0), pm.Truncated.dist(..., lower=0 + eps, upper=1 - eps), pm.DiracDelta.dist(1)], where 0 + eps and 1 - eps are the smallest value you can register above 0 and the largest value you can register below 1.

Either way, you will have to decide how to parametrize the weights, which inform the probability of observing 0, 1 or something in between.
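As a rough illustration of what the weights do, here is a sketch of a three-part log-likelihood in plain Python (standard library only, not the PyMC Mixture itself). It makes two assumptions that are not from the posts above: the weights come from a softmax over three logits, and the interior component is an untruncated Beta in a mean/precision parametrization rather than a truncated distribution. The function names and all numeric values are hypothetical.

```python
import math


def softmax(logits):
    """Turn three unconstrained logits into weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def beta_logpdf(y, mu, kappa):
    """Beta log-density with mean mu and precision kappa, for 0 < y < 1."""
    a = mu * kappa
    b = (1.0 - mu) * kappa
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log1p(-y)


def zoib_logp(y, weight_logits, mu, kappa):
    """Log-likelihood of one observation under a zero/one-inflated Beta."""
    w_zero, w_mid, w_one = softmax(weight_logits)
    if y == 0.0:
        return math.log(w_zero)   # point mass at exactly 0
    if y == 1.0:
        return math.log(w_one)    # point mass at exactly 1
    return math.log(w_mid) + beta_logpdf(y, mu, kappa)  # interior values


# Illustrative values only; in a regression the logits and mu
# would each depend on x through their own linear predictors.
fractions = [0.0, 0.25, 0.6, 1.0, 0.8]
total_logp = sum(
    zoib_logp(y, weight_logits=[-1.0, 1.5, -0.5], mu=0.55, kappa=8.0)
    for y in fractions
)
print(total_logp)
```

The softmax is only one option. Another is two separate logistic regressions, one for "is it a boundary value" and one for "is it a 1 given that it is on the boundary".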

Mustapha_Momoh · June 13, 2024, 11:06am

Thanks @jessegrabowski and @ricardoV94 for your suggestions. I will try them and report back.
