Zero-inflated Bounded Continuous Outcome

Hi Everyone! I’m trying to model a continuous outcome that is bounded between 0 and 1. It is actually revenue scaled between 0 and 1 using the MinMaxScaler. In addition, I have a lot of users who don’t spend at all (i.e., their outcome is zero). You can see the distribution attached for reference. I tried modeling the zero/non-zero indicator as a Bernoulli and then the non-zeros with a Beta, but it is extremely slow. Are there any other distributions I could use for this purpose? Appreciate any help. Thank you in advance.

You indicate that you used the MinMaxScaler to transform your data. What scale was the original data on? Could you provide example data or code simulating the data? If it is revenue, you could use a hurdle-gamma or hurdle-lognormal, depending on the scale of the non-zero data.
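For intuition, a hurdle model treats the outcome as two separate processes: a Bernoulli part for whether a user spends at all, and a continuous part (gamma or lognormal) for how much the spenders spend. A minimal NumPy sketch of that data-generating process (all parameter values are hypothetical, chosen to mimic mostly-zero revenue):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
psi = 0.3  # probability that a user spends at all (hypothetical)

# Bernoulli "hurdle": did the user spend?
nonzero = rng.random(n) < psi

# gamma draws for the spenders, exact zeros for everyone else
revenue = np.where(nonzero, rng.gamma(shape=2.0, scale=1500.0, size=n), 0.0)

print((revenue == 0).mean())  # roughly 1 - psi = 0.7
```

This is the same structure PyMC's HurdleGamma encodes in one distribution.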

So in Bambi, it would look something like this:

import bambi as bmb

data = some_pandas_dataframe

formula = 'revenue ~ covariate1 + covariate2'

model = bmb.Model(formula=formula, data=data, family='hurdle_gamma')

idata = model.fit()


Thank you for the reply @zweli

As I mentioned, I’m modeling revenue, and it could range from $0 to $6000+, with the majority concentrated around $0 and heavily right-skewed. I will look into Bambi.

The hurdle-gamma should work well then. The parameters are defined here. If the tails are not being captured well, try the hurdle-lognormal (or family='hurdle_lognormal' in Bambi).

I really appreciate the quick response @zweli

Will the hurdle-gamma or lognormal take into account the boundedness of my data? In this case, I want to model my outcome on the 0-1 scale, not on the raw scale.

No, it won’t. But why do you want to keep it on the transformed scale?

We have a downstream process that consumes the posterior means estimated from this model, and the magnitude matters for them. I think the magnitude of those estimates changes if I change the target to its original scale. However, I could model on the raw target scale and simply divide the estimated mean and sd by the range (which is 6000 in my case), correct? Please let me know if this adjustment would be valid post-hoc.
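For reference, here is the adjustment I have in mind: since my minimum is $0, the MinMax transform is just a division by the range, so the mean and sd should both scale by the same constant. A quick NumPy check with simulated raw-scale values (hypothetical gamma parameters):

```python
import numpy as np

rng = np.random.default_rng(42)
# raw-scale draws standing in for posterior quantities (hypothetical)
raw = rng.gamma(shape=2.0, scale=1500.0, size=100_000)

scale_range = 6000.0  # max - min on the raw scale, with min = 0
scaled = raw / scale_range

# mean and sd both shrink by exactly the same factor
print(np.isclose(scaled.mean(), raw.mean() / scale_range))  # True
print(np.isclose(scaled.std(), raw.std() / scale_range))    # True
```

(If the minimum were nonzero, the sd would still just divide by the range, but the mean would additionally shift by min/range.)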

With the hurdle-gamma, your posterior will have psi, alpha, and beta means. So, I think it makes more sense to use your posterior predictive, which will generate data on the response scale using the posterior, and transform that with the MinMax scaler.


Something like this should work:

# import libraries
import arviz as az
import bambi as bmb
import pymc as pm
import pandas as pd

# generate fake revenue data according to your specification
dist = pm.HurdleGamma.dist(psi=.30, mu=3000, sigma=750)
draws = pm.draw(dist, draws=5000, random_seed=1).round(2)
data = pd.DataFrame({'revenue':draws})

# define simple Bambi model
model = bmb.Model(formula='revenue ~ 1', 
                  data=data, 
                  family='hurdle_gamma',
                  )

# fit the model
idata = model.fit()

# generate posterior predictive
model.predict(idata, kind='response')

# scale the data
posterior_predictive = idata.posterior_predictive.revenue
posterior_predictive_min = posterior_predictive.min(dim=('chain', 'draw'))
posterior_predictive_max = posterior_predictive.max(dim=('chain', 'draw'))
posterior_predictive_std = (posterior_predictive - posterior_predictive_min) / (posterior_predictive_max - posterior_predictive_min)
feature_min, feature_max = 0, 1  # target range; named to avoid shadowing the built-in min/max

# here is the scaled posterior predictive
posterior_predictive_scaled = posterior_predictive_std * (feature_max - feature_min) + feature_min

Thank you for taking time to share the code.

Actually, once the model fits well enough, I’m not interested in the posterior samples themselves. I care more about which predictors (and the magnitudes of their coefficients) played a role in that posterior prediction. For instance, I have about 20 predictors, and I’m using them in the mu, sigma parameterization.

I see. So, you can just use the posterior parameters via:

model.predict(idata, kind='response_params')
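If what you need downstream are the posterior means of the coefficients, you can pull them out of the resulting InferenceData directly. A self-contained sketch with a hand-built toy posterior (the variable names 'Intercept' and 'covariate1' and all values are hypothetical, just mimicking the structure model.fit() returns):

```python
import arviz as az
import numpy as np

# toy posterior with 2 chains x 500 draws, mimicking model.fit() output
rng = np.random.default_rng(7)
idata = az.from_dict(posterior={
    'Intercept': rng.normal(5.0, 0.10, size=(2, 500)),
    'covariate1': rng.normal(0.8, 0.05, size=(2, 500)),
})

# posterior mean of each coefficient, averaged over chains and draws;
# these scalars are what a downstream process would consume
means = idata.posterior.mean(dim=('chain', 'draw'))
print(float(means['covariate1']))
```

az.summary(idata) gives the same means along with sds and credible intervals, if you also need uncertainty downstream.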