Zero-inflated Bounded Continuous Outcome

Hi Everyone! I’m trying to model a continuous outcome that is bounded between 0 and 1. It is actually revenue scaled between 0 and 1 using the MinMaxScaler. In addition, I have a lot of users who don’t spend at all (i.e., their outcome is zero). You can see the distribution attached for reference. I tried modeling the zero/non-zero indicator as a Bernoulli and then the non-zeros with a Beta, but it is extremely slow. Are there any other distributions I could use for this purpose? Appreciate any help. Thank you in advance.

You indicate that you used the MinMaxScaler to transform your data. What scale was the original data on? Could you provide example data or code simulating the data? If it is revenue, you could use a hurdle-gamma or hurdle-lognormal, depending on the scale of the non-zero data.
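For intuition, a hurdle model treats the outcome as two separate processes: a Bernoulli part for whether a user spends at all, and a continuous part (gamma or lognormal) for how much the spenders spend. A minimal NumPy sketch of that data-generating process (all parameter values are hypothetical, chosen to mimic mostly-zero revenue):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
psi = 0.3  # probability that a user spends at all (hypothetical)

# Bernoulli "hurdle": did the user spend?
nonzero = rng.random(n) < psi

# gamma draws for the spenders, exact zeros for everyone else
revenue = np.where(nonzero, rng.gamma(shape=2.0, scale=1500.0, size=n), 0.0)

print((revenue == 0).mean())  # roughly 1 - psi = 0.7
```

This is the same structure PyMC's HurdleGamma encodes in one distribution.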

So in Bambi, it would look something like this:

import bambi as bmb

data = some_pandas_dataframe

formula = 'revenue ~ covariate1 + covariate2'

model = bmb.Model(formula=formula, data=data, family='hurdle_gamma')

idata = model.fit()


Thank you for the reply @zweli

As I mentioned, I’m modeling revenue, and it could range from $0 to $6000+, with the majority concentrated around $0 and heavily right-skewed. I will look into Bambi.

The hurdle-gamma should work well then. The parameters are defined here. If the tails are not being captured well, try the hurdle-lognormal (or family='hurdle_lognormal' in Bambi).

I really appreciate the quick response @zweli

Will the hurdle-gamma or lognormal take into account the boundedness of my data? In this case, I want to model my outcome on the 0-1 scale, not on the raw scale.

No, it won’t. But why do you want to keep it on the transformed scale?

We have a downstream process that consumes the posterior means estimated from this model, and the magnitude matters for them. I think the magnitude of those estimates changes if I change the target to its original scale. However, I could model on the raw target scale and simply divide the estimated mean and sd by the range (which is 6000 in my case), correct? Please let me know if this adjustment would be valid post-hoc.
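For reference, here is the adjustment I have in mind: since my minimum is $0, the MinMax transform is just a division by the range, so the mean and sd should both scale by the same constant. A quick NumPy check with simulated raw-scale values (hypothetical gamma parameters):

```python
import numpy as np

rng = np.random.default_rng(42)
# raw-scale draws standing in for posterior quantities (hypothetical)
raw = rng.gamma(shape=2.0, scale=1500.0, size=100_000)

scale_range = 6000.0  # max - min on the raw scale, with min = 0
scaled = raw / scale_range

# mean and sd both shrink by exactly the same factor
print(np.isclose(scaled.mean(), raw.mean() / scale_range))  # True
print(np.isclose(scaled.std(), raw.std() / scale_range))    # True
```

(If the minimum were nonzero, the sd would still just divide by the range, but the mean would additionally shift by min/range.)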

With the hurdle-gamma, your posterior will have psi, alpha, and beta means. So, I think it makes more sense to use your posterior predictive, which will generate data on the response scale using the posterior, and transform that with the MinMax scaler.


Something like this should work:

# import libraries
import arviz as az
import bambi as bmb
import pymc as pm
import pandas as pd

# generate fake revenue data according to your specification
dist = pm.HurdleGamma.dist(psi=.30, mu=3000, sigma=750)
draws = pm.draw(dist, draws=5000, random_seed=1).round(2)
data = pd.DataFrame({'revenue':draws})

# define simple Bambi model
model = bmb.Model(formula='revenue ~ 1', 
                  data=data, 
                  family='hurdle_gamma',
                  )

# fit the model
idata = model.fit()

# generate posterior predictive
model.predict(idata, kind='response')

# scale the data
posterior_predictive = idata.posterior_predictive.revenue
posterior_predictive_min = posterior_predictive.min(dim=('chain', 'draw'))
posterior_predictive_max = posterior_predictive.max(dim=('chain', 'draw'))
posterior_predictive_std = (posterior_predictive - posterior_predictive_min) / (posterior_predictive_max - posterior_predictive_min)
feature_min, feature_max = 0, 1  # target range; named to avoid shadowing the built-in min/max

# here is the scaled posterior predictive
posterior_predictive_scaled = posterior_predictive_std * (feature_max - feature_min) + feature_min

Thank you for taking time to share the code.

Actually, once the model fits well enough, I’m not interested in the posterior samples themselves. I care more about which predictors (and the magnitudes of their coefficients) played a role in that posterior prediction. For instance, I have about 20 predictors, and I’m using them in the mu, sigma parameterization.

I see. So, you can just use the posterior parameters via:

model.predict(idata, kind='response_params')
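If what you need downstream are the posterior means of the coefficients, you can pull them out of the resulting InferenceData directly. A self-contained sketch with a hand-built toy posterior (the variable names 'Intercept' and 'covariate1' and all values are hypothetical, just mimicking the structure model.fit() returns):

```python
import arviz as az
import numpy as np

# toy posterior with 2 chains x 500 draws, mimicking model.fit() output
rng = np.random.default_rng(7)
idata = az.from_dict(posterior={
    'Intercept': rng.normal(5.0, 0.10, size=(2, 500)),
    'covariate1': rng.normal(0.8, 0.05, size=(2, 500)),
})

# posterior mean of each coefficient, averaged over chains and draws;
# these scalars are what a downstream process would consume
means = idata.posterior.mean(dim=('chain', 'draw'))
print(float(means['covariate1']))
```

az.summary(idata) gives the same means along with sds and credible intervals, if you also need uncertainty downstream.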