Modeling zero-inflation on a continuous outcome

I’m trying to model an outcome that’s effectively log-normal. However, there’s also a large “zero-inflation” on top of this. I put that in quotes because I’m only familiar with zero-inflation models for count data, so I’m not sure how to implement this in the context of a continuous distribution.

My thought was to model “is this value zero?” via logistic regression and “if it’s non-zero, what will the value be?” via a linear regression model. It’s not immediately clear to me how to tie the two together, though (or whether there’s a more standard way to approach this problem).

The outcome: [image]

Log-transform for values >0: [image]

Here’s what my attempt looked like:

with pm.Model() as lin_model:

    # Data 
    x_one = pm.Data("x_one", X_train["x_one"])
    x_two = pm.Data("x_two ", X_train["x_two"])
    x_three = pm.Data("x_three", X_train["x_three"])

    # Linear Model
    α = pm.Normal("α", 0, 10)
    β_one = pm.Normal("β_one", 0, 5)
    β_two = pm.Normal("β_two", 0, 5)
    β_three = pm.Normal("β_three", 0, 5)
    
    μ = pm.Deterministic(
        "μ",
        α + 
        β_one * x_one +
        β_two * x_two +
        β_three * x_three 
    )
    
    # Logistic Regression
    θ = pm.Deterministic("θ", pm.math.sigmoid(μ))
    log_response = pm.Bernoulli('y_logistic', p=θ)    

    # Linear Regression
    σ = pm.HalfCauchy("σ", 20)            
    likelihood = log_response * pm.Normal('y', 
        μ,
        σ,
        observed=y_train
    )
    
    model_trace = pm.sample(return_inferencedata=True)

But this fails with a “Wrong number of dimensions” error on the pm.Bernoulli line, probably because I’m using it in this weird way. Any suggestions on the approach here would be appreciated.

This is an off-the-cuff answer, so don’t take it as true fact, but one trick may be to just add 1 to everything and shift the distribution so you can model it.
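
For instance, purely as an illustration of that shift (not a recommendation), you could move everything onto a shifted log scale before modeling:

import numpy as np

# log1p(y) = log(y + 1): zeros map to 0 and positive values keep their order,
# so a plain Normal/linear model can then be fitted on the transformed scale
y_shifted = np.log1p(y_train)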

Are your values actually zero (hard boundary) or just close to zero (soft boundary)? Do the zeros have any informative value for your modeling goals, or are they just inconsequential “flukes”?

@ricardoV94 - It’s sort of both, but closer to a hard boundary: 70% of the data set is exactly 0. The zeros are very informative for the modeling goal; they’re cases in which an initial threshold was not passed, so no value was accumulated.

I needed a similar likelihood (actually two) for an insurance loss-cost frequency-severity model.

You sometimes see these referred to as “zero-augmented” likelihoods, since the non-zero marginal distribution isn’t defined at x = 0, though many people still call them zero-inflated.

In any case, I’ve found you can treat them as a hard mixture model, with a binary likelihood for zero/non-zero (Bernoulli etc.) fitted using the full dataset, and your non-zero marginal distribution of choice (here a lognormal) fitted using only the rows with non-zero values in the target feature.

McElreath has a nice paper that uses this principle: McElreath & Koster (2014), “Using Multilevel Models to Estimate Variation in Foraging Returns: Effects of Failure Rate, Harvest Size, Age, and Individual Heterogeneity”, Human Nature, 25, 100-120. Data and model-fitting scripts: GitHub - rmcelreath/mcelreath-koster-human-nature-2014.
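
To make that concrete, here’s a minimal sketch of the two-part (hurdle) approach, written against the PyMC3-style API used in the original post and assuming a single illustrative predictor x_one. The zero/non-zero indicator is fitted on every row, while the lognormal part only sees the positive rows; names like a0, b0, a1, b1 are placeholders.

import numpy as np
import pymc3 as pm

y = np.asarray(y_train)
x = np.asarray(X_train["x_one"])

is_positive = (y > 0).astype(int)   # zero/non-zero indicator, full dataset
x_pos, y_pos = x[y > 0], y[y > 0]   # positive rows only, for the severity part

with pm.Model() as hurdle_model:
    # Part 1: probability of a non-zero outcome (logistic regression on all rows)
    a0 = pm.Normal("a0", 0, 5)
    b0 = pm.Normal("b0", 0, 5)
    p_nonzero = pm.math.sigmoid(a0 + b0 * x)
    pm.Bernoulli("obs_nonzero", p=p_nonzero, observed=is_positive)

    # Part 2: magnitude given non-zero (lognormal regression on positive rows only)
    a1 = pm.Normal("a1", 0, 5)
    b1 = pm.Normal("b1", 0, 5)
    σ = pm.HalfNormal("σ", 2)
    pm.Lognormal("obs_value", mu=a1 + b1 * x_pos, sigma=σ, observed=y_pos)

    hurdle_trace = pm.sample(return_inferencedata=True)

Because the two likelihoods share no parameters, the posterior factorizes, so fitting them jointly like this is equivalent to fitting two separate models; the joint version is just convenient if you later want the parts to share predictors or hierarchical structure.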


That’s great info - thanks @jonsedar!


Hi everyone,
Try zero-inflated gamma regression. The distribution approximates both the raw data and the log-transformed data well.
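
If you want to try that within the two-part sketch above, one possible variant (untested here) is to swap the lognormal severity likelihood for a Gamma with a log link on the positive rows, keeping the Bernoulli zero/non-zero part unchanged. This reuses a1, b1, x_pos and y_pos from the earlier sketch and lives inside the same model context:

    # severity part only; the Bernoulli zero/non-zero likelihood stays as before
    μ_pos = pm.math.exp(a1 + b1 * x_pos)   # log link keeps the Gamma mean positive
    σ_pos = pm.HalfNormal("σ_pos", 2)
    pm.Gamma("obs_value_gamma", mu=μ_pos, sigma=σ_pos, observed=y_pos)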