I’m trying to model an outcome that’s effectively log-normal. However, there’s also a large “zero-inflation” on top of this. I put that in quotes because I’m only familiar with zero-inflation models for count data, so I’m not sure how to implement this for a continuous distribution.
My thought was to model “is this value zero” via logistic regression and “if it’s non-zero, what will the value be” via a linear regression model on the log scale. It’s not immediately clear to me how to tie the two together, though. (Or whether there’s a more standard way to approach this problem.)
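One way to see how the two parts could fit together is to write out the per-observation log-likelihood of a two-part (often called "hurdle") log-normal model: a point mass at zero plus a log-normal density for positive values. The sketch below is only an illustration in plain Python, not PyMC code; the function names, the coefficient values, and the choice to give each part its own linear predictor are assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function, mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def hurdle_lognormal_logpdf(y, p_nonzero, mu, sigma):
    """Log-likelihood of one observation under a two-part log-normal model.

    p_nonzero: probability that the outcome is non-zero, in (0, 1)
               (the logistic-regression part)
    mu, sigma: mean and standard deviation of log(y) given y > 0
               (the linear-regression part)
    """
    if y < 0:
        raise ValueError("y must be non-negative")
    if y == 0:
        # Zero outcomes only inform the logistic part.
        return math.log1p(-p_nonzero)
    log_y = math.log(y)
    # Normal log-density of log(y) ...
    normal_part = (
        -0.5 * math.log(2 * math.pi)
        - math.log(sigma)
        - 0.5 * ((log_y - mu) / sigma) ** 2
    )
    # ... minus log(y), the Jacobian term that turns it into a log-normal density.
    return math.log(p_nonzero) + normal_part - log_y


# Hypothetical coefficients: each part gets its own intercept and slope.
x = 1.5
p = sigmoid(-0.2 + 0.8 * x)   # P(y > 0 | x)
mu = 1.0 + 0.3 * x            # E[log y | y > 0, x]

ll_zero = hurdle_lognormal_logpdf(0.0, p, mu, sigma=0.5)
ll_positive = hurdle_lognormal_logpdf(4.2, p, mu, sigma=0.5)
```

Because the log-likelihood splits into a term that depends only on whether `y` is zero and a term that depends only on the positive values, the two regressions can in principle be fit as two observed likelihoods in one model: a Bernoulli on the indicator `y > 0` and a Normal on `log(y)` restricted to the positive observations.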
The outcome:
Log-transform for values >0:
Here’s what my attempt looked like:
with pm.Model() as lin_model:
    # Data
    x_one = pm.Data("x_one", X_train["x_one"])
    x_two = pm.Data("x_two", X_train["x_two"])
    x_three = pm.Data("x_three", X_train["x_three"])

    # Linear model (shared by both parts below)
    α = pm.Normal("α", 0, 10)
    β_one = pm.Normal("β_one", 0, 5)
    β_two = pm.Normal("β_two", 0, 5)
    β_three = pm.Normal("β_three", 0, 5)
    μ = pm.Deterministic(
        "μ",
        α
        + β_one * x_one
        + β_two * x_two
        + β_three * x_three,
    )

    # Logistic regression: is the value non-zero?
    θ = pm.Deterministic("θ", pm.math.sigmoid(μ))
    log_response = pm.Bernoulli("y_logistic", p=θ)

    # Linear regression: value given that it is non-zero
    σ = pm.HalfCauchy("σ", 20)
    likelihood = log_response * pm.Normal(
        "y",
        μ,
        σ,
        observed=y_train,
    )

    model_trace = pm.sample(return_inferencedata=True)
But this fails with a “Wrong number of dimensions” error on the pm.Bernoulli line, probably because I’m using it in this weird way. Any suggestions on approach here would be appreciated.