Hi, I’m new to probabilistic programming and have a probably newbie question that I couldn’t quite figure out how to solve yet:
I’m modelling sales data based on a set of variables as a linear regression problem, and I based it on the ‘Robust Regression’ example. Like in the example, I’ve used a Normal distribution for the weights and a StudentT for the observed values. While this actually gives ok results, it’s not an accurate model, since the sales numbers can only be positive or zero, so the sampling yields impossible traces.
I’ve tried reformulating it as a Poisson distribution but that didn’t work as well, and I’ve looked into censoring the data with pm.Bound/pm.Potential, but I couldn’t quite figure out how that works.
So, long story short, how would you model regression coefficients and the observed variable that are bounded [0,+inf)?
For context, my simple model is below:
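import pymc3 as pm
# X_shared, Y_shared and D (number of predictors) are defined elsewhere
with pm.Model() as model:
    # intercept
    alpha = pm.Normal('alpha', mu=0, sd=10)
    # coefficients for regression
    beta = pm.Normal('beta', mu=0, sd=10, shape=D)
    # precision of the Student-t likelihood
    lam = pm.HalfCauchy('lam', beta=10, testval=1.)
    # Expected value of outcome
    mu = alpha + beta.dot(X_shared.T)
    Y_obs = pm.StudentT('Y_obs', nu=1, mu=mu, lam=lam, observed=Y_shared)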
Y_obs is distributed as Student-t, which permits any value in (-inf, inf).
You have a few different options:
- Use a distribution that only permits positive values (inv gamma comes to mind), however this is pretty non standard, and any econometrician would balk at it.
- Use a link function to map your output on (-inf, inf) to (0, inf); in this case, softplus would be an appropriate choice.
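To make the link-function option concrete, here is a minimal sketch in plain Python (no PyMC) of a softplus link applied to a linear predictor. The coefficient values and feature rows are made up purely for illustration, and the helper name predict_mean is hypothetical rather than part of any library.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); maps any real number to a non-negative one."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def predict_mean(alpha, beta, x):
    """Expected value from an unconstrained linear predictor passed through softplus."""
    linear = alpha + sum(b * xi for b, xi in zip(beta, x))
    return softplus(linear)

# Hypothetical intercept, coefficients and feature rows, chosen only for illustration.
alpha = -1.0
beta = [0.5, -2.0]
rows = [[1.0, 0.0], [0.0, 3.0], [10.0, 0.0]]

means = [predict_mean(alpha, beta, row) for row in rows]
for row, m in zip(rows, means):
    print(row, round(m, 4))
```

The coefficients stay unconstrained (they can be negative), yet every predicted mean is positive, because the positivity constraint is enforced by the link rather than by the priors.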
A good starting point for you would be a logistic regression, where the regression model generates normally/student-t distributed logits on (-inf, inf), and the logistic function maps those logits to (0, 1), i.e. to probabilities. It’s the same idea here, except you’re mapping to a different support, using softplus instead of logistic.
Interpretation-wise, think of your variable on the real axis as a latent variable that represents ‘sales potential’. Positive values are roughly the predicted sales, while increasingly negative values let the model express stronger and stronger degrees of ‘no one wants to buy this’. You’re then transforming that latent ‘sales potential’ variable into an observed ‘sales’ variable.
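A small sketch of that interpretation, assuming nothing beyond the standard library: the same latent values are pushed through the logistic function (as in logistic regression) and through softplus (as suggested here). The latent values are arbitrary illustration points, not fitted quantities.

```python
import math

def softplus(z):
    """Stable log(1 + exp(z)): maps (-inf, inf) to (0, inf)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic(z):
    """Stable 1 / (1 + exp(-z)): maps (-inf, inf) to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Arbitrary latent 'sales potential' values, from strongly negative to strongly positive.
latent = [-10.0, -3.0, 0.0, 3.0, 10.0]
sales = [softplus(z) for z in latent]    # expected sales on (0, inf)
probs = [logistic(z) for z in latent]    # what a logistic link would give instead

for z, s, p in zip(latent, sales, probs):
    print(f"latent={z:6.1f}  softplus={s:8.4f}  logistic={p:6.4f}")
```

For large positive latent values softplus is close to the identity, so the latent value reads almost directly as predicted sales; for negative values it flattens towards zero without ever going below it.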
You could also standardize your data (subtract mean, divide by standard deviation), which would make it no longer all positive.
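As a rough sketch of what standardizing looks like, using only the standard library and made-up sales figures (the helper names are hypothetical):

```python
import statistics

def standardize(values):
    """Return z-scores plus the mean and standard deviation used to compute them."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / sd for v in values], mean, sd

def unstandardize(z_scores, mean, sd):
    """Map z-scores back to the original scale."""
    return [z * sd + mean for z in z_scores]

# Made-up, non-negative sales figures for illustration only.
sales = [0.0, 3.0, 5.0, 12.0, 40.0]

z, mean, sd = standardize(sales)
restored = unstandardize(z, mean, sd)
print([round(v, 3) for v in z])
```

Note that this only rescales the data: a likelihood with unbounded support fitted on the standardized scale can still produce values that fall below zero once they are mapped back.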
Thank you for the suggestions! I will definitely look into approaching it as a logistic regression problem. Right now I ended up using a Gamma distribution for the observed values, which seems to work pretty ok.
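For anyone trying the Gamma route, a minimal sketch of how a Gamma distribution can be described by a mean and standard deviation instead of shape and rate, again in plain Python. The numbers are illustrative assumptions, and gamma_shape_rate is a hypothetical helper, not a library function.

```python
import random

def gamma_shape_rate(mean, sd):
    """Convert a (mean, sd) pair to Gamma shape (alpha) and rate (beta)."""
    alpha = (mean / sd) ** 2
    beta = mean / sd ** 2
    return alpha, beta

rng = random.Random(0)  # fixed seed so the draws are reproducible

# Illustrative values only: expected sales of 20 with a standard deviation of 10.
alpha, beta = gamma_shape_rate(mean=20.0, sd=10.0)

# random.gammavariate takes shape and *scale*, so pass 1 / rate.
draws = [rng.gammavariate(alpha, 1.0 / beta) for _ in range(10000)]
sample_mean = sum(draws) / len(draws)
print(alpha, beta, round(min(draws), 3), round(sample_mean, 2))
```

All draws are strictly positive, which is what makes Gamma a natural fit for sales. One caveat: its support is (0, inf), so observations that are exactly zero may need separate handling.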