Hi, I am really new to pymc and I tried to build a multivariate linear model with my beta coefficient being horeshoe prior with the following code and I could not manage to lower my errors to improve model’s performance, what would you suggest me to try to have better results ?
with pm.Model(coords={"predictors": X_train.columns.values}) as model:
X = pm.Data("X", X_train, dims=('obs', 'predictors'))
y = pm.Data("y", y_train, dims=('obs'))
# Prior on error SD
sigma = pm.Normal("sigma", 2300, 10)
# Global shrinkage prior
tau = pm.HalfStudentT("tau", 10, D0 / (D - D0) * sigma / np.sqrt(N))
# Local shrinkage prior
lam = pm.HalfStudentT("lam", 5, dims="predictors")
c2 = pm.InverseGamma("c2", 5, 1)
z = pm.Normal("z", 25, 10, dims="predictors")
# Shrunken coefficients
beta = pm.Deterministic("beta", z * tau * lam * at.sqrt(c2 / (c2 + tau**2 * lam**2)), dims="predictors")
# No shrinkage on intercept
beta0 = pm.Normal("beta0", 100, 25.0)
scores = pm.Normal("scores", beta0 + at.dot(X, beta), sigma, observed=y, dims=('obs',))
Training :
with model:
idata = pm.sample(draws=1000, tune=2000, chains=4, random_seed=42)
Using test observation to evaluate the model :
with model:
pm.set_data({"X": X_test})
posterior_test = pm.sample_posterior_predictive(idata, predictions=True, extend_inferencedata=False)
At the end my model has an r2 score of 0.72 and the predictions seems to follow the trend of my y_test but my RMSE is way to high (14000 for median sample at 30000). I know it depends a lot on my data but I am trying here to explore more bayesian modeling to improve accuracy.
About the data : I didn’t scale any data and I have both binary and continuous features (690 samples, 866 features)
Thanks !