As I get serious about Bayesian modeling, I hit two roadblocks on implementing shrinkage priors. The first is applied - which led me to a conceptual concern.
I used prior predictive checks as a starting point.
Applied Concern: Consider the following model (assume data is standardized):
def laplace(X_train, y_train, model_type):
#Laplace priors- put more weight near 0
M = X_train.shape[1]
with pm.Model() as model:
λ = pm.Uniform('lambda',0,5)#Gamma('lambda', 1,1, shape=M) #, shape=M
β =pm.Laplace('β', mu=0.0, b=λ , shape=M)
β0 = pm.Normal("β0", 0, 10.)
σ = pm.HalfNormal("σ", 2.5)
pred = pm.Data("pred", X_train)
mu = pm.Deterministic('mu',β0 + pm.math.dot(pred, β))
obs = pm.Normal("obs",mu , σ, observed=y_train)
return model
The prior predictive check is done as follows - Y’s and X’s are all standardized. There are 50 covariates, as they are all mean centered, I pick one of the X’s at random and generate a range of x values between the min and max of that covariate. Then I add the intercept, multiply the coefficient with the covariate and generate prior based predictions. Implicitly the predictions are conditional on all other covariates measured at the mean i.e. 0.
That gives me the following graph.
Given that I know that y is in the range of (-2,2), I fix the variances/ scales for priors/hyperpriors effectively at random, till I get this.
Looks better. The new model is
def laplace(X_train, y_train, model_type):
#Laplace priors- put more weight near 0
M = X_train.shape[1]
with pm.Model() as model:
λ = pm.Uniform('lambda',0,0.25)#Gamma('lambda', 1,1, shape=M) #, shape=M
β_ =pm.Laplace('β_', mu=0.0, b=0.15 , shape=M)
β =pm.Deterministic('β', λ *β_)
β0 = pm.Normal("β0", 0, 0.85)
σ = pm.HalfNormal("σ",1)
pred = pm.Data("pred", X_train)
mu = pm.Deterministic('mu',β0 + pm.math.dot(pred, β))
obs = pm.Normal("obs",mu , σ, observed=y_train)
return model
Question 1 (Applied): There has to be a better way of adjusting priors - For example should we first focus on hyper priors , then the next tier of priors, - essentially tweaking each progressively till we get to the layer generating the outcome i.e. likelihood. The above examples encourage sparsity - so should I try to leave the sparsity inducing priors unchanged and go after the other parts of the model? Any reference on sequencing this would be awesome. ALso note - I have no other criterion in play to select between the two models - Neither specification has any divergences, poor Rhats etc.
Question 2 (Conceptual): My conceptual concern is how do we know we are not overfitting the model to the training data? When should I stop tweaking? When the prior predicitve checks exactly cover my observed outcomes (seems impossible) ? As a frequentist, I can toss all my hyperparameters into a search algorithm and run cross validation. In the bayesian case, given the relatively larger computation time, that is not an option (even with LOO - which doesnt seem to work well in particular cases - even thought the model itself is fine - no divergences, nice Rhats, beautiful energy plots). Or is this a secret sauce?