TwoFer: Prior Predictive Checks - An Applied Question & a Conceptual Question

As I get serious about Bayesian modeling, I have hit two roadblocks while implementing shrinkage priors. The first is applied, and it led me to a conceptual concern.

I used prior predictive checks as a starting point.

Applied Concern: Consider the following model (assume the data are standardized):


import pymc as pm


def laplace(X_train, y_train, model_type):
    # Laplace priors: put more weight near 0. (model_type is currently unused.)
    M = X_train.shape[1]
    with pm.Model() as model:
        # Hyperprior on the Laplace scale.
        # (Commented-out alternative: pm.Gamma('lambda', 1, 1), optionally with shape=M.)
        λ = pm.Uniform('lambda', 0, 5)
        β = pm.Laplace('β', mu=0.0, b=λ, shape=M)
        β0 = pm.Normal('β0', 0, 10.0)
        σ = pm.HalfNormal('σ', 2.5)

        pred = pm.Data('pred', X_train)

        mu = pm.Deterministic('mu', β0 + pm.math.dot(pred, β))
        obs = pm.Normal('obs', mu, σ, observed=y_train)
    return model

The prior predictive check is done as follows. The Y's and X's are all standardized, and there are 50 covariates. Because they are all mean-centered, I pick one of the X's at random and generate a range of x values between the min and max of that covariate. I then add the intercept, multiply the coefficient by the covariate, and generate prior-based predictions. Implicitly, the predictions are conditional on all other covariates held at their mean, i.e. 0. A rough sketch of this check is below.
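(This is only a sketch of what I described, assuming PyMC >= 4, where sample_prior_predictive returns an InferenceData object; names like grid and mu_grid, and the plotting choices, are just mine.)

import numpy as np
import matplotlib.pyplot as plt
import pymc as pm

rng = np.random.default_rng(0)
j = rng.integers(X_train.shape[1])                       # one covariate picked at random
grid = np.linspace(X_train[:, j].min(), X_train[:, j].max(), 100)

model = laplace(X_train, y_train, model_type=None)       # model_type is unused above
with model:
    prior = pm.sample_prior_predictive(500)

# The prior group has a single "chain" dimension; drop it.
beta = prior.prior["β"].values[0]                        # (500, M)
beta0 = prior.prior["β0"].values[0]                      # (500,)

# All other covariates are held at their mean (0), so mu depends only on covariate j.
mu_grid = beta0[:, None] + beta[:, [j]] * grid[None, :]  # (500, 100)

plt.plot(grid, mu_grid.T, color="C0", alpha=0.05)        # prior-implied regression lines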

That gives me the following graph.

[prior predictive plot for the first model]

Given that I know y lies in the range (-2, 2), I adjust the variances/scales of the priors and hyperpriors, essentially at random, until I get this:

[prior predictive plot for the adjusted model]

Looks better. The new model is

def laplace(X_train, y_train, model_type):
    # Laplace priors: put more weight near 0, with scales tightened via the prior predictive check.
    M = X_train.shape[1]
    with pm.Model() as model:
        # Tighter hyperprior on the scale (was Uniform(0, 5) above).
        # (Commented-out alternative: pm.Gamma('lambda', 1, 1), optionally with shape=M.)
        λ = pm.Uniform('lambda', 0, 0.25)
        # Fixed-scale Laplace, rescaled by λ.
        β_ = pm.Laplace('β_', mu=0.0, b=0.15, shape=M)
        β = pm.Deterministic('β', λ * β_)
        β0 = pm.Normal('β0', 0, 0.85)
        σ = pm.HalfNormal('σ', 1)

        pred = pm.Data('pred', X_train)

        mu = pm.Deterministic('mu', β0 + pm.math.dot(pred, β))
        obs = pm.Normal('obs', mu, σ, observed=y_train)
    return model
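For concreteness, a minimal way to compare the prior-implied outcomes against the known (-2, 2) range would be something like this (again assuming PyMC >= 4; the histogram is just my way of eyeballing it):

with laplace(X_train, y_train, model_type=None):
    prior = pm.sample_prior_predictive(500)

obs_draws = prior.prior_predictive["obs"].values.ravel()
plt.hist(obs_draws, bins=100, density=True, color="C0")
plt.axvline(-2, ls="--", color="k")   # the observed y stays within (-2, 2)
plt.axvline(2, ls="--", color="k")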

Question 1 (Applied): There has to be a better way of adjusting the priors. For example, should we first focus on the hyperpriors, then the next tier of priors, essentially tweaking each layer progressively until we reach the layer that generates the outcome, i.e. the likelihood? The examples above encourage sparsity, so should I leave the sparsity-inducing priors unchanged and go after the other parts of the model? Any reference on how to sequence this would be awesome. Also note that I have no other criterion in play to select between the two models: neither specification has any divergences, poor R-hats, etc.
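To make the question concrete, the kind of "one layer at a time" loop I have in mind would look something like this (a hypothetical refactor of the model above; lam_upper, b_scale, etc. are names I just made up):

def laplace_tunable(X_train, y_train, lam_upper=5.0, b_scale=1.0,
                    b0_scale=10.0, sigma_scale=2.5):
    M = X_train.shape[1]
    with pm.Model() as model:
        λ = pm.Uniform('lambda', 0, lam_upper)
        β_ = pm.Laplace('β_', mu=0.0, b=b_scale, shape=M)
        β = pm.Deterministic('β', λ * β_)
        β0 = pm.Normal('β0', 0, b0_scale)
        σ = pm.HalfNormal('σ', sigma_scale)
        pred = pm.Data('pred', X_train)
        mu = pm.Deterministic('mu', β0 + pm.math.dot(pred, β))
        pm.Normal('obs', mu, σ, observed=y_train)
    return model

# Step 1: sweep only the hyperprior scale, holding everything else fixed,
# and repeat the prior predictive check for each setting.
for lam_upper in (5.0, 1.0, 0.5, 0.25):
    with laplace_tunable(X_train, y_train, lam_upper=lam_upper):
        prior = pm.sample_prior_predictive(200)
    # ...inspect prior.prior_predictive["obs"] against the (-2, 2) range...
# Step 2: with the hyperprior fixed, move on to b_scale, then β0 and σ.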

Question 2 (Conceptual): My conceptual concern is: how do we know we are not overfitting the model to the training data? When should I stop tweaking? When the prior predictive checks exactly cover my observed outcomes (which seems impossible)? As a frequentist, I could toss all my hyperparameters into a search algorithm and run cross-validation. In the Bayesian case, given the relatively larger computation time, that is not an option (even with LOO, which does not seem to work well in particular cases, even though the model itself is fine: no divergences, nice R-hats, beautiful energy plots). Or is there a secret sauce?
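For reference, the LOO comparison I mean is just the standard ArviZ one, something like the sketch below (idata_wide and idata_tight are hypothetical InferenceData objects from pm.sample for the two specifications, with pointwise log-likelihood stored, e.g. via idata_kwargs={"log_likelihood": True} in PyMC >= 4):

import arviz as az

cmp = az.compare({"wide": idata_wide, "tight": idata_tight})   # LOO by default
print(cmp)
print(az.loo(idata_tight))   # per-model diagnostics, including Pareto-k warnings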
