Hello! I’ve been taking baby steps in developing my first model, as I’ve posted questions about it previously here and here. Given the sensitivity of the application, I’m proceeding with caution. While I’ve successfully identified effective priors for my parameters using TruncatedNormal or Normal distributions, I’m encountering a challenge with the likelihood. I have tried the LogNormal, TruncatedNormal, Normal, and StudentT distributions. The StudentT seems to work best, as displayed in the image below, and I believe that makes sense because it is more robust to outliers, which the data sometimes has.
My problem is that I don’t know how to set the StudentT parameters, so I performed a sensitivity analysis on them. I’m not a statistician (and I’m saddened that I didn’t learn Bayesian inference at university; it took entering industry for me to realize how important it is, but that’s life). I was following the robust regression example from here, where they place an Exponential prior on the Student-t’s nu parameter. To figure out what this parameter does, I ran a loop over the values kk = [0.1, 1, 5, 10, 20, 30, 50, 100]:
trace_lam = []
post_pred_lam = []
prior_pred_lam = []

for lam_vals in kk:
    with model_t:
        # Swap in the next lam value
        pm.set_data({"lam_test": lam_vals})
        trace_i = pm.sample(2500, tune=2000, chains=4, cores=2, random_seed=rng,
                            idata_kwargs={"log_likelihood": True})
        trace_lam.append(trace_i)
        prior_pred_lam.append(pm.sample_prior_predictive(samples=1000, random_seed=rng))
        post_pred_lam.append(pm.sample_posterior_predictive(trace_i, extend_inferencedata=True,
                                                            random_seed=rng))
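As a side check on what the sweep is actually varying: larger ν means lighter tails, and the Student-t approaches a Normal as ν grows. A minimal scipy sketch of that, independent of the model above (the threshold of 3 scale units is arbitrary, chosen just to expose the tails):

```python
import numpy as np
from scipy import stats

# Two-sided probability of falling more than 3 scale units from the center,
# for increasing degrees of freedom. For a Normal this is ~0.0027, so the
# values below show how quickly the extra tail mass disappears as nu grows.
for nu in [1, 2, 5, 10, 30, 100]:
    p_tail = 2 * stats.t.sf(3, df=nu)
    print(f"nu={nu:>3}: P(|T| > 3) = {p_tail:.4f}")
```

The posterior-predictive comparison from the sweep over kk is shown below.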
and this is what I got:
My point is: does the Student-t distribution always have such a sharp peak? The value kk = 50 gets closest to the observed data, but even that doesn’t seem close enough (I guess). How do I know whether it’s close enough? Is it possible to constrain it to be positive only? I was watching this video, which helped me a lot in understanding my priors, and what I know is that 95% of my observed data (y) should fall between 50 and 600, and should never be below 0 or above 4000. How do I introduce this information into my Student-t distribution? I have tried pymc.find_constrained_prior, but since the StudentT has 3 parameters and I must fix one of them, I guess the one I’m fixing is not a good choice, because when comparing, that model is worse than my current model.
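For what it’s worth, once ν is fixed, a symmetric Student-t that puts 95% of its mass between 50 and 600 can be solved for in closed form, which is one way to sanity-check what pymc.find_constrained_prior should return. A minimal scipy sketch (ν = 5 is an arbitrary choice here, not a recommendation):

```python
from scipy import stats

nu = 5                    # fixed by hand; any moderate value works for this sketch
lower, upper = 50, 600
mass = 0.95

mu = (lower + upper) / 2                 # symmetric density => center of the interval
q = stats.t.ppf(0.5 + mass / 2, df=nu)   # 97.5% quantile of a standard t_nu
sigma = (upper - lower) / (2 * q)

dist = stats.t(df=nu, loc=mu, scale=sigma)
print(mu, sigma)
print(dist.cdf(upper) - dist.cdf(lower))  # 0.95 by construction
```

Note that even with these values a Student-t still places a small amount of mass below 0, so the interval constraint alone does not enforce positivity.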
This is my complete current model, just in case:
with pm.Model() as model_t:
    y_obs = pm.MutableData('Y', y_prior, dims='observation')
    X = pm.MutableData('X', x_prior, dims='observation')
    lam = pm.MutableData('lam_test', kk[0])

    # Intercept
    intercept = pm.TruncatedNormal('Intercept', mu=mu_B, sigma=std_B, lower=0)
    # Slope
    slope = pm.Normal('Slope', mu=mu_A, sigma=sigma_A)
    sigma = pm.HalfNormal('sigma', sigma=sigma_y)

    # Degrees-of-freedom prior, shifted so that ν > 1
    ν_ = pm.Exponential('ν_', 1/lam)
    ν = pm.Deterministic('ν', ν_ + 1)

    y_model = pm.Deterministic('y_model', slope * X + intercept, dims='observation')

    # Likelihood
    y = pm.StudentT('y', mu=y_model, sigma=sigma, nu=ν, observed=y_obs)

    trace_3 = pm.sample(3500, tune=2000, chains=4, cores=2, random_seed=rng,
                        idata_kwargs={"log_likelihood": True})
    prior_pred_3 = pm.sample_prior_predictive(samples=1000, random_seed=rng)
    post_pred_3 = pm.sample_posterior_predictive(trace_3, extend_inferencedata=True,
                                                 random_seed=rng)
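On the positivity question: if I understand the recent PyMC versions correctly, a likelihood can be wrapped in pm.Truncated, e.g. pm.Truncated('y', pm.StudentT.dist(mu=y_model, sigma=sigma, nu=ν), lower=0, observed=y_obs), which renormalizes the density over (0, ∞) rather than just discarding negative draws. The effect is the same as inverse-CDF sampling restricted to the allowed range; a scipy sketch of that idea (the parameter values are illustrative only, not taken from the model):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
nu, mu, sigma = 5, 325.0, 107.0           # illustrative values only
base = stats.t(df=nu, loc=mu, scale=sigma)

# Truncation at 0: map uniforms drawn from [F(0), 1) through the inverse CDF.
# This is exactly sampling from the density renormalized over (0, inf).
u = rng.uniform(base.cdf(0), 1.0, size=10_000)
samples = base.ppf(u)

print(samples.min() > 0)   # True: no negative draws
print(base.cdf(0))         # the mass that the truncation removed
```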