Hello PyMC community!
I’ve been diligently studying the documentation and following various books that leverage PyMC in their examples. Now, as I’ve ventured into creating my first model, I find myself facing some uncertainties when it comes to assessing the quality of my model.
I’m currently working on a regression task where the relationship is described as y = Ax + B. Both x and y are observed data that originate from real-world lab tests.
To kickstart my analysis, I initially divided my dataset into sets of four paired data points. For each set, I fitted a linear regression and then derived the A and B parameters. This process resulted in distributions for both A and B, with A following a normal distribution and B following a log-normal distribution.
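For readers who want to reproduce this step, here is a minimal sketch of the per-group fits. The data below are a made-up stand-in for the lab measurements, and grouping consecutive points is an assumption; the original grouping rule is not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the lab measurements (the real data are not shown).
x = np.linspace(1.0, 10.0, 40)
y = 2.0 * x + 5.0 + rng.normal(0.0, 0.5, size=x.size)

group_size = 4  # four paired points per group, as described above
n_groups = x.size // group_size

slopes = np.empty(n_groups)
intercepts = np.empty(n_groups)
for g in range(n_groups):
    # Assumption: groups are formed from consecutive points.
    sl = slice(g * group_size, (g + 1) * group_size)
    # Degree-1 least-squares fit; polyfit returns [slope, intercept].
    slopes[g], intercepts[g] = np.polyfit(x[sl], y[sl], deg=1)

print("A estimates:", slopes)
print("B estimates:", intercepts)
```

The arrays `slopes` and `intercepts` then play the role of the empirical distributions of A and B.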
With these distributions in hand, I turned to SciPy to estimate the parameters of the normal and log-normal distributions associated with the A and B regression parameters. Subsequently, I generated synthetic data, introducing random noise based on these estimated parameters.
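A rough sketch of this step, under stated assumptions: `A_hat` and `B_hat` are hypothetical arrays of per-group estimates, the noise level `noise_sd` is invented, and drawing one (A, B) pair per synthetic point is only one possible way to generate the data. The variable names `mu_A`, `std_A`, `mu_B`, `std_B` match the ones used in the model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-group estimates standing in for the fitted A and B values.
A_hat = rng.normal(2.0, 0.3, size=200)
B_hat = rng.lognormal(mean=1.5, sigma=0.4, size=200)

# Normal fit: returns (loc, scale), i.e. mean and standard deviation.
mu_A, std_A = stats.norm.fit(A_hat)

# Log-normal fit with loc fixed at 0. SciPy returns (shape, loc, scale):
# shape is the log-scale sigma and scale is exp(log-scale mu).
s_B, _, scale_B = stats.lognorm.fit(B_hat, floc=0)
mu_B, std_B = np.log(scale_B), s_B

# Synthetic data: one (A, B) draw per point plus observation noise.
n = 100
x_syn = rng.uniform(1.0, 10.0, size=n)
A_syn = rng.normal(mu_A, std_A, size=n)
B_syn = rng.lognormal(mean=mu_B, sigma=std_B, size=n)
noise_sd = 0.5  # assumed noise level, not taken from the real data
y_syn = A_syn * x_syn + B_syn + rng.normal(0.0, noise_sd, size=n)
```

One detail worth checking in a setup like this: `pm.LogNormal` takes `mu` and `sigma` on the log scale, so the values passed to it should be the log-scale quantities (`np.log(scale)` and the shape parameter from SciPy), not the mean and standard deviation of the raw values.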
These synthetic data were then split into two categories: prior knowledge and new data. With this setup, I proceeded to construct my PyMC model:
with pm.Model() as linear_model_s_t:
    # 1 - Define the prior knowledge:
    # Intercept
    intercepto = pm.LogNormal('Intercepto', mu=mu_B, sigma=std_B)
    # Slope
    declive = pm.LogNormal('Declive', mu=mu_A, sigma=std_A)
    # Standard deviation: I have doubts about this one
    sigma = pm.HalfNormal('sigma', sigma=10)
    # 2 - Estimate the mean, which will be my Y
    # Y = Ax + B
    mu = declive * x_new + intercepto
    # 3 - Define the likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y_new)
    # 4 - Sample from the posterior
    trace = pm.sample(2000, tune=2000, chains=4, cores=2, random_seed=rng)
    # 5 - Generate prior and posterior predictive samples
    prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=rng)
    post_pred = pm.sample_posterior_predictive(trace, random_seed=rng)
However, it appears that my model might be overfitting the data, as indicated by the first row in the plot below. I’m seeking guidance on how I can enhance my model to achieve a better fit with my synthetic data.
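A numeric check that could complement the plot is the empirical coverage of the posterior predictive intervals. The sketch below is not tied to the actual model output: `pp_draws` is a hypothetical array of shape (n_draws, n_obs), which in practice would come from flattening the chain and draw dimensions of the posterior predictive samples for `y_obs`, and `y_new` is simulated here.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: y_new are the new observations and pp_draws holds
# one posterior predictive draw per row, shape (n_draws, n_obs).
n_obs, n_draws = 50, 4000
x_new = np.linspace(1.0, 10.0, n_obs)
y_new = 2.0 * x_new + 5.0 + rng.normal(0.0, 0.5, size=n_obs)
pp_draws = 2.0 * x_new + 5.0 + rng.normal(0.0, 0.5, size=(n_draws, n_obs))

# Central 94% predictive interval for each observation.
lower, upper = np.percentile(pp_draws, [3.0, 97.0], axis=0)

# Fraction of observations that fall inside their own interval.
coverage = np.mean((y_new >= lower) & (y_new <= upper))
print(f"Empirical coverage of the 94% interval: {coverage:.2f}")
```

If the coverage is close to the nominal 94%, the predictive spread is roughly calibrated. Coverage far below that would suggest overconfident predictions, and coverage near 100% with very wide intervals would suggest the noise term is too large.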
I’m not sure if I’m interpreting these graphs correctly, so any advice or insights you can offer would be greatly appreciated!
Thank you!