Hello PyMC community!
I’ve been studying the documentation and working through several books that use PyMC in their examples. Now that I’m building my first model, I’m unsure how to assess its quality.
I’m working on a regression analysis where the relationship is y = Ax + B. Both x and y are observed data that come from real-world lab tests.
To kickstart my analysis, I initially divided my dataset into sets of four paired data points. For each set, I fitted a linear regression and then derived the A and B parameters. This process resulted in distributions for both A and B, with A following a normal distribution and B following a log-normal distribution.
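A minimal sketch of this grouping step, assuming NumPy's `polyfit` for the per-group fit. The arrays `x` and `y` are simulated placeholders rather than the lab data, and the names `group_size`, `slopes`, and `intercepts` are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data standing in for the lab measurements (assumption).
x = rng.uniform(0.0, 10.0, size=200)
y = 2.0 * x + 1.5 + rng.normal(0.0, 0.5, size=200)

group_size = 4
n_groups = len(x) // group_size

slopes = np.empty(n_groups)      # one A estimate per group
intercepts = np.empty(n_groups)  # one B estimate per group

for i in range(n_groups):
    sl = slice(i * group_size, (i + 1) * group_size)
    # A degree-1 polyfit returns the coefficients as [slope, intercept].
    slopes[i], intercepts[i] = np.polyfit(x[sl], y[sl], deg=1)
```

With only four points per fit, each individual estimate of A and B is noisy, so the spread of `slopes` and `intercepts` reflects both real variation and fitting error.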
With these distributions in hand, I used SciPy to estimate the parameters of the normal and log-normal distributions for the A and B regression parameters. I then generated synthetic data, adding random noise based on these estimated parameters.
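One way this step could look, as a sketch under assumptions: the per-group estimates are simulated here, the log-normal location is fixed at 0, and the synthetic data are generated by drawing one A and one B per point, which may differ from how the noise was actually introduced. The names `mu_A`, `std_A`, `mu_B`, and `std_B` match the model further down; everything else is hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Placeholder per-group estimates of A and B (assumption, not real results).
slopes = rng.normal(2.0, 0.3, size=500)
intercepts = rng.lognormal(mean=0.5, sigma=0.4, size=500)

# norm.fit returns (mean, standard deviation).
mu_A, std_A = stats.norm.fit(slopes)

# lognorm.fit returns (shape, loc, scale); loc is fixed at 0 here.
shape_B, _, scale_B = stats.lognorm.fit(intercepts, floc=0)
# Convert to the mean and standard deviation of log(B),
# which is the parameterization pm.LogNormal(mu=..., sigma=...) uses.
mu_B, std_B = np.log(scale_B), shape_B

# Synthetic data: draw A and B per point, then apply y = A*x + B.
n = 300
x_syn = rng.uniform(0.0, 10.0, size=n)
A_syn = rng.normal(mu_A, std_A, size=n)
B_syn = rng.lognormal(mean=mu_B, sigma=std_B, size=n)
y_syn = A_syn * x_syn + B_syn
```

The conversion matters because SciPy's `lognorm` reports a shape and a scale, while the log-normal prior in PyMC expects the mean and standard deviation on the log scale.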
These synthetic data were then split into two categories: prior knowledge and new data. With this setup, I proceeded to construct my PyMC model:
    import pymc as pm

    with pm.Model() as linear_model_s_t:
        # 1 - Define prior knowledge
        # Intercept
        intercepto = pm.LogNormal('Intercepto', mu=mu_B, sigma=std_B)
        # Slope
        declive = pm.LogNormal('Declive', mu=mu_A, sigma=std_A)
        # Standard deviation: I am unsure about this choice
        sigma = pm.HalfNormal('sigma', sigma=10)

        # 2 - Estimate the mean, which will be my Y
        # Y = Ax + B ->
        mu = declive * x_new + intercepto

        # 3 - Define the likelihood
        y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y_new)

        # 4 - Sample from the model
        trace = pm.sample(2000, tune=2000, chains=4, cores=2, random_seed=rng)

        # 5 - Generate prior predictive and posterior predictive samples
        prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=rng)
        post_pred = pm.sample_posterior_predictive(trace, random_seed=rng)
However, my model seems to be overfitting the data, as indicated by the first row in the plot below. I’m looking for guidance on how to improve the model so that it fits my synthetic data better.
I’m not sure I’m interpreting these graphs correctly, so any advice or insights would be greatly appreciated!