How to interpret posterior/prior predictive checks

Hello PyMC community!

I’ve been diligently studying the documentation and following various books that leverage PyMC in their examples. Now, as I’ve ventured into creating my first model, I find myself facing some uncertainties when it comes to assessing the quality of my model.

I’m currently working on a regression analysis where the relationship is described as y = Ax + B. Both x and y are observed data that originate from real-world lab tests.

To kickstart my analysis, I initially divided my dataset into sets of four paired data points. For each set, I fitted a linear regression and then derived the A and B parameters. This process resulted in distributions for both A and B, with A following a normal distribution and B following a log-normal distribution.
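For readers who want to reproduce this step, here is a minimal sketch of the group-wise fitting described above. It assumes NumPy and uses made-up stand-in data; the names `x`, `y`, `group_size`, `slopes`, and `intercepts` are illustrative and not taken from the original analysis.

```python
import numpy as np

# Hypothetical stand-in data; the real x and y come from the lab tests.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=40)
y = 2.0 * x + 1.5 + rng.normal(0.0, 0.5, size=40)

group_size = 4
n_groups = len(x) // group_size  # any leftover points are dropped

slopes = np.empty(n_groups)
intercepts = np.empty(n_groups)
for i in range(n_groups):
    sl = slice(i * group_size, (i + 1) * group_size)
    # np.polyfit with deg=1 returns [slope, intercept] (highest power first).
    slopes[i], intercepts[i] = np.polyfit(x[sl], y[sl], deg=1)

print("A (slope) estimates:", np.round(slopes, 2))
print("B (intercept) estimates:", np.round(intercepts, 2))
```

With only four points per group, each individual estimate is noisy, so the spread of `slopes` and `intercepts` reflects estimation error as well as any real variation in A and B.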

With these distributions in hand, I turned to Scipy to estimate the parameters for the normal and log-normal distributions associated with the A and B regression parameters. Subsequently, I generated synthetic data, introducing random noise based on these estimated parameters.
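A rough sketch of this step, under stated assumptions, might look like the following. It uses plain NumPy instead of SciPy: for a normal distribution the maximum-likelihood estimates are the sample mean and standard deviation, and for a log-normal the same estimates are taken on the log of the data. The stand-in samples, the noise level `noise_sd`, and the way the synthetic points are generated are all assumptions, since the post does not spell them out.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-group estimates standing in for the real regression results.
A_samples = rng.normal(2.0, 0.3, size=200)      # slopes, roughly normal
B_samples = rng.lognormal(0.4, 0.25, size=200)  # intercepts, roughly log-normal

# Normal: sample mean and standard deviation.
mu_A, std_A = A_samples.mean(), A_samples.std()

# Log-normal: fit a normal to log(B); these are log-scale parameters.
log_B = np.log(B_samples)
mu_B, std_B = log_B.mean(), log_B.std()

# Synthetic data: draw one (A, B) pair per point and add measurement noise.
n = 100
x_syn = rng.uniform(0.0, 10.0, size=n)
A_draw = rng.normal(mu_A, std_A, size=n)
B_draw = rng.lognormal(mu_B, std_B, size=n)
noise_sd = 0.5  # assumed noise level, not taken from the post
y_syn = A_draw * x_syn + B_draw + rng.normal(0.0, noise_sd, size=n)

# Split into "prior knowledge" and "new data" halves.
x_prior, x_new = x_syn[: n // 2], x_syn[n // 2 :]
y_prior, y_new = y_syn[: n // 2], y_syn[n // 2 :]
```

One detail worth checking when SciPy is used instead: `scipy.stats.lognorm.fit` returns `(s, loc, scale)`, where `s` is the log-scale sigma and, with `loc` fixed at 0, `log(scale)` is the log-scale mu. `pm.LogNormal` takes `mu` and `sigma` on the log scale, so passing `scale` directly as `mu` would give a very different prior.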

These synthetic data were then split into two categories: prior knowledge and new data. With this setup, I proceeded to construct my PyMC model:

with pm.Model() as linear_model_s_t:
    # 1 - Define the prior knowledge:
    intercepto = pm.LogNormal('Intercepto', mu=mu_B, sigma=std_B)
    declive = pm.LogNormal('Declive', mu=mu_A, sigma=std_A)
    # Standard deviation: I have doubts about this prior
    sigma = pm.HalfNormal('sigma', sigma=10)
    # 2 - Estimate the mean, which will be my Y
    # Y = Ax + B
    mu = declive * x_new + intercepto
    # 3 - Define the likelihood
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y_new)
    # 4 - Sample from the posterior
    trace = pm.sample(2000, tune=2000, chains=4, cores=2, random_seed=rng)
    # 5 - Generate prior and posterior predictive samples
    prior_pred = pm.sample_prior_predictive(samples=1000, random_seed=rng)
    post_pred = pm.sample_posterior_predictive(trace, random_seed=rng)

However, it appears that my model might be overfitting the data, as indicated by the first row in the plot below. I’m seeking guidance on how I can enhance my model to achieve a better fit with my synthetic data.

I’m not sure if I’m interpreting these graphs correctly, so any advice or insights you can offer would be greatly appreciated!

Thank you!

Hi @J_V. Can you explain the plots a bit more and why you think they indicate overfitting?

I guess that in ax[0,0], the distribution is a little bit to the left, and in ax[0,1], the left part is also not in the gray zone. Is that reasoning correct?
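One way to put numbers on a reading like "a bit to the left" or "not in the gray zone" is to compare the observations with the predictive draws directly. The sketch below is only an illustration with NumPy and made-up stand-in arrays: `ppc_draws` is a hypothetical array of shape (n_draws, n_obs) that in practice would hold the posterior predictive samples of `y_obs`, and the interval is a central 94% percentile interval, not the HDI that plotting tools may shade.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical stand-ins: y_new plays the role of the observed data and
# ppc_draws the role of posterior predictive samples, shape (n_draws, n_obs).
n_obs, n_draws = 50, 4000
x_new = rng.uniform(0.0, 10.0, size=n_obs)
y_new = 2.0 * x_new + 1.5 + rng.normal(0.0, 0.5, size=n_obs)
ppc_draws = 2.0 * x_new + 1.5 + rng.normal(0.0, 0.5, size=(n_draws, n_obs))

# 1) Coverage: fraction of observations inside the central 94% interval.
lower, upper = np.percentile(ppc_draws, [3, 97], axis=0)
coverage = np.mean((y_new >= lower) & (y_new <= upper))

# 2) Predictive p-value for the mean: share of simulated datasets whose
#    mean is at least as large as the observed mean.
p_mean = np.mean(ppc_draws.mean(axis=1) >= y_new.mean())

print(f"94% interval coverage: {coverage:.2f}")
print(f"p-value for the mean:  {p_mean:.2f}")
```

As a rough guide, coverage far below 0.94 suggests the predictive distribution is too narrow or shifted relative to the data, and a p-value close to 0 or 1 points to a systematic shift in one direction. Neither number by itself demonstrates overfitting.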