Posterior predictive check question

bayesian_padawan · November 16, 2024, 3:26pm

Hi,

I have a really simple toy model which only estimates the mean of a Normal with a fixed sigma.

The data generating process is:

true_mean_of_generating_process = 5
true_sigma_of_generating_process = 2

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=100)

The model then is:

with pm.Model() as model:

    # prior with the true sigma
    mu = pm.Normal("mu", mu=0, sigma=true_sigma_of_generating_process)

    # to hand over the fixed sigma to the likelihood
    # we have to use a PyMC data type
    sigma = pm.Data("sigma", true_sigma_of_generating_process)

    # likelihood
    likelihood = pm.Normal("obs", mu=mu, sigma=sigma, observed=observed_data)

    # sampling
    idata = pm.sample(1000, tune=500, return_inferencedata=True, progressbar=False)

After the sampling, I plot the posterior ppc by:

az.plot_ppc(idata, group="posterior", figsize=std_figsize, textsize=text_size, num_pp_samples=200)

I get the following plot:

What I do not understand about this plot is that the generated Normal distributions do have different heights. I would have assumed that sigma must be fixed, so all generated posterior predictive Normals must have the same sigma and therefore the same height.

Obviously, sigma is variing here. Why is this the case?

Best regards,
Matthias

cluhmann · November 16, 2024, 5:04pm

Sigma is not varying. As you said, you have fixed it to be exactly true_sigma_of_generating_process = 2. However, you are not plotting obs (which is normal by definition). Instead, you are plotting posterior draws from obs. In fact, you’re not even plotting posterior draws from obs, you are plotting a KDE of posterior draws from obs. So what you are seeing is an approximation (KDE) of an approximation (draws from a normal distribution) of obs. If you want to see something closer to what you expect, you can just take the one free parameter from your model (mu) and pump it through the true data generating process (the process that is also reflected in your model):

mus = az.extract(idata, var_names="mu", num_samples=100)
x = np.linspace(-5, 15, 100)

for mu in mus:
    plt.plot(x, stats.norm.pdf(x, mu, true_sigma_of_generating_process), c="b")
plt.show()

bayesian_padawan · November 16, 2024, 5:40pm

I actually did what you said before, and that is why this point became so obvious to me:

Could you please explain to me what you mean with “obs”? I saw such a component in the posterior_predictive group of InferenceData, but I actually didn’t understand what is represented in it.

cluhmann · November 16, 2024, 5:42pm

Alternatively, you can just crank up the amount of data you are modeling (i.e., the number of observations). This will make the various approximations “less approximate”:

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=100_000)
...
az.plot_ppc(idata, group="posterior", num_pp_samples=100)

cluhmann · November 16, 2024, 6:17pm

obs is what you define in your model here:

likelihood = pm.Normal("obs", mu=mu, sigma=sigma, observed=observed_data)

bayesian_padawan · November 16, 2024, 8:46pm

I now thought a while about it and still do not get it.

In my understanding, to get a posterior predictive distribution by sampling, the process should go as follows:

First I draw a value of my estimated parameter from the posterior, i.e. in this case some mu.
I set this mu as the parameter of my likelihood, which in this case is the mean of a Normal. The sigma of this normal is fixed to 2 in my example.
I then sample a value of this Normal with the drawn mu value and sigma =2.
The result of this process is a new sampling value ^x for my PPD.
I repeat this process many times and stack up the different ^x values to a histogram (or finally a kde).

This histogram or kde is my first posterior predictive distribution.

When I then repeat this to get 100 PPDs and plot these kde’s, I should get the result of plot_ppc for the posterior_predictive.

Nowhere in this process I can see that the observed values play a role. They just contributed to the sampling of the posterior, which has happened before.

What do I misunderstand?

cluhmann · November 16, 2024, 9:12pm

That sounds correct to me.

cluhmann · November 16, 2024, 9:37pm

Let me be slightly more careful here. In your example, there are several notions of “the observed values”. You have observed_data and you have likelihood = pm.Normal("obs"... You then connect these, telling the model that the latter should be related to the former (using the observed kwarg). When you perform posterior predictive sampling, you request that your likelihood/obs parameter be resampled, asking what values could have been observed conditional on your posterior. For each posterior draw you get one vector of credible values of likelihood/obs.

Perhaps, a different example might help. Let’s use the same data generating process, but only generate a small number of observations:

true_mean_of_generating_process = 5
true_sigma_of_generating_process = 2

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=10)

Now the PPC procedure generates something like this:

These individual curves look nothing like normal distributions, but that’s because each one is a KDE based on 10 data points (each of which is a draw from an actual normal distribution).

Does that help clarify at all?

bayesian_padawan · November 17, 2024, 8:10am

Ah, I think I came now a step further. To make it more clear to me, I switched to the prior predictive, because this does not even have any free parameter, both mu and sigma are fixed, and nevertheless the plot_ppc function puts out very different heights of the kdes.

Also, what your explanations made clear to me is the effect of the sample size of the observations on the predictive samples. I guess the process is like follows:

Let’s assume that I observed only 10 data points and passed these to the likelihood.
Now I say idata.extend(pm.sample_prior_predictive(return_inferencedata=True, samples=33)).
This leads to 33 sets of “fake observed data”, each of size 10, because this was the size of my observed data.
As though the generating process now does not have any free parameters, it of course still has the randomness of a Normal.
Because I draw only 10 times each, I get 33 KDEs which look very different.

What I obviously still have to learn is to separate the terminology like “draws”, “samples” asf. For example, when I say pm.sample_prior_predictive(return_inferencedata=True, **samples**=33), the result in the prior_predictive group will be 33 “draws”, each of the size of the observed data that I passed to the likelihood. And when I see it correctly, I cannot change this size by any parameter of the sample_prior_predictive function, it is always determined by the size of the observed_data of the likelihood.

I simulated the plot_ppc function (for the prior_predictive data) by using something like

import seaborn as sns
fig, ax = plt.subplots(1,1, figsize=std_figsize)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][1]), ax=ax)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][2]), ax=ax)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][3]), ax=ax)

which then shows this effect of varying kde shapes

Thanks for your help!

bayesian_padawan · November 17, 2024, 10:37am

After looking again into some singular draws from the prior predictive, I have to add a question.

I now recreated the model with 1000 observations. Prior is still a normal with mean 0 and sigma 2.

When I now to plot_ppc, the following is the result:

Let us pull out one of these draws, which contains 1000 samples:

Seeing, this, I realized that these cannot be 1000 independent samples of a Normal(0, 2).

It somehow looks like the sampler “got stuck” in the area of a mean = 5 Normal. I still don’t understand how this can happen.

bayesian_padawan · November 17, 2024, 11:53am

Aargh I understood. The predictive values for the prior predictive are not drawn from the prior (a Normal(0,2), but from the likelihood, in which a draw from the prior for mu is plugged in. So in rare cases, such a drawn mu can be quite far away from 0, and this is then the basis for the complete draw of 1000 samples.

Sorry, this needed some time again!

bayesian_padawan · November 18, 2024, 8:02pm

Hi @cluhmann ,

I would like to ask one additional question in the context of the above discussion.

As we have seen above, there are actuelly three sources of uncertainty when generating the plot for the posterior predictive check:

The estimated mu is drawn from the posterior.
With this mu, a predictive sample is drawn from the likelihood.
The number of these draws is limited to the number of observations that was given to the likelihood for sampling of the posterior.

What I now asked myself: 1) and 3) are both strongly dependend on the number of observations. When the number of observations is low (like 10 in your example), the spread of the posterior for mu should be much wider then, say, the spread in case of 10000 observations.

So, the uncertainty that comes from the low number of observations is actually already represented in the posterior.

Isn’t then the additional limitation to 10 samples for generating one kde in the ppc_plot again emphasizing this uncertainty which is already represented, leading to an exaggeration of this uncertainty? Or is this justified by some well grounded considerations?

Best regards
Matthias

cluhmann · November 18, 2024, 8:09pm

I would put it this way.

If you want to understand the uncertainty in the posterior of mu (and no other source of uncertainty), generating posterior predictive samples is not a good idea because the posterior predictive samples reflect the uncertainty in the posterior of mu and a bunch of other things (and there are more “other things” in more complex models).

If you are interested in asking your (now fitted) model what data sets could have been observed (other than the one you actually observed), then generating posterior predictive samples is what you want (though you may wish to avoid summarizing those posterior predictive samples with a KDE). This is not a question about mu per se, but the answer does (partially) reflect the posterior of mu.

These are both reasonable things to want, but they are different. So it really depends what are you looking for.

Genesis · November 18, 2024, 10:17pm

You are not forced to match the size of the observed data in your posterior predictive samples. The default behaviour is to have 1 posterior predictive sample per 1 posterior sample but the documentation explains how the API is more flexible: pymc.sample_posterior_predictive — PyMC dev documentation

Topic		Replies	Views
Posterior predictive checks with Gaussian Process	4	1308	May 26, 2022
Trouble understanding sample_ppc Questions	4	643	June 7, 2018
Interpretation of posterior predictive checks for a Gaussian Process v5 gaussian_process , modeling , arviz	11	1203	August 24, 2023
General question about PPC and predictions version agnostic modeling , arviz	17	551	April 5, 2024
How to use the posterior predictive distribution for checking a model from PyMC version agnostic arviz , model-checking	10	4261	March 14, 2023

Posterior predictive check question

Related topics