Posterior predictive check question

Hi,

I have a really simple toy model which only estimates the mean of a Normal with a fixed sigma.

The data generating process is:

true_mean_of_generating_process = 5
true_sigma_of_generating_process = 2

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=100)

The model then is:

with pm.Model() as model:

    # prior with the true sigma
    mu = pm.Normal("mu", mu=0, sigma=true_sigma_of_generating_process)

    # to hand over the fixed sigma to the likelihood
    # we have to use a PyMC data type
    sigma = pm.Data("sigma", true_sigma_of_generating_process)

    # likelihood
    likelihood = pm.Normal("obs", mu=mu, sigma=sigma, observed=observed_data)

    # sampling
    idata = pm.sample(1000, tune=500, return_inferencedata=True, progressbar=False)

    # posterior predictive sampling (needed for the posterior PPC plot below)
    idata.extend(pm.sample_posterior_predictive(idata))

After sampling, I plot the posterior PPC with:

az.plot_ppc(idata, group="posterior", figsize=std_figsize, textsize=text_size, num_pp_samples=200)

I get the following plot:

What I do not understand about this plot is that the generated Normal distributions have different heights. I would have assumed that, since sigma is fixed, all generated posterior predictive Normals must have the same sigma and therefore the same height.

Obviously, sigma is varying here. Why is this the case?

Best regards,
Matthias

Sigma is not varying. As you said, you have fixed it to be exactly true_sigma_of_generating_process = 2. However, you are not plotting obs (which is normal by definition). Instead, you are plotting posterior draws from obs. In fact, you’re not even plotting posterior draws from obs, you are plotting a KDE of posterior draws from obs. So what you are seeing is an approximation (KDE) of an approximation (draws from a normal distribution) of obs. If you want to see something closer to what you expect, you can just take the one free parameter from your model (mu) and pump it through the true data generating process (the process that is also reflected in your model):

from scipy import stats
import matplotlib.pyplot as plt

mus = az.extract(idata, var_names="mu", num_samples=100)
x = np.linspace(-5, 15, 100)

for mu in mus:
    plt.plot(x, stats.norm.pdf(x, mu, true_sigma_of_generating_process), c="b")
plt.show()

I had actually done what you suggested before, and that is why this point became so obvious to me:


Could you please explain what you mean by “obs”? I saw such a component in the posterior_predictive group of InferenceData, but I didn’t actually understand what it represents.

Alternatively, you can just crank up the amount of data you are modeling (i.e., the number of observations). This will make the various approximations “less approximate”:

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=100_000)
...
az.plot_ppc(idata, group="posterior", num_pp_samples=100)

obs is what you define in your model here:

likelihood = pm.Normal("obs", mu=mu, sigma=sigma, observed=observed_data)

I have now thought about it for a while and still do not get it.

In my understanding, to get a posterior predictive distribution by sampling, the process should go as follows:

  • First I draw a value of my estimated parameter from the posterior, i.e. in this case some mu.
  • I set this mu as the parameter of my likelihood, which in this case is the mean of a Normal. The sigma of this normal is fixed to 2 in my example.
  • I then sample a value from this Normal with the drawn mu value and sigma = 2.
  • The result of this process is a new sample value x̂ for my PPD.
  • I repeat this process many times and stack up the different x̂ values into a histogram (or finally a KDE).

This histogram or kde is my first posterior predictive distribution.

When I then repeat this to get 100 PPDs and plot these KDEs, I should get the result of plot_ppc for the posterior_predictive.

Nowhere in this process can I see that the observed values play a role. They just contributed to the sampling of the posterior, which happened before.

What do I misunderstand?

That sounds correct to me.

Let me be slightly more careful here. In your example, there are several notions of “the observed values”. You have observed_data and you have likelihood = pm.Normal("obs"... You then connect these, telling the model that the latter should be related to the former (using the observed kwarg). When you perform posterior predictive sampling, you request that your likelihood/obs parameter be resampled, asking what values could have been observed conditional on your posterior. For each posterior draw you get one vector of credible values of likelihood/obs.
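
Conceptually, one posterior predictive draw in this model amounts to the following (a minimal sketch in plain numpy, using the observed_data and idata from above, not what PyMC does internally):

import numpy as np
import arviz as az

rng = np.random.default_rng()
mu_draws = az.extract(idata, var_names="mu").values        # all posterior draws of mu

# one re-imagined data set (same size as observed_data) per posterior draw of mu
fake_datasets = np.array([
    rng.normal(loc=m, scale=2, size=len(observed_data)) for m in mu_draws
])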

Perhaps, a different example might help. Let’s use the same data generating process, but only generate a small number of observations:

true_mean_of_generating_process = 5
true_sigma_of_generating_process = 2

observed_data = np.random.normal(loc=true_mean_of_generating_process,
                                 scale=true_sigma_of_generating_process,
                                 size=10)

Now the PPC procedure generates something like this:


These individual curves look nothing like normal distributions, but that’s because each one is a KDE based on 10 data points (each of which is a draw from an actual normal distribution).
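
To see just how much a 10-point KDE can vary, here is a small standalone illustration (plain numpy/scipy, independent of the model above):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng()
x = np.linspace(-5, 15, 200)
for _ in range(5):
    small_sample = rng.normal(loc=5, scale=2, size=10)   # 10 points from the same Normal(5, 2)
    plt.plot(x, stats.gaussian_kde(small_sample)(x))     # each KDE looks quite different
plt.show()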

Does that help clarify at all?

Ah, I think I have now come a step further. To make it clearer to myself, I switched to the prior predictive, because this does not even have any free parameters: both mu and sigma are fixed, and nevertheless the plot_ppc function produces KDEs with very different heights.

Also, what your explanations made clear to me is the effect of the sample size of the observations on the predictive samples. I guess the process is as follows:

  • Let’s assume that I observed only 10 data points and passed these to the likelihood.
  • Now I say idata.extend(pm.sample_prior_predictive(return_inferencedata=True, samples=33)).
  • This leads to 33 sets of “fake observed data”, each of size 10, because this was the size of my observed data.
  • Although the generating process now does not have any free parameters, it of course still has the randomness of a Normal.
  • Because I draw only 10 times each, I get 33 KDEs which look very different.

What I obviously still have to learn is to separate terminology like “draws”, “samples”, and so forth. For example, when I say pm.sample_prior_predictive(return_inferencedata=True, samples=33), the result in the prior_predictive group will be 33 “draws”, each of the size of the observed data that I passed to the likelihood. And if I see it correctly, I cannot change this size via any parameter of the sample_prior_predictive function; it is always determined by the size of the observed_data of the likelihood.
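
A quick way to confirm this (assuming the idata extended with the prior predictive above and the 10 observed points):

# 1 chain, 33 draws, each draw has the length of observed_data (10 here)
print(idata.prior_predictive["obs"].shape)   # -> (1, 33, 10)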

I simulated the plot_ppc function (for the prior_predictive data) by using something like

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

fig, ax = plt.subplots(1, 1, figsize=std_figsize)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][1]), ax=ax)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][2]), ax=ax)
sns.kdeplot(pd.DataFrame(idata.prior_predictive["obs"].values[0][3]), ax=ax)

which then shows this effect of varying KDE shapes.

Thanks for your help!

After looking again at some individual draws from the prior predictive, I have to add a question.

I have now recreated the model with 1000 observations. The prior is still a Normal with mean 0 and sigma 2.

When I now call plot_ppc, the following is the result:

Let us pull out one of these draws, which contains 1000 samples:

Seeing this, I realized that these cannot be 1000 independent samples from a Normal(0, 2).

It somehow looks like the sampler “got stuck” in the area of a mean = 5 Normal. I still don’t understand how this can happen.

Aargh, now I understand. The predictive values for the prior predictive are not drawn from the prior (a Normal(0, 2)), but from the likelihood, into which a draw from the prior for mu is plugged. So in rare cases, such a drawn mu can be quite far from 0, and it is then the basis for the complete draw of 1000 samples.
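
In code, this process would look roughly like the following (a minimal sketch in plain numpy, not what PyMC does internally):

import numpy as np

rng = np.random.default_rng()
mu_draw = rng.normal(loc=0, scale=2)      # one mu from the prior; occasionally far from 0
one_prior_predictive_draw = rng.normal(   # 1000 points from the likelihood,
    loc=mu_draw, scale=2, size=1000)      # all sharing that single mu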

Sorry, this needed some time again!


Hi @cluhmann ,

I would like to ask one additional question in the context of the above discussion.

As we have seen above, there are actually three sources of uncertainty when generating the plot for the posterior predictive check:

  1. The estimated mu is drawn from the posterior.
  2. With this mu, a predictive sample is drawn from the likelihood.
  3. The number of these draws is limited to the number of observations given to the likelihood when sampling the posterior.

What I now asked myself: 1) and 3) are both strongly dependent on the number of observations. When the number of observations is low (like 10 in your example), the spread of the posterior for mu should be much wider than, say, the spread in the case of 10000 observations.

So, the uncertainty that comes from the low number of observations is actually already represented in the posterior.

Doesn’t the additional limitation to 10 samples for generating one KDE in plot_ppc then emphasize this uncertainty, which is already represented, a second time, leading to an exaggeration of this uncertainty? Or is this justified by some well-grounded considerations?

Best regards
Matthias

I would put it this way.

If you want to understand the uncertainty in the posterior of mu (and no other source of uncertainty), generating posterior predictive samples is not a good idea because the posterior predictive samples reflect the uncertainty in the posterior of mu and a bunch of other things (and there are more “other things” in more complex models).

If you are interested in asking your (now fitted) model what data sets could have been observed (other than the one you actually observed), then generating posterior predictive samples is what you want (though you may wish to avoid summarizing those posterior predictive samples with a KDE). This is not a question about mu per se, but the answer does (partially) reflect the posterior of mu.
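
To make the two options concrete, a minimal sketch (assuming idata already contains posterior and posterior_predictive groups for the model above):

import arviz as az

az.plot_posterior(idata, var_names=["mu"])   # uncertainty about mu only
az.plot_ppc(idata, group="posterior")        # which data sets look credible under the fitted model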

These are both reasonable things to want, but they are different. So it really depends on what you are looking for.


You are not forced to match the size of the observed data in your posterior predictive samples. The default behaviour is to have one posterior predictive sample per posterior sample, but the documentation explains how the API is more flexible: pymc.sample_posterior_predictive — PyMC dev documentation
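
For example (a minimal sketch, assuming the model and idata from above), you can thin the posterior before predicting, so the number of posterior predictive draws no longer matches the number of posterior samples:

thinned = idata.sel(draw=slice(None, None, 5))   # keep every 5th posterior draw

with model:
    idata.extend(pm.sample_posterior_predictive(thinned))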
