import pymc as pm

# true_sigma_of_generating_process (= 2) and observed_data are defined
# earlier in the thread and are not repeated here
with pm.Model() as model:
    # prior with the true sigma
    mu = pm.Normal("mu", mu=0, sigma=true_sigma_of_generating_process)
    # to hand the fixed sigma over to the likelihood
    # we have to use a PyMC data type
    sigma = pm.Data("sigma", true_sigma_of_generating_process)
    # likelihood
    likelihood = pm.Normal("obs", mu=mu, sigma=sigma, observed=observed_data)
    # sampling
    idata = pm.sample(1000, tune=500, return_inferencedata=True, progressbar=False)
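A sketch of how such a posterior predictive plot is typically produced from this model (the exact plotting call is not shown above):

import arviz as az
import matplotlib.pyplot as plt

with model:
    # resample "obs" conditional on the posterior and attach it to idata
    idata.extend(pm.sample_posterior_predictive(idata))

# overlay KDEs of 100 posterior predictive draws against the observed data
az.plot_ppc(idata, group="posterior", num_pp_samples=100)
plt.show()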
What I do not understand about this plot is that the generated Normal distributions have different heights. I would have assumed that, since sigma is fixed, all generated posterior predictive Normals must have the same sigma and therefore the same height.
Obviously, sigma is varying here. Why is this the case?
Sigma is not varying. As you said, you have fixed it to be exactly true_sigma_of_generating_process = 2. However, you are not plotting obs (which is normal by definition). Instead, you are plotting posterior draws from obs. In fact, you’re not even plotting posterior draws from obs, you are plotting a KDE of posterior draws from obs. So what you are seeing is an approximation (KDE) of an approximation (draws from a normal distribution) of obs. If you want to see something closer to what you expect, you can just take the one free parameter from your model (mu) and pump it through the true data generating process (the process that is also reflected in your model):
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

mus = az.extract(idata, var_names="mu", num_samples=100)
x = np.linspace(-5, 15, 100)
for mu in mus.values:
    plt.plot(x, stats.norm.pdf(x, mu, true_sigma_of_generating_process), c="b")
plt.show()
Could you please explain to me what you mean by “obs”? I saw such a component in the posterior_predictive group of InferenceData, but I didn't really understand what is represented in it.
Alternatively, you can just crank up the amount of data you are modeling (i.e., the number of observations). This will make the various approximations “less approximate”:
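For instance (a sketch only, assuming the data can simply be regenerated with many more points; the true mean of 5 is a placeholder, not a value from the thread):

import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(42)
# same generating process, but 10,000 observations instead of a handful
many_observations = rng.normal(5, true_sigma_of_generating_process, size=10_000)

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=true_sigma_of_generating_process)
    pm.Normal("obs", mu=mu, sigma=true_sigma_of_generating_process,
              observed=many_observations)
    idata_big = pm.sample(1000, tune=500, progressbar=False)
    idata_big.extend(pm.sample_posterior_predictive(idata_big))

# with this many points per draw, the KDEs track the underlying normal much more closely
az.plot_ppc(idata_big, num_pp_samples=100)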
I have now thought a while about it and still do not get it.
In my understanding, to get a posterior predictive distribution by sampling, the process should go as follows:
First, I draw a value of my estimated parameter from the posterior, i.e. in this case some mu.
I set this mu as the parameter of my likelihood, which in this case is the mean of a Normal. The sigma of this Normal is fixed to 2 in my example.
I then sample a value from this Normal with the drawn mu and sigma = 2.
The result of this process is a new sampled value x̂ for my PPD.
I repeat this process many times and stack up the different x̂ values into a histogram (or, finally, a KDE).
This histogram or KDE is my first posterior predictive distribution.
When I then repeat this to get 100 PPDs and plot these KDEs, I should get the result of plot_ppc for the posterior_predictive (a rough sketch of this manual procedure is below).
Nowhere in this process can I see that the observed values play a role. They just contributed to the sampling of the posterior, which happened before.
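A rough sketch of the manual procedure described above (reusing idata and observed_data from the model; sigma is fixed to 2 as before):

import numpy as np
import arviz as az
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
posterior_mus = az.extract(idata, var_names="mu").values  # stacked posterior draws of mu

n_obs = len(observed_data)   # each predictive draw has the size of the observed data
fig, ax = plt.subplots()
for _ in range(100):         # 100 "fake data sets" -> 100 KDEs
    mu_draw = rng.choice(posterior_mus)         # draw one mu from the posterior
    x_hat = rng.normal(mu_draw, 2, size=n_obs)  # draw a whole data set from the likelihood
    az.plot_kde(x_hat, plot_kwargs={"color": "C0", "alpha": 0.2}, ax=ax)
plt.show()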
Let me be slightly more careful here. In your example, there are several notions of “the observed values”. You have observed_data and you have likelihood = pm.Normal("obs"... You then connect these, telling the model that the latter should be related to the former (using the observed kwarg). When you perform posterior predictive sampling, you request that your likelihood/obs parameter be resampled, asking what values could have been observed conditional on your posterior. For each posterior draw you get one vector of credible values of likelihood/obs.
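Concretely, after posterior predictive sampling the result can be inspected like this (a sketch; the exact dimension name may differ):

# one resampled "obs" vector per posterior draw; the trailing dimension
# has the same length as observed_data
pp_obs = idata.posterior_predictive["obs"]
print(pp_obs.dims)   # e.g. ('chain', 'draw', 'obs_dim_0')
print(pp_obs.shape)  # (n_chains, n_draws, len(observed_data))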
Perhaps a different example might help. Let’s use the same data generating process, but only generate a small number of observations:
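A sketch of such a small-data setup (the true mean of 5 is again only a placeholder):

import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(123)
few_observations = rng.normal(5, true_sigma_of_generating_process, size=10)  # only 10 points

with pm.Model():
    mu = pm.Normal("mu", mu=0, sigma=true_sigma_of_generating_process)
    pm.Normal("obs", mu=mu, sigma=true_sigma_of_generating_process,
              observed=few_observations)
    idata_small = pm.sample(1000, tune=500, progressbar=False)
    idata_small.extend(pm.sample_posterior_predictive(idata_small))

az.plot_ppc(idata_small, num_pp_samples=100)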
These individual curves look nothing like normal distributions, but that’s because each one is a KDE based on 10 data points (each of which is a draw from an actual normal distribution).
Ah, I think I have now come a step further. To make it clearer to myself, I switched to the prior predictive, because this does not even have any free parameter (both mu and sigma are fixed), and nevertheless the plot_ppc function puts out very different heights of the KDEs.
Also, what your explanations made clear to me is the effect of the sample size of the observations on the predictive samples. I guess the process goes as follows:
Let’s assume that I observed only 10 data points and passed these to the likelihood.
Now I say idata.extend(pm.sample_prior_predictive(return_inferencedata=True, samples=33)).
This leads to 33 sets of “fake observed data”, each of size 10, because this was the size of my observed data.
Even though the generating process now does not have any free parameters, it of course still has the randomness of a Normal.
Because I draw only 10 times each, I get 33 KDEs which look very different.
What I obviously still have to learn is to separate terminology like “draws”, “samples” and so forth. For example, when I say pm.sample_prior_predictive(return_inferencedata=True, samples=33), the result in the prior_predictive group will be 33 “draws”, each of the size of the observed data that I passed to the likelihood. And if I see it correctly, I cannot change this size by any parameter of the sample_prior_predictive function; it is always determined by the size of the observed_data of the likelihood (a quick shape check is sketched below).
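A quick way to check those shapes (a sketch, continuing from the call above; the exact dimension layout may differ):

prior_obs = idata.prior_predictive["obs"]
print(prior_obs.shape)  # expected: (1, 33, 10) -> (chain, draw, size of observed data)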
I simulated the plot_ppc function (for the prior_predictive data) by using something like
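A sketch of what such a manual re-plot could look like, one KDE per prior predictive draw:

import arviz as az
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
for draw in idata.prior_predictive["obs"].sel(chain=0):
    az.plot_kde(draw.values, plot_kwargs={"color": "C0", "alpha": 0.3}, ax=ax)
plt.show()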
Aargh, now I understand. The predictive values for the prior predictive are not drawn from the prior (a Normal(0, 2)), but from the likelihood, into which a draw from the prior for mu is plugged. So in rare cases such a drawn mu can be quite far away from 0, and this is then the basis for the complete draw of 1000 samples.
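In plain numpy terms, a single prior predictive draw then looks roughly like this (a sketch; the draw size of 10 matches the small example above):

import numpy as np

rng = np.random.default_rng(1)
mu_from_prior = rng.normal(0, 2)                   # draw mu from the prior Normal(0, 2)
fake_data = rng.normal(mu_from_prior, 2, size=10)  # draw a whole data set from the likelihood
# if mu_from_prior happens to be far from 0, the entire fake data set is shifted with it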
I would like to ask one additional question in the context of the above discussion.
As we have seen above, there are actually three sources of uncertainty when generating the plot for the posterior predictive check:
1) The estimated mu is drawn from the posterior.
2) With this mu, a predictive sample is drawn from the likelihood.
3) The number of these draws is limited to the number of observations that was given to the likelihood for sampling of the posterior.
What I now asked myself: 1) and 3) are both strongly dependent on the number of observations. When the number of observations is low (like 10 in your example), the spread of the posterior for mu should be much wider than, say, the spread in the case of 10,000 observations.
So, the uncertainty that comes from the low number of observations is actually already represented in the posterior.
Isn’t the additional limitation to 10 samples for generating one KDE in plot_ppc then emphasizing once more an uncertainty that is already represented, leading to an exaggeration of this uncertainty? Or is this justified by some well-grounded considerations?
If you want to understand the uncertainty in the posterior of mu (and no other source of uncertainty), generating posterior predictive samples is not a good idea because the posterior predictive samples reflect the uncertainty in the posterior of mu and a bunch of other things (and there are more “other things” in more complex models).
If you are interested in asking your (now fitted) model what data sets could have been observed (other than the one you actually observed), then generating posterior predictive samples is what you want (though you may wish to avoid summarizing those posterior predictive samples with a KDE). This is not a question about mu per se, but the answer does (partially) reflect the posterior of mu.
These are both reasonable things to want, but they are different. So it really depends on what you are looking for.
You are not forced to match the size of the observed data in your posterior predictive samples. The default behaviour is to have one posterior predictive sample per posterior sample, but the documentation explains how the API is more flexible: pymc.sample_posterior_predictive — PyMC dev documentation