General question about PPC and predictions

Hi @jessegrabowski thanks for jumping in. I am definitely interested to hear more on predictions with Bayesian models. It seems that often the discussion around Bayesian models is on their many strengths (e.g., thinking about the data generating process, prior assumptions, propagating uncertainty,…), but less on what to do with them when making predictions. On the other hand, something like xgboost is often discussed as a prediction powerhouse, especially in industry applications.

To answer your question, I guess I had thought about the mean of the PPC to represent the distribution of each sample mean (e.g., over ('chain', 'draw')) as opposed to the distribution of the mean. My background is engineering and I only came to statistics in the past few years during my search for models with more explanatory power, so this could just be a consequence of my lack of understanding of Bayesian models at a more fundamental level.

Taking the mean of the PPC should recover something close to the mu parameter used to define the data distribution, integrating out the Gaussian noise.

The first part makes sense to me since since it is directly related to our model assumptions:

    mu       = intercept + slope*xX
    y_hat    = pm.Normal("y_hat",mu=mu, sigma=sigma, ...

However, I think integrating out the Gaussian noise is maybe where I got lost. I assumed that the mean PPC would represent the mean of mu with noise sigma. Here is an additional figure from my first example in the linked notebook to maybe help clarify my confusion.

Both axes are the same–the one on the right is reduced y limits so we can see more clearly. The gray histogram is the data and the red line is the KDE of the data distribution. Each of the dark blue lines represent the the distribution for a single draw (not shown in legend). Baby blue is the distribution of all values over chain and draw (which matches what is shown by az.plot_ppc).

ppc_samples = az.extract(ppc, group="posterior_predictive", num_samples=num_samples)
ppc_samples["y_hat"].values.flatten()

Black is the distribution of each value of y_hat averaged over ('chain', 'draw')

ppc_samples["y_hat"].mean("sample").values

Orange is the distribution of sampled values centered at mean(mu) and mean(sigma). So if we compute the summary as summary = az.summary(idata), we get:

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mu 9.992 0.03 9.932 10.045 0.0 0.0 7794.0 6157.0 1.0
sigma 0.977 0.022 0.936 1.017 0.0 0.0 8103.0 6080.0 1.0

Now we draw samples:

samples1 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"],summary.loc["sigma"]["mean"]), draws=10_000)

The green line represents samples with values:

samples2 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"],summary.loc["mu"]["sd"]), draws=10_000)

My point of confusion is that I expected the black line to be what the orange line is showing. The green make sense to me, although I am a bit confused on why green is so much more peaked than black.