General question about PPC and predictions

JAB · April 4, 2024, 2:54pm

Hi @jessegrabowski thanks for jumping in. I am definitely interested to hear more on predictions with Bayesian models. It seems that often the discussion around Bayesian models is on their many strengths (e.g., thinking about the data generating process, prior assumptions, propagating uncertainty,…), but less on what to do with them when making predictions. On the other hand, something like xgboost is often discussed as a prediction powerhouse, especially in industry applications.

To answer your question, I guess I had thought about the mean of the PPC to represent the distribution of each sample mean (e.g., over ('chain', 'draw')) as opposed to the distribution of the mean. My background is engineering and I only came to statistics in the past few years during my search for models with more explanatory power, so this could just be a consequence of my lack of understanding of Bayesian models at a more fundamental level.

Taking the mean of the PPC should recover something close to the mu parameter used to define the data distribution, integrating out the Gaussian noise.

The first part makes sense to me since since it is directly related to our model assumptions:

    mu       = intercept + slope*xX
    y_hat    = pm.Normal("y_hat",mu=mu, sigma=sigma, ...

However, I think integrating out the Gaussian noise is maybe where I got lost. I assumed that the mean PPC would represent the mean of mu with noise sigma. Here is an additional figure from my first example in the linked notebook to maybe help clarify my confusion.

Both axes are the same–the one on the right is reduced y limits so we can see more clearly. The gray histogram is the data and the red line is the KDE of the data distribution. Each of the dark blue lines represent the the distribution for a single draw (not shown in legend). Baby blue is the distribution of all values over chain and draw (which matches what is shown by az.plot_ppc).

ppc_samples = az.extract(ppc, group="posterior_predictive", num_samples=num_samples)
ppc_samples["y_hat"].values.flatten()

Black is the distribution of each value of y_hat averaged over ('chain', 'draw')

ppc_samples["y_hat"].mean("sample").values

Orange is the distribution of sampled values centered at mean(mu) and mean(sigma). So if we compute the summary as summary = az.summary(idata), we get:

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
mu	9.992	0.03	9.932	10.045	0.0	0.0	7794.0	6157.0	1.0
sigma	0.977	0.022	0.936	1.017	0.0	0.0	8103.0	6080.0	1.0

Now we draw samples:

samples1 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"],summary.loc["sigma"]["mean"]), draws=10_000)

The green line represents samples with values:

samples2 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"],summary.loc["mu"]["sd"]), draws=10_000)

My point of confusion is that I expected the black line to be what the orange line is showing. The green make sense to me, although I am a bit confused on why green is so much more peaked than black.

Topic		Replies	Views
How to get a smoother mean posterior predictive plot over long tails? v5	6	514	March 9, 2024
How to get postierior predictive mean in az.plot_ppc version agnostic modeling , arviz	1	574	July 7, 2023
Posterior predictive checks with Gaussian Process	4	1370	May 26, 2022
Posterior predictive sampling with data variance Questions	10	2484	September 14, 2018
Posterior predictive check question	13	173	November 18, 2024

General question about PPC and predictions

Related topics