Posterior predictive distribution

To generate data from the posterior distribution (e.g., to run a posterior predictive check), I learnt that we should sample combinations of parameter values from the trace, and then draw random data from the distribution described by those parameters. It is tempting, but wrong, to sample the parameters from the marginal distributions, as the parameters are likely to be correlated (I guess it would be okay if the parameters were not correlated, but correct me if I am wrong).
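As a minimal sketch of the difference (hypothetical arrays standing in for the MCMC draws of a simple normal model with parameters mu and sigma):

import numpy as np

rng = np.random.default_rng(0)

# stand-ins for trace['mu'] and trace['sigma']: one value per MCMC draw
mu_draws = rng.normal(1.0, 0.1, size=4000)
sigma_draws = np.abs(rng.normal(2.0, 0.2, size=4000))

# correct: keep each (mu, sigma) pair together, preserving any posterior correlation
ppc_joint = rng.normal(mu_draws, sigma_draws)

# wrong (unless the parameters are uncorrelated): break the pairs by resampling one marginal independently
ppc_marginal = rng.normal(rng.permutation(mu_draws), sigma_draws)

With each row of the trace kept intact, the simulated data reflect the joint posterior; shuffling one marginal destroys the correlation between the parameters.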

Today I was skimming a paper by Kruschke, “Bayesian data analysis”, Wiley Interdisciplinary Reviews: Cognitive Science 1.5 (2010): 658–676, and I came across this paragraph in the section “Can the Effect be Replicated? A Natural Bayesian Answer”:

To generate data from the posterior, we start by randomly selecting a believable baseline value (from the upper left distribution in Figure 4), along with its believable treatment effect (from the pertinent treatment distribution in Figure 4), and its subject effect (not shown in Figure 4), and then randomly selecting RTs based on those believable parameter values. The posterior thereby implies what believable data should look like, incorporating our uncertainty regarding the underlying parameter values…

Since Figure 4 shows the marginals, I feel like the reader could interpret this paragraph as a suggestion to sample combinations of parameter values from the marginals. But I only skimmed the article, so I will read it more thoroughly later. In the meantime I got thinking and would like to ask you:

  1. Could the posterior predictive distribution be generated from a combination of parameter values taken from the marginals (for example by randomly sampling parameter values from the HDI, the highest density interval)?

  2. (similar to the question before) If I read a paper and I am given only the marginal distributions of the parameters (i.e., I do not have the trace), can I make use of these values to generate credible samples from the posterior?

My answer to question 1) would be “No, we should not generate a posterior predictive distribution from the marginals, unless the parameters are uncorrelated”. For question 2) my answer would be “If I am given only the marginals, I could generate only a posterior predictive distribution from a point estimate (e.g., the mean or median of the marginals). This would give me the maximum likelihood estimate”.

To me, it really depends on what you are trying to sample from, i.e., what exactly the target posterior predictive distribution is.
In particular, you have $y \sim \pi(\cdot)$, where $\pi(\cdot)$ can be factorized into conditionals via the product rule. Say you have a linear regression where $y \sim \text{Normal}(X\beta, \sigma)$: you do need the joint posterior distribution of $\beta, \sigma$ to generate accurate posterior predictive samples, but the hyperparameters of $\beta, \sigma$ are not needed, which means you can marginalize over them. In practice, this means that using the MCMC samples of $\beta, \sigma$ is sufficient.
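Spelled out for that regression (in my notation for the hyperparameters), the target is

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \beta, \sigma)\, p(\beta, \sigma \mid y)\, d\beta\, d\sigma, \qquad p(\beta, \sigma \mid y) = \int p(\beta, \sigma, \mu_0, \sigma_0 \mid y)\, d\mu_0\, d\sigma_0,$$

so the MCMC draws of $(\beta, \sigma)$ already integrate out the hyperparameters, and keeping each draw of $(\beta, \sigma)$ together preserves their joint posterior.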

So I would say no to both of your questions above, unless the posterior samples are mostly independent, as you noted. At the very least, you need the joint posterior of the lowest level of the hierarchy.

Thank you for your input! But I am not sure I fully understand the following part of your answer:

Would you be so kind as to elaborate?

For example, if you have

import pymc3 as pm

# X (a design matrix of shape (n, n_predictor)) and observations y assumed defined
with pm.Model() as model:
    mu0 = pm.Normal('mu0', 0, 10.)
    sd0 = pm.HalfNormal('sd0', 10.)
    beta = pm.Normal('beta', mu0, sd0, shape=n_predictor)
    ...  # prior for the observation noise, e.g. sd = pm.HalfNormal('sd', 10.)
    obs = pm.Normal('y', X.dot(beta), sd, observed=y)

you just need the joint posterior of beta and sd to get your posterior predictive distribution, as this effectively marginalises over mu0 and sd0.
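As a usage sketch (the exact function name depends on the PyMC3 version), the posterior predictive draws can then be generated directly from the trace of that model:

with model:
    trace = pm.sample(1000, tune=1000)
    # each predictive draw reuses one full row of the trace, i.e. a joint (beta, sd) sample
    ppc = pm.sample_posterior_predictive(trace)  # pm.sample_ppc in older PyMC3 releases

# ppc['y'] has shape (n_draws, len(y)): simulated datasets for the posterior predictive check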

I understand. Thank you.
And the mode of the marginals is not necessarily the MAP of the joint posterior, right? From what I see, the find_MAP function in PyMC3 gives values that are close to the mode of each marginal, but not identical (sometimes quite different, actually).
I have seen several examples (e.g., the hierarchical modelling example in the PyMC3 documentation) in which the means of the marginals are taken as reference values for the parameters in the regression model:


In the figure, the black dashed line is created by:

bp = varying_intercept_trace['a'].mean(axis=0)  # mean of each varying intercept across draws
mp = varying_intercept_trace['b'].mean()        # mean of the common slope across draws

I am then wondering whether the line created by taking the mean of each parameter is a reasonable summary of the model.

That’s right. For example, you can see this with copulas (same marginals, different global maximum): https://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/jupyter_notebooks/Gaussian_Copula.ipynb

It is reasonable if you only care about the expectation; after all, what it plots is the expected slope and intercept. Of course, if the posterior is highly correlated the result could be biased.
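To make that concrete, here is a small sketch (hypothetical indexing, assuming 'a' holds one intercept column per group and 'b' is a scalar slope per draw) contrasting the mean-parameter line with lines from individual joint posterior draws:

import numpy as np
import matplotlib.pyplot as plt

xvals = np.linspace(0., 1., 2)

# summary line: plug the posterior means into the regression equation
a_mean = varying_intercept_trace['a'].mean(axis=0)[0]   # first group only, for illustration
b_mean = varying_intercept_trace['b'].mean()
plt.plot(xvals, a_mean + b_mean * xvals, 'k--', label='mean parameters')

# uncertainty: overlay lines computed from thinned joint draws of (a, b)
for a_draw, b_draw in zip(varying_intercept_trace['a'][::100, 0],
                          varying_intercept_trace['b'][::100]):
    plt.plot(xvals, a_draw + b_draw * xvals, 'b-', alpha=0.1)
plt.legend()

Because the fitted line is linear in the parameters, the average of the draw-wise lines coincides with the line at the mean parameters, so the dashed line is an exact summary of the expected line; the spread of the individual draws is what the single summary line hides.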


And what a coincidence that a new post by twiecki is about copulas. But I do not like the cliffhanger at the end; I wanted to see the PyMC3 example :stuck_out_tongue:
