How to use the posterior predictive distribution for checking a model from PyMC

I am a little hesitant to ask, because this question is not of great scientific interest; it rather reflects a lack of knowledge on my part.

I am trying to understand how to use a posterior predictive check (PPC) after building a Bayesian model with the PyMC library. I have never done this before and I don’t quite understand how to proceed…

Concretely, I would like to generate a large number of samples from the posterior predictive distribution and compare them to the sample I used to build the model, using a simple criterion: the maximum value of the samples. For this I refer to Ben Lambert’s video (What is a posterior predictive check and why is it useful? - YouTube, around 8:15), where he presents the following graph, which expresses exactly what I would like to obtain:

plot of maximum values

So I built a simple model: I generated a sample of 20 values from a normal distribution with mean mu and standard deviation sigma, and these are the parameters I am trying to estimate (note that I chose the mu and sigma used to generate the sample, so the true values are known).

I then built the code by referring to various examples and included the sample_prior_predictive and sample_posterior_predictive calls as described in the “Samplers” section of the API — PyMC 5.1.2 documentation. The code runs on my computer and does not generate errors.

However, I cannot find how to concretely use the posterior predictive distribution from my program and generate the histogram of the maximum values of the samples drawn from it.

I am teaching myself Bayesian analysis and still have a hard time manipulating PyMC outputs, which I am not yet familiar with. If you could give me some pointers, I would be grateful.

The code used is:

import numpy as np
import pymc as pm
import arviz as az

np.random.seed(63123)
data = np.random.normal(loc = 600, scale = 30, size = 20)

with pm.Model() as model:
    
    # Priors for unknown model parameters
    mu = pm.Uniform('mu', lower = 0, upper = 2000)
    sigma = pm.Uniform('sigma', lower = 0, upper = 100)

    # Likelihood of observations
    lkd = pm.Normal('likelihood', mu = mu, sigma = sigma, observed = data)
    
    # Expected value of outcome (WHAT IS THE USE ???)
    weights = pm.Normal('weights', mu = mu, sigma = sigma)
    
    # Inference data
    idata = pm.sample()
    
    # Predictives
    pm.sample_prior_predictive(samples=500, var_names=None, random_seed=None, 
                               return_inferencedata=True, 
                               idata_kwargs=None, compile_kwargs=None)    
    
    pm.sample_posterior_predictive(idata, 
                                   var_names=None, sample_dims=None, random_seed=None, 
                                   progressbar=True, return_inferencedata=True, 
                                   extend_inferencedata=True, predictions=True, 
                                   idata_kwargs=None, compile_kwargs=None)    

If you grab the posterior predictive samples like this:

ppc = pm.sample_posterior_predictive(idata)

Then you should be able to get the max of each simulated data set like this:

az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"]

ppc.posterior_predictive.max(dim=["likelihood_dim_2"]) takes the max of each simulated data set (it reduces over the observation dimension, so you get one maximum per draw). az.extract() flattens the chain and draw dimensions down into a single one-dimensional array. The ["likelihood"] grabs the variable of interest (i.e., the observed variable in your original model). If you prefer to work with a NumPy array, you can take the values:

az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

If you pass extend_inferencedata=True to pm.sample_posterior_predictive(), then you should be able to do the same with the idata object:

az.extract(idata.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"]
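From there, reproducing the plot of maxima you described is just a histogram plus a reference line. A minimal sketch, assuming matplotlib is available, that data is the observed sample from your model, and that the dim name likelihood_dim_2 matches your idata:

import matplotlib.pyplot as plt

# max of each simulated data set, flattened across chains and draws
ppc_max = az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

# histogram of posterior predictive maxima vs. the observed maximum
plt.hist(ppc_max, bins=30, alpha=0.7, label="posterior predictive maxima")
plt.axvline(data.max(), color="red", label="observed maximum")
plt.xlabel("maximum of sample")
plt.legend()
plt.show()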

There are some terminology issues involved here. “Posterior predictive check” can mean both “comparison of the posterior predictive distribution (and pushforward quantities) to observed data” in general, and the most common specific type of such check: comparing raw posterior predictive samples to the observed data. The first, broader meaning is also often called “model checking” or “model criticism”. All the plots in Example gallery — ArviZ dev documentation are about model checking, but only the middle row uses plot_ppc. As you have probably guessed by now, the plot_ppc function implements the second, narrower meaning: it compares raw samples from the posterior predictive with the observed data.

AFAIK it isn’t possible yet to get the two colors from the video automatically with ArviZ, but plot_bpv is what you want. Something like:

az.plot_bpv(idata, kind="t_stat", t_stat=np.max)

You will get a plot similar to the last example in the plot_bpv docstring page linked above: the distribution of the maxima of each sample will be plotted, and the integral of the orange part will be reported in the legend.


In reply to OriolAbril:
Sounds great, but the idata I get after running the program (as I wrote it) contains only:

  • posterior
  • predictions
  • sample_stats
  • observed_data
I guess this is why, when I run your line
az.plot_bpv(idata, kind="t_stat", t_stat=np.max)

I get an error saying:

`data` argument must have the group "posterior_predictive"

Surely because this group is not present in my idata.
I probably missed something in my code, didn’t I?

You need to use predictions=False. For convenience, ArviZ distinguishes between predictions made on the same data that was used to fit the model (useful for model checking; they get stored in the posterior_predictive group) and out-of-sample predictions (actual predictions; they get stored in the predictions group).

Side note: I now see you are not saving the results of prior predictive sampling anywhere. Potentially useful references: PyMC 4.0 with labeled coords and dims — Oriol unraveled, Prior and Posterior Predictive Checks — PyMC 5.1.2 documentation.
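As a rough sketch of what the sampling block could look like with everything stored in idata (keeping your model unchanged; predictions is left at its default of False, so the draws end up in the posterior_predictive group):

with model:
    idata = pm.sample()
    # store the prior predictive draws in the same InferenceData object
    idata.extend(pm.sample_prior_predictive(samples=500))
    # predictions=False (the default) stores the draws in the posterior_predictive group
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)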

Sorry for the confusion about predictions=True/False. I made the change and no longer get any errors. Nevertheless, the plot produced is definitely surprising…
I have obviously missed something else!

[attached plot]

Do the results change if you delete the weights line? I assume the real values of 600 and 30 are recovered? (You can check with arviz.summary or arviz.plot_posterior.)
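For example, a quick sketch of that check (using the variable names from your model):

az.summary(idata, var_names=["mu", "sigma"])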

Oh, I think I see: using np.max directly reduces over too many dimensions. Try t_stat=lambda x: np.max(x, axis=-1).
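So the full call would look something like this (reusing the np and idata names from your code):

az.plot_bpv(idata, kind="t_stat", t_stat=lambda x: np.max(x, axis=-1))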

Nice! I got the plot and a p-value of 0.26. Based on my previous habits with p-values (as a former frequentist statistician…), that looks very bad. Would that also be your conclusion?
By the way, the plot contains an isolated point that I don’t know how to interpret. Could you tell me a little more about it?
[attached plot]

This has little to no relation to frequentist p-values. Their only connection is that they are both probabilities, hence the p-value name.

As I said above, the value reported is the integral of the orange part of the histogram/probability density. You compute the maximum of the observed data (or any other summary statistic; the summary is generally called T, hence the name t_stat, which again is not very meaningful) and plot the result. This is the isolated point you see (the line where the color changes in the video). Then you compute the maximum of each posterior predictive sample and plot the histogram/KDE of all those values. This is the curve (the histogram in the video you shared).

The Bayesian p-value is the integral of the density below the dot, that is, the probability that the T of the posterior predictive is smaller than the T of the observed data (in your case, T = maximum). The expectation is for that value to be around 0.5, since ideally the samples from the posterior predictive can be treated as samples from the true data generating process. But all values are possible and plausible, even if not as probable as 0.5.
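If it helps to see where the number comes from, the same Bayesian p-value can be computed by hand from the posterior predictive draws. A sketch, assuming the dim name likelihood_dim_2 from earlier and that data holds the observed sample:

# T of the observed data and T of each posterior predictive sample
t_obs = data.max()
t_pred = az.extract(idata.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

# probability that the posterior predictive maximum is smaller than the observed maximum
bayes_p = (t_pred < t_obs).mean()
print(bayes_p)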

I generally prefer checking with raw samples, and use a T statistic later on if I am especially interested in that particular summary. I wrote LOO-PIT tutorial — Oriol unraveled, which should be useful both for plot_bpv (with kind="u_value") and for plot_loo_pit (used in the blog post).


Thanks to all: OriolAbril for the time taken to clear up my misunderstandings, and cluhmann for his helpful suggestions, which I have not yet taken the time to respond to.
I will dive into the documentation provided; I believe I am now better equipped to grasp how PyMC works.
Maybe we can close this topic as resolved?
