How to use the posterior predictive distribution for checking a model from PyMC

I am a little hesitant to ask, because this question is not of great scientific interest; it rather reflects a lack of knowledge on my part.

I am trying to understand how to use a posterior predictive check (PPC) after building a Bayesian model with the PyMC library. I have never done this before and I don’t quite understand how to proceed…

Concretely, I would like to generate a large number of samples from the posterior predictive distribution and compare them to the sample I used to build the model, using a simple criterion: the maximum value of the samples. For this I refer to Ben Lambert’s video (What is a posterior predictive check and why is it useful? - YouTube, around 8:15), where he presents the following graph, which expresses exactly what I would like to obtain:

plot of maximum values

So I built a simple model: I generated a sample of 20 values from a normal distribution with mean mu and standard deviation sigma, and these are the parameters I am trying to estimate (note that I chose the mu and sigma used to generate the sample, so the true values are known).

I then built the code by referring to various examples and included the sample_prior_predictive and sample_posterior_predictive calls as described in the “Samplers” section of the API — PyMC 5.1.2 documentation. The code runs on my computer and does not generate errors.

However, I cannot find how to concretely use the posterior predictive distribution from my program and generate the histogram of the maximum values of the samples drawn from it.

I am teaching myself Bayesian analysis and still have a hard time manipulating PyMC outputs, which I am not yet familiar with. If you could give me some pointers, I would be grateful.

The code used is:

import numpy as np
import pymc as pm
import arviz as az

np.random.seed(63123)
data = np.random.normal(loc = 600, scale = 30, size = 20)

with pm.Model() as model:
    
    # Priors for unknown model parameters
    mu = pm.Uniform('mu', lower = 0, upper = 2000)
    sigma = pm.Uniform('sigma', lower = 0, upper = 100)

    # Likelihood of observations
    lkd = pm.Normal('likelihood', mu = mu, sigma = sigma, observed = data)
    
    # Expected value of outcome (WHAT IS THE USE ???)
    weights = pm.Normal('weights', mu = mu, sigma = sigma)
    
    # Inference data
    idata = pm.sample()
    
    # Predictives
    pm.sample_prior_predictive(samples=500, var_names=None, random_seed=None, 
                               return_inferencedata=True, 
                               idata_kwargs=None, compile_kwargs=None)    
    
    pm.sample_posterior_predictive(idata, 
                                   var_names=None, sample_dims=None, random_seed=None, 
                                   progressbar=True, return_inferencedata=True, 
                                   extend_inferencedata=True, predictions=True, 
                                   idata_kwargs=None, compile_kwargs=None)    

If you grab the posterior predictive samples like this:

ppc = pm.sample_posterior_predictive(idata)

Then you should be able to get the max of each simulated data set like this:

az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"]

ppc.posterior_predictive.max(dim=["likelihood_dim_2"]) takes the max of each simulated data set (it reduces over the observation dimension, so you get one maximum per draw). az.extract() flattens the chain and draw dimensions down into a single one-dimensional array. The ["likelihood"] grabs the variable of interest (i.e., the observed variable in your original model). If you prefer to work with a NumPy array, you can take the values:

az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

If you pass extend_inferencedata=True to pm.sample_posterior_predictive(), then you should be able to do the same with the idata object:

az.extract(idata.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"]
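From there, reproducing the plot of maxima you described is just a histogram plus a reference line. A minimal sketch, assuming matplotlib is available, that data is the observed sample from your model, and that the dim name likelihood_dim_2 matches your idata:

import matplotlib.pyplot as plt

# max of each simulated data set, flattened across chains and draws
ppc_max = az.extract(ppc.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

# histogram of posterior predictive maxima vs. the observed maximum
plt.hist(ppc_max, bins=30, alpha=0.7, label="posterior predictive maxima")
plt.axvline(data.max(), color="red", label="observed maximum")
plt.xlabel("maximum of sample")
plt.legend()
plt.show()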

There are some terminology issues involved here. “Posterior predictive check” can mean both “comparison of the posterior predictive distribution (and pushforward quantities) to observed data” in general, and the most common specific type of such check: comparing raw posterior predictive samples to the observed data. The first, broader meaning is also often called “model checking” or “model criticism”. All the plots in Example gallery — ArviZ dev documentation are about model checking, but only the middle row uses plot_ppc. As you have probably guessed by now, the plot_ppc function implements the second, narrower meaning: it compares raw samples from the posterior predictive with the observed data.

AFAIK it isn’t possible yet to get the two colors from the video automatically with ArviZ, but plot_bpv is what you want. Something like:

az.plot_bpv(idata, kind="t_stat", t_stat=np.max)

You will get a plot similar to the last example in the plot_bpv docstring page linked above: the distribution of the maxima of each sample will be plotted, and the integral of the orange part will be reported in the legend.


In reply to OriolAbril:
Sounds great, but the idata I get after running the program (as I wrote it) contains only:

  • posterior
  • predictions
  • sample_stats
  • observed_data
I guess this is why, when I run your line
az.plot_bpv(idata, kind="t_stat", t_stat=np.max)

I get an error saying:

`data` argument must have the group "posterior_predictive"

Surely because this group is not present in my idata.
I probably missed something in my code, didn’t I?

You need to use predictions=False. For convenience, ArviZ distinguishes between predictions made on the same data that was used to fit the model (useful for model checking; they get stored in the posterior_predictive group) and out-of-sample predictions (actual predictions; they get stored in the predictions group).

Side note: I now see you are not saving the results of prior predictive sampling anywhere. Potentially useful references: PyMC 4.0 with labeled coords and dims — Oriol unraveled, Prior and Posterior Predictive Checks — PyMC 5.1.2 documentation.
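As a rough sketch of what the sampling block could look like with everything stored in idata (keeping your model unchanged; predictions is left at its default of False, so the draws end up in the posterior_predictive group):

with model:
    idata = pm.sample()
    # store the prior predictive draws in the same InferenceData object
    idata.extend(pm.sample_prior_predictive(samples=500))
    # predictions=False (the default) stores the draws in the posterior_predictive group
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)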

Sorry for the confusion about predictions=True/False. I made the change and no longer get any errors. Nevertheless, the plot produced is definitely surprising…
I have obviously missed something else!

[attached plot]

Do the results change if you delete the weights line? I assume the real values of 600 and 30 are recovered? (You can check with arviz.summary or arviz.plot_posterior.)
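For example, a quick sketch of that check (using the variable names from your model):

az.summary(idata, var_names=["mu", "sigma"])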

Oh, I think I see: using np.max directly reduces over too many dimensions. Try t_stat=lambda x: np.max(x, axis=-1).
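So the full call would look something like this (reusing the np and idata names from your code):

az.plot_bpv(idata, kind="t_stat", t_stat=lambda x: np.max(x, axis=-1))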

Nice! I got the plot and a p-value of 0.26. Based on my previous habits with p-values (as a former frequentist statistician…), that looks very bad. Would that also be your conclusion?
By the way, the plot contains an isolated point that I don’t know how to interpret. Could you tell me a little more about it?
[attached plot]

This has little to no relation to frequentist p-values. Their only connection is that they are both probabilities, hence the p-value name.

As I said above, the value reported is the integral of the orange part of the histogram/probability density. You compute the maximum of the observed data (or any other summary statistic; the summary is generally called T, hence the name t_stat, which again is not very meaningful) and plot the result. This is the isolated point you see (the line where the color changes in the video). Then you compute the maximum of each posterior predictive sample and plot the histogram/KDE of all those values. This is the curve (the histogram in the video you shared).

The Bayesian p-value is the integral of the density below the dot, that is, the probability that the T of the posterior predictive is smaller than the T of the observed data (in your case, T = maximum). The expectation is for that value to be around 0.5, since ideally the samples from the posterior predictive can be treated as samples from the true data generating process. But all values are possible and plausible, even if not as probable as 0.5.
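If it helps to see where the number comes from, the same Bayesian p-value can be computed by hand from the posterior predictive draws. A sketch, assuming the dim name likelihood_dim_2 from earlier and that data holds the observed sample:

# T of the observed data and T of each posterior predictive sample
t_obs = data.max()
t_pred = az.extract(idata.posterior_predictive.max(dim=["likelihood_dim_2"]))["likelihood"].values

# probability that the posterior predictive maximum is smaller than the observed maximum
bayes_p = (t_pred < t_obs).mean()
print(bayes_p)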

I generally prefer checking with raw samples, and use a T statistic later on if I am especially interested in that particular summary. I wrote LOO-PIT tutorial — Oriol unraveled, which should be useful both for plot_bpv (with kind="u_value") and for plot_loo_pit (used in the blog post).


Thanks to all: OriolAbril for the time taken to clear up my misunderstandings, and cluhmann for his helpful suggestions, which I have not yet taken the time to respond to.
I will dive into the documentation provided; I believe I am now better equipped to grasp how PyMC works.
Maybe we can close this topic as resolved?
