Sorry about the late answer; hopefully it will still be useful.
The main motivation for this is making the defaults more sensible and nudging users (especially basic and average ones) towards best practices. In the vast majority of cases, one should draw one posterior predictive sample per posterior sample. Generating fewer samples means losing information, and generating more does not increase the precision. There are reasons to do both, but they should be done carefully.
There have been some v4 versions during which generating multiple posterior predictive samples per posterior draw wasn't directly possible, but it is now possible again (in main only for now) using `sample_dims`:
```python
# add a new dimension of length 5, repeating each posterior draw 5 times
expanded_data = idata.posterior.expand_dims(pred_id=5)
with model:
    idata.extend(pymc.sample_posterior_predictive(
        expanded_data, sample_dims=["chain", "draw", "pred_id"]
    ))
```
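The resulting `posterior_predictive` group will then have `chain`, `draw` and `pred_id` dimensions, with 5 predictive draws per posterior sample.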
v3 had a `size` argument to generate multiple draws per posterior sample, but it was removed in v4 because it generated more problems than solutions, had been broken for a while, and nobody complained.
The use of `samples` was also inconsistent between versions and might not do what users expect. Take the case when `samples` is smaller than `n_chains * n_draws`. Should PyMC take 1 every m draws, selecting all chains? 1 every k samples, flattening the chain and draw dimensions? The first draws/chains until we have `samples` draws? Or maybe random subsets, with or without repetition? And what about applications that need the "pairing" of each posterior predictive draw with the posterior draw that was used to generate it? v3 used at least two of these approaches.
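To make the ambiguity concrete, here is a rough sketch (not actual v3 code) of what each interpretation could look like with xarray, assuming a posterior with 4 chains of 1000 draws and `samples=100`:

```python
import numpy as np

post = idata.posterior  # assumed: 4 chains x 1000 draws
samples = 100

# 1) one every m draws, keeping all chains (4 chains x 25 draws = 100)
sub1 = post.sel(draw=slice(None, None, 40))

# 2) one every k samples after flattening chain and draw (4000 // 100 = 40)
flat = post.stack(sample=("chain", "draw"))
sub2 = flat.isel(sample=slice(None, None, 40))

# 3) the first `samples` draws of the flattened posterior
sub3 = flat.isel(sample=slice(0, samples))

# 4) a random subset, here without repetition
rng = np.random.default_rng()
idx = rng.choice(flat.sizes["sample"], size=samples, replace=False)
sub4 = flat.isel(sample=idx)
```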
In my opinion, the headaches caused by all of this, by `samples`, `size` and `keep_size`, outweigh the cost of moving this step to the user. Now, to generate 1 posterior predictive draw every 5 posterior samples you can do:
```python
# store the subsetted InferenceData
thinned_idata = idata.sel(draw=slice(None, None, 5))
with model:
    idata.extend(pymc.sample_posterior_predictive(thinned_idata))

# or do it inline, without storing the subset
with model:
    idata.extend(pymc.sample_posterior_predictive(
        idata.sel(draw=slice(None, None, 5))
    ))
```
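Note that `slice(None, None, 5)` keeps every 5th draw while preserving the chain and draw dimensions, so the pairing between each posterior predictive draw and the posterior draw that generated it is kept intact.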
Or, to generate posterior predictive samples for a random subset of the posterior, you can do:
```python
import arviz as az

post_subset = az.extract(idata, num_samples=100)
with model:
    idata.extend(pymc.sample_posterior_predictive(
        post_subset, sample_dims=["sample"]
    ))
```
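`az.extract` stacks the `chain` and `draw` dimensions into a single `sample` dimension before subsampling, which is why `sample_dims=["sample"]` is needed here.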
We do force users to be a bit more explicit than before, but none of these workflows is prohibited. And hopefully this will mean the result is what users expect more often (or always).