Example for out-of-sample prediction with posterior predictive sampling

Hi all,
I am updating some code which I initially wrote during the pymc3/pymc4 transition. It seemed back then that things would get simpler based on a preliminary version of v4.0.0, but apparently plans changed.

In particular, I am confused by the apparent decision to prohibit passing a different number of “samples” in a future release (see here and here, though I admit I could not follow the latter discussion).

Here is my use case:

  • my model has multiple thetas, so it would be great to have multivariate posteriors
  • it is a linear model, the intercept and all the slopes are multiplied with pm.Data objects (for the former, this is just ones).
  • sampling works fine; I have a model and a trace
  • I then want to perform out-of-sample prediction: I have a subset of my observations which was excluded from model fitting; hence I adjust the data with pm.set_data()
  • I would then like to have n predictions for each of those novel observations; emphasis that n can be an arbitrary (high) number: it is independent of the number of original observations, the number of posterior predictions, or the number of chains.

Is there any recent example of how to do this? Or could someone provide one?

Before v4, I had the workaround of iterating over my novel observations, repeating them so that the data shape matched the original data, and then sampling repeatedly to get a sufficiently high number of predictions, always a multiple of the number of original observations. But that is incredibly cumbersome.

I am posting this here because I fear that, when discussing the changes to `sample_posterior_predictive`, you might not have had the use case of `set_data` out-of-sample prediction on your radar.
Thanks a lot!


(PS: my actual data is more involved; you can find all details here and here)


I think some of the reasoning was that, by allowing samples as an argument, people were often very confused about the shape of the returned posterior predictive object. For example, one might call sample_posterior_predictive() with samples=100, but then end up with an array of shape 100 (samples) x N (number of observations), plus some dimensions reflecting the number of parameters, and people would wonder why the number of samples seemed to be 100N rather than the requested 100. If you need an arbitrary number of predictive samples, it’s probably ideal to generate a corresponding number of samples during sampling itself.

The problem here is that generating multiple predictive samples from a single posterior sample creates artificial dependencies in your predictions. So it’s not a great idea to call pm.sample(1000) and then try to generate 5000 predictive samples. If you need 5000 predictive samples, you should probably generate 5000 samples from your posterior and then 1 predictive sample for each. If you need 3 predictive samples, you should probably still generate a reasonable number of posterior samples (e.g., 1000) to ensure decent sampling, toss a random 997 of them, and then call sample_posterior_predictive() with the 3 that remain.
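For instance, the “toss a random 997” step amounts to subsetting the trace before predictive sampling. A sketch of how that could look, using a fake trace built with ArviZ (the variable names here are made up for illustration):

```python
import arviz as az
import numpy as np

# Fake posterior standing in for a real trace: 4 chains x 1000 draws each
rng = np.random.default_rng(0)
idata = az.from_dict(
    posterior={"a": rng.normal(size=(4, 1000)), "b": rng.normal(size=(4, 1000))}
)

# Keep 3 random draws (per chain) and discard the rest
keep = np.sort(rng.choice(1000, size=3, replace=False))
small_idata = idata.isel(draw=keep)
```

You would then pass small_idata to sample_posterior_predictive() inside the model context.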

people would wonder why the number of samples seemed to be 100N rather than the requested 100

Yes, I remember also having figured that out by trying different shape combinations of test data :slight_smile:

True, and there used to be an adequate warning for it (but see below).

I do not understand this. Why should I need only 3 posterior samples and toss 997? This goes against my understanding of Bayesian statistics, and I would like to get the terminology straight with a hypothetical example.
Imagine I have a model of the form \theta=a+bx. I have 100 observations of (x,y) pairs, and x takes values between zero and one. This means that when x=0, the y values are close to a, and as x approaches one they approach a+b.
By sampling (10000 samples) I “fit” the parameters and retrieve the distributions of a and b in the form of a trace.

Let’s say I then want to extrapolate beyond my observations and estimate what would happen for observations at x=1.1. Note that this value was not in the data set.
I use pm.set_data to set the data to a single line with x=1.1. And then I would like to get, say, samples=5000 random draws of what y would be under the (approximately) true underlying distribution of a and b. Any random value from the model parameter posteriors is fine, because their likelihood follows the true distribution. So, for me, sample_posterior_predictive might well draw random values from the trace. If it drew randomly from only three of the trace values, that would not be right!

As a user, I expect exactly 5000 output values. Of course, (5000, N) is confusing. And even worse for comprehension is that sample_posterior_predictive regularly fails when the data shape changes.

This has made my head spin whenever I get to it, and I have been fixing it with dirty workarounds. I would therefore really appreciate it if someone could clear it up!
Thanks, folks!

edit: I simplified here: a and b are of course distributions; in the case of pm.Normal I would get values approximating their mean and standard deviation. But that does not change the picture: only the full trace leads me to the true posterior model parameter values. Or did I get that wrong?

The “needing 3 samples” I was referring to was the situation in which you wanted 3 credible values for each observation, y_i (i.e., 3 predictive samples). In such a case, it is not recommended that you call pm.sample(3) followed by a call to sample_posterior_predictive(). Instead, the recommendation is to call pm.sample() requesting a reasonable number of samples (e.g., pm.sample(1000)) and then use a subset of these samples to generate the desired predictive samples (each predictive sample derived from a different posterior sample). Needing 3 predictive samples is unrealistic, but it’s conceivable that you might need 10,000 samples to get decent sampling diagnostics, but not want anywhere near that many predictive samples. So cutting down the posterior before predictive sampling might make sense in such situations.


Okay, I think I understand; good, we mean the same thing, and just the “3” confused me.
But what, then, is the purpose of leaving the “cutting down the posterior” to the user (as exemplified here, I suppose)?
Is there any reason to move this process to the user? If there were a “samples” argument, well documented and perhaps defaulting to the number of samples in the original trace, why would you not have the random draw from the inference data/trace integrated into sample_posterior_predictive?

At least to me, this seems highly non-intuitive and would require more documentation. I saw that the GitHub discussion about it centered more around keep_size and the return data type; those seem like rather technical decisions.
But I now also see that there is already huge work in progress, so I’ll see how to get along.

From a user-standpoint, I think the idea was to take a process that some users found to be super confusing (the previous API) and make it somewhat less confusing so that those users now find it only moderately confusing (the new API). You can ask those who were involved in the PR for their take. The way it works now is that sample_posterior_predictive gives you an entire Y for each sample in your posterior. If you want something other than that, then you have to do the work to figure out how. But that’s always the case. There’s an infinite number of things one could want to do/check/predict and you can always derive quantities of interest once you have the posterior. But if you want the simplicity of sample_posterior_predictive(), it’s going to try and just do a reasonable default. At least that’s my take.

Sorry about the late answer, hopefully it will still be useful.

The main motivation for this is making the defaults more sensible and nudging users (especially basic and average ones) towards best practices. In the vast majority of cases, one should draw one posterior predictive sample per posterior sample. Generating fewer samples means losing information, and generating more does not increase the precision. There are reasons to do both, but they should be done carefully.

There have been some v4 versions during which generating multiple posterior predictive samples per posterior draw wasn’t directly possible. But it is now possible again (in main only for now) using sample_dims:

expanded_data = idata.posterior.expand_dims(pred_id=5)
with model:
    pm.sample_posterior_predictive(
        expanded_data, sample_dims=["chain", "draw", "pred_id"]
    )

v3 had a size argument to generate multiple draws per posterior sample but it was removed in v4 because it generated more problems than solutions, had been broken for a while and nobody complained.

The use of samples was also inconsistent between versions and might not do what users expect. Take the case when samples is smaller than n_chains * n_draws. Should pymc take 1 every m draws, selecting all chains? 1 every k samples (flattening the chain and draw dimensions)? The first draws/chains until we have samples draws? Or maybe random subsets, with or without repetition? And what about the applications that need the “pairing” of each posterior predictive draw with the posterior draw that was used to generate it? v3 used at least two of these approaches.
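As a toy illustration of these alternatives (plain NumPy indexing on made-up draw indices, standing in for a real trace):

```python
import numpy as np

# Pretend the flattened posterior has 20 draws and we want 4 of them
draws = np.arange(20)
rng = np.random.default_rng(0)

every_m = draws[::5]                                # 1 every m draws
first_n = draws[:4]                                 # the first draws
no_rep = rng.choice(draws, size=4, replace=False)   # random, without repetition
with_rep = rng.choice(draws, size=4, replace=True)  # random, with repetition
```

Each strategy yields 4 draws, but a different 4, and only some of them keep an unambiguous pairing with the original draw indices.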

In my opinion, the headaches caused by all of this, by samples, size and keep_size are larger than moving this step to the user. Now to generate 1 posterior predictive draw every 5 posterior samples you can do

# store the subsetted inferencedata
thinned_idata = idata.sel(draw=slice(None, None, 5))
with model:
    pm.sample_posterior_predictive(thinned_idata)

# do not store it
with model:
    pm.sample_posterior_predictive(idata.sel(draw=slice(None, None, 5)))

or to generate posterior predictive samples for a random subset of the posterior you can do:

post_subset = az.extract(idata, num_samples=100)
with model:
    pm.sample_posterior_predictive(post_subset, sample_dims=["sample"])

We do force users to be a bit more explicit than before, but nothing is prohibited. And hopefully the result will be what is expected more often (or always).


Thank you for taking the time to answer in such detail, @OriolAbril , this is all very useful for me to get going again with my models!

This is a core point I had to get my head around.

The original reason for me to draw less samples was a performance issue, which was due to a shape limitation when using LKJ priors. This has been solved in the meantime, prediction works in reasonable time, and my comparison models are happily sampling.

Technical note: the following line is a bit enigmatic to me:

thinned_idata = idata.sel(draw=slice(None, None, 5))

I am not so much into xarray syntax and have had little use for slice except for xarray indexing; I learned this once in a SciPy conference workshop but keep forgetting it due to lack of use.
In the case of inference data thinning, I found the third argument of slice to be the one to go for. But that was just trial and error. Might it be worth porting some basic xarray documentation to the PyMC or ArviZ API docs?

Thanks again!

Time and memory constraints are the main reasons to reduce the number of generated samples. Hopefully both factors will be easier and easier to overcome as we continue working on pymc.

For the slice syntax, this is a Python syntax limitation. We use slices quite often but are generally not aware we are doing it, because they usually appear inside square brackets, where the syntax ::5 can be used. Square brackets, however, do not accept keyword arguments, so xarray uses this sel method, where draw=::5 is not valid Python syntax; it is therefore necessary to use the slice builtin explicitly, with slice(None, None, 5). Each colon becomes a comma when calling slice:

:50  =  slice(None, 50)  # also slice(50)
-3:  =  slice(-3, None)
4:200:3  =  slice(4, 200, 3)
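These equivalences are easy to check in plain Python:

```python
x = list(range(300))

# each colon in the bracket syntax becomes a comma in the slice() call
assert x[:50] == x[slice(None, 50)] == x[slice(50)]
assert x[-3:] == x[slice(-3, None)]
assert x[4:200:3] == x[slice(4, 200, 3)]
```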

Would using the arguments as keywords instead of positional ones have made the example on the API page clearer? That would be slice(start=None, stop=None, step=5)