General question about PPC and predictions

I was performing some standard model checking and noticed strange outputs when recreating plots for posterior predictive checks. My question closely relates to this one; however, I am wondering how this extends to prediction. I created a gist that provides all of the details.

Basically, I see that the distribution of values from a single draw aligns nicely with the distribution of my observed data. However, the distribution of the mean of the draws does not.

Using plot_ppc from arviz:
[figure: az.plot_ppc output (ppc1)]

Same thing but by hand:

The dark blue lines in mine are the same as the blue lines in the arviz version. The baby blue in mine is the same as the orange in the arviz version. The KDE of all the samples is not the same as the KDE of the mean of the samples. Similarly, red in mine is the same as black in arviz. (probably should have just matched colors…)

In the linked post @OriolAbril points out how this fits within a Bayesian context, but my question is how this extends to prediction. For instance, in the example I provided, we want to make some predictions on a test set. I completely understand the point that a main advantage of Bayesian methods is that we obtain a distribution instead of just a point estimate. However, in some cases we may need to use a point estimate as our prediction for use in real-world applications (and I get that the advantage is that we can choose what that point estimate is, e.g., MAP, median, …).

Does taking the mean of our predictions (e.g., over chain and draw) have the same impact that it does when we are performing posterior predictive checks? Given that predictions are basically posterior predictive checks on new observations, I don’t see how this would not be something to consider here.

Sorry if this is very basic, but I don’t see quite as much discussion around predictions.

Thanks.

Well, let’s assume that your data is distributed normally. When you draw from your posterior predictive, say, 1000 times, the first coordinate of each of your draws will be like a random draw from that distribution (weighted by the posterior of the parameters). So when you take the mean, it will essentially shrink to the mean of your distribution (with an sd of roughly sigma / sqrt(1000) if you collect the per-coordinate means into a distribution). You can see it via a sample script such as:


import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt

seed = 0
rng = np.random.default_rng(seed)
N = 100

data = rng.normal(0, 1, size=N)

with pm.Model() as model:

    mu = pm.Normal("mu", 0, 10, size=1)
    sd = pm.Exponential("sd", 1, size=1)

    obs = pm.Normal("obs", mu, sd, observed=data)

    idata = pm.sample(1000, chains=6, cores=6)

with model:
    pm.sample_posterior_predictive(idata, extend_inferencedata=True)

az.plot_ppc(idata)

# shape: (chain, draw, observation)
ppc_data = np.array(idata["posterior_predictive"].to_array()).squeeze()
# average over chain, then draw -> one mean per observation
mean_ppc_data = ppc_data.mean(axis=0).mean(axis=0)

fig, ax = plt.subplots(1, 3, figsize=(15, 5), sharex=True)

ax[0].hist(data, color="tab:blue", alpha=0.5, density=True)
ax[0].set_title("data")
ax[1].hist(mean_ppc_data, color="tab:blue", alpha=0.5, density=True)
ax[1].set_title("mean_ppc")
ax[2].hist(ppc_data[:, :, 0].flatten(), color="tab:blue", alpha=0.5, density=True)
ax[2].set_title("ppc first coordinate")


So essentially, I guess you are getting a (posterior) distribution for the mean of your posterior predictive distribution (whose sd would scale like sigma / sqrt(S), with S the total number of draws).
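As a quick numeric check tacked onto the script above (the exact numbers will wobble a bit from run to run):

# spread of all raw posterior predictive values vs. spread of the per-observation means
print(ppc_data.flatten().std())   # close to the data sd, i.e. ~1
print(mean_ppc_data.std())        # roughly 1 / sqrt(6 * 1000) ~ 0.01, i.e. much narrower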


Very interesting. This makes sense. I can see in the plot that I provided that the black distribution is definitely centered at the mean with a much tighter distribution.

I am wondering if this has implications for prediction. Does that mean that whenever I take the mean (say, across chains and draws), my estimate for that instance is strongly biased toward the posterior mean? This makes sense to some extent, but it seems like there could be too much biasing happening.

I bring this up because in my applications I was noticing that the distribution of the mean (i.e., .mean(("chain","draw"))) was much narrower than the distribution of each draw (as I showed in the second plot in the post). That makes me concerned that in the cases where I must push forward with a point estimate, I am getting biased results.

I don’t intend to be snarky, but weird things can happen when you throw away information. If you provide some information about what the “push forward” looks like, there may be alternatives to a) using point estimates or b) using the posterior mean. What are you using those point estimates to do?

No snark perceived. :pray:

I currently work in a sports setting, so many of the models I build are used for decision making. It’s a variety of applications, from stat lines to biomechanical models. In some cases, reporting distributions makes the most sense. However, in other cases, it is useful/necessary to have single values as a final estimate. For example, once a model is built, I might want to perform inference on cases in which we make specific observations: not quite a pm.do, but more like a pm.set_data({...}).

For example, consider the case where I have a multiple regression model:

features = ["a", "b"]  # "c", "d", ...
with pm.Model(coords={"features": features}) as lm:

    # Data as mutable
    X = pm.MutableData("X", data[features].values)

    # Params
    alpha = pm.Normal("alpha", 0, 3)
    beta = pm.Normal("beta", 0, 10, dims="features")
    sigma = pm.HalfNormal("sigma", 10)

    # y_hat
    mu = alpha + (X * beta).sum(axis=-1)
    y_hat = pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y, shape=X.shape[0])

In my application, I want to do two things:

  1. Obtain a model that can be used for inference on real observations (i.e., new ["a", "b"]) and decision making.
  2. Perform inference on unobserved or synthetic observations.

For example,

new_X = X_new_observations[["a", "b"]].values
with lm:
    pm.set_data({"X": new_X})
    y_hat_pred = pm.sample_posterior_predictive(...., predictions=True)

In the case that I have only a few observations of X, I see how comparing their distributions would be very useful. But if I am reporting my results to decision makers or stakeholders (that term :roll_eyes:), it may be completely infeasible to report distributions for every new observation of X, especially when there are many observations (say thousands).

For a more tangible example, consider this example notebook on baseball batting averages. For this example there are only 18 players. However, in real life, there are hundreds or thousands depending on the scope. If I were asked to provide a summary table, it would be unreasonable to provide a figure showing thousands of instances in a forest plot. It’s a little contrived but I hope the point is clear.
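One compromise I can imagine (just a sketch, building on the y_hat_pred object from the snippet above; assume it holds a "predictions" group with "y_hat") is to let az.summary build the per-observation table, so each row keeps an interval next to whatever point estimate gets reported:

import arviz as az

table = az.summary(
    y_hat_pred,
    group="predictions",
    var_names=["y_hat"],
    kind="stats",     # mean, sd, and HDI columns; skip the sampler diagnostics
    hdi_prob=0.9,
)
# `table` is a pandas DataFrame with one row per new observation, which scales
# to thousands of rows far better than a forest plot does.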

For the second case, I may want to perform an experiment where I vary values of interest:

new_X = np.array([(a,b) for a in np.linspace(0,1,10) for b in np.linspace(0,1,10)])

In this case, I want to understand how changing the values of each input impacts my output, y_hat. For only a few values, it may be feasible to look at the distribution for each combination of values. But this becomes challenging when varying over a large range of values for multiple inputs.

For example, I could take the mean for each input combination and generate a heat map. I can see ways in which preserving the distribution is possible, but also some cases where it becomes difficult. I see how partial dependence plots would allow us to preserve the uncertainty when examining the impact of a single variable. However, this gets more challenging when considering the joint impact of varying multiple inputs.
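A rough sketch of that heat-map idea, reusing the lm model from above (assuming idata holds its trace; the grid and plotting details are just illustrative):

import numpy as np
import pymc as pm
import matplotlib.pyplot as plt

a_vals = np.linspace(0, 1, 10)
b_vals = np.linspace(0, 1, 10)
new_X = np.array([(a, b) for a in a_vals for b in b_vals])

with lm:
    pm.set_data({"X": new_X})
    grid_pred = pm.sample_posterior_predictive(idata, predictions=True)

# Reduce each (a, b) combination to a point estimate for the heat map, but keep
# grid_pred around so any interesting cell can still be inspected in full.
point = grid_pred.predictions["y_hat"].mean(("chain", "draw")).values

plt.imshow(point.reshape(len(a_vals), len(b_vals)),
           origin="lower", extent=[0, 1, 0, 1], aspect="auto")
plt.xlabel("b")
plt.ylabel("a")
plt.colorbar(label="mean predicted y_hat")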

I know that I am throwing away a lot of information when I compress my estimates into a single value. However, in some cases, I am asked to provide a single point as a final estimate. I prefer the Bayesian framework since 1. uncertainty is inherent within the framework and not some hacky a posteriori thing, and 2. even if I must report a point estimate, I have the distribution for anything that requires further investigation. Nonetheless, I have cases where I may be asked to provide a point estimate, and I am unclear on how best to make that point estimate from my model.

I hope this clarifies things a bit more.


Following this discussion with interest, because I agree prediction isn’t really talked about enough.

Question: why is it surprising to you that the mean is more peaked than the data distribution? In the Gaussian case (your example notebook), the data distribution is a combination of the mean and noise. Taking the mean of the PPC should recover something close to the mu parameter used to define the data distribution, integrating out the Gaussian noise.


Hi @jessegrabowski thanks for jumping in. I am definitely interested to hear more on predictions with Bayesian models. It seems that often the discussion around Bayesian models is on their many strengths (e.g., thinking about the data generating process, prior assumptions, propagating uncertainty,…), but less on what to do with them when making predictions. On the other hand, something like xgboost is often discussed as a prediction powerhouse, especially in industry applications.

To answer your question, I guess I had thought of the mean of the PPC as representing the distribution of each sample mean (e.g., over ('chain', 'draw')) as opposed to the distribution of the mean. My background is engineering and I only came to statistics in the past few years during my search for models with more explanatory power, so this could just be a consequence of my lack of understanding of Bayesian models at a more fundamental level.

Taking the mean of the PPC should recover something close to the mu parameter used to define the data distribution, integrating out the Gaussian noise.

The first part makes sense to me since it is directly related to our model assumptions:

    mu    = intercept + slope * x
    y_hat = pm.Normal("y_hat", mu=mu, sigma=sigma, ...

However, I think integrating out the Gaussian noise is maybe where I got lost. I assumed that the mean PPC would represent the mean of mu with noise sigma. Here is an additional figure from my first example in the linked notebook to maybe help clarify my confusion.

Both axes are the same; the one on the right has reduced y limits so we can see more clearly. The gray histogram is the data and the red line is the KDE of the data distribution. Each of the dark blue lines represents the distribution for a single draw (not shown in legend). Baby blue is the distribution of all values over chain and draw (which matches what is shown by az.plot_ppc).

ppc_samples = az.extract(ppc, group="posterior_predictive", num_samples=num_samples)
ppc_samples["y_hat"].values.flatten()

Black is the distribution of each value of y_hat averaged over ('chain', 'draw')

ppc_samples["y_hat"].mean("sample").values

Orange is the distribution of values sampled from a normal centered at mean(mu) with sd mean(sigma). So if we compute the summary as summary = az.summary(idata), we get:

        mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu     9.992   0.03   9.932  10.045        0.0      0.0    7794.0    6157.0    1.0
sigma  0.977  0.022   0.936   1.017        0.0      0.0    8103.0    6080.0    1.0

Now we draw the samples for the orange line:

samples1 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"], summary.loc["sigma"]["mean"]), draws=10_000)

The green line represents samples with values:

samples2 = pm.draw(pm.Normal.dist(summary.loc["mu"]["mean"], summary.loc["mu"]["sd"]), draws=10_000)

My point of confusion is that I expected the black line to be what the orange line is showing. The green makes sense to me, although I am a bit confused about why green is so much more peaked than black.

I am not entirely sure if this is what you are after, but if your question is how to replicate the arviz posterior predictive mean manually, then it’s not trivial to do so exactly; approximately, though, this will get you there, based on your gist:

# Define histogram bins; eye-balling here
x_bins = np.linspace(6, 14, 9)
# And for later plotting convenience their midpoints
xs = (x_bins[:-1] + x_bins[1:])/2

# For each observation (you have 1000) build a density histogram of your ppc draws
pp_hists = []
for vals in ppc_samples["y_hat"]:
    vals = vals.values.flatten()
    pp_hist, _ = np.histogram(vals, bins=x_bins, density=True)
    pp_hists.append(pp_hist)

# Now average your histograms (across your 1000 observations) per bin
pp_mean_hist = np.array(pp_hists).mean(0)

pp_mean_hist should be the equivalent of arviz’s “posterior predictive mean”, which you can plot using lineplot (rather than as a KDE, since there is no more raw observational/simulation data to histogram or KDE; we’ve averaged it all out). Here is the plotting code from your gist with the one line changed:

# Define figure
fig, ax = plt.subplots(1, 1)
fig.set_size_inches(8, 5)

# Extract samples
num_samples = 50
ppc_samples = az.extract(ppc, group="posterior_predictive", num_samples=num_samples)

# Observed data
sns.histplot(Y, color="lightgrey", bins=21, stat="density", label="Observed data")
# Plot KDE of each of the samples
sns.kdeplot(ppc_samples["y_hat"].values, palette=num_samples * ['#003278'], alpha=.5, common_norm=False, legend=False)
# Plot KDE of all sampled values flattened
sns.kdeplot(ppc_samples["y_hat"].values.flatten(), color="#6BACE4", alpha=1, label="KDE all samples")
# THIS IS THE CHANGED LINE: plot the bin-averaged mean of the sample histograms
sns.lineplot(x=xs, y=pp_mean_hist, color='k', alpha=1, label="Mean of samples")
# KDE of observed data
sns.kdeplot(Y, color='#C0111F', alpha=1, label="KDE Observed")
ax.legend()

Which gives me this, with the black and red lines similar:

Something similar can be done on out-of-sample predictions: you’d still have multiple histograms per out-of-sample prediction, which you can still average per bin. I haven’t looked into arviz in detail, but I’d be surprised if it can’t do this for you.


Hi @Genesis , thanks for your reply. This is a nice solution to replicating the plot_ppc function from arviz.

My question gets more at the heart of why the distribution of

ppc["posterior_predictive"]["y_hat"].mean(("chain","draw")).values

is so much narrower than

ppc["posterior_predictive"]["y_hat"].values.flatten()

And to further clarify, I guess I was a bit confused about why the mean of the PPC is not the same as a new distribution centered at mean(mu) with noise mean(sigma) (which fits the data), and how this plays a role in prediction when using sample_posterior_predictive for new observations.

Thanks for your input on this!

I am not sure how to explain this without getting into the integrals and whatnot, but to give a hopefully indicative example: imagine your ppc consists of exactly 1 observation (so your “y_hat” has length 1) and you have 1 chain and, let’s say, 100 draws.

ppc["posterior_predictive"]["y_hat"].values.flatten() will have a 100 values and have some non-zero variance.

Meanwhile, ppc["posterior_predictive"]["y_hat"].mean(("chain","draw")).values will have exactly 1 value, which definitionally will have zero variance. So already you see that the latter is going to have a smaller variance than the former (and thus the distribution will be narrower)

ppc["posterior_predictive"]["y_hat"].values.flatten() contains two sources of variance - the variance across y_hat and the variance across sampler draws for a fixed y_hat. Meanwhile ppc["posterior_predictive"]["y_hat"].mean(("chain","draw")).values contains only the former source of variance.


Do you have access to Bayesian Data Analysis 3rd edition (Gelman’s book)? It might be instructive to look at a worked example for a conjugate model where you can see the functional form of the predictive distribution. You could check out section 1.3 for an explanation of the predictive distribution, then 2.5 for a conjugate Gaussian model.

The bottom line is that ppc["posterior_predictive"]["y_hat"].values.flatten() is a certain probability distribution, and ppc["posterior_predictive"]["y_hat"].mean(("chain","draw")).values is the expected value of that distribution, which happens to also be a random variable.
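For the normal-with-known-variance case that section 2.5 works through, writing \mu_n and \tau_n for the posterior mean and sd of \mu, the pieces are:

\mu \mid y \sim N(\mu_n, \tau_n^2), \qquad \tilde{y} \mid y \sim N(\mu_n, \sigma^2 + \tau_n^2), \qquad E[\tilde{y} \mid y] = \mu_n

The flattened draws approximate the middle distribution, whose spread is \sigma^2 + \tau_n^2, while averaging over chain and draw targets \mu_n, which is why the black curve hugs the posterior mean so tightly.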


This makes sense for sure. Thanks!

Yes, I will certainly look at those examples in BDA. I’ve read Rethinking and other more applied Bayes books, but BDA was next on my list since I am ready for a more technical discussion of Bayesian methods.

The bottom line is that ppc["posterior_predictive"]["y_hat"].values.flatten() is a certain probability distribution, and ppc["posterior_predictive"]["y_hat"].mean(("chain","draw")).values is the expected value of that distribution, which happens to also be a random variable.

This may be answered by the examples, but I guess my original question is what implications this has for predicting on a new observation. Does that suggest that if we take the mean of the predictive distribution for a new observation as our point estimate (if we must have a point), the mean of that distribution will be within the variance of the mean PPC distribution? E.g., would new samples lie under the black curve in my example rather than the orange one?

I am not sure one can draw many general implications…

  • Your orange line (which is merely a quadratic approximation of the arviz orange dotted line; the latter is not in general normal) describes the distribution of your predictions (the actual probabilistic predictions, not just the point estimates)
  • Your black line describes the distribution of the means of your predictions; and in the specific case where the means are your point estimates it therefore describes the distribution of your point estimates

One obvious implication is that if your stakeholder consumes the black curve, believing it to be the orange curve, they would be overconfident

Another implication is that when one wants to make a decision, it’s usually best to delay converting distributions into point estimates for as long as practical.

For example, imagine your predictions are about players and you are trying to decide whom to hire, and let’s say you run your model on two new players to decide which one is best. You get your ppc for Player 1 and your ppc for Player 2. You could take point estimates right now and then compare: is my mean ppc for Player 1 bigger than my mean ppc for Player 2? Or you could look at the difference of ppcs between Player 1 and Player 2 and take a point estimate of that (i.e., the mean difference). You’d probably arrive at the same conclusion, but the latter will give you more information about how much of an arbitrary decision that was.
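A sketch of those two routes with stand-in draws (real ones would come from sample_posterior_predictive and stay aligned per posterior draw):

import numpy as np

rng = np.random.default_rng(1)
ppc_player1 = rng.normal(0.26, 0.03, size=4000)   # stand-ins for PPC draws
ppc_player2 = rng.normal(0.25, 0.03, size=4000)

# Route 1: reduce to point estimates first, then compare -- a bare yes/no
print(ppc_player1.mean() > ppc_player2.mean())

# Route 2: compare first, then reduce -- same point estimate (the mean
# difference), plus a sense of how clear-cut the call actually is
diff = ppc_player1 - ppc_player2
print(diff.mean(), (diff > 0).mean())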


The general idea of carrying around appropriately propagated uncertainty until you absolutely must dispense with it is generally the most “Bayesian” thing you can do. When you must distill things down (e.g., when you must choose between discrete alternatives), the general recommendation is to contemplate a domain-appropriate loss function, and then you are into the realm of selecting a Bayesian estimator (e.g., to minimize expected loss). The mean is a choice of estimator, but it is a) not the only choice and b) has some interesting properties (as you have discovered). I think that sometimes people think taking means is a neutral, innocuous thing to do, but you are making a very specific choice and there are implications associated with that choice.

It might help to look at this blog post to see how a specific loss function is constructed and utilized (though the details may differ from your use-case).
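To make the estimator point concrete, here is a toy version of the expected-loss calculation; the skewed predictive and the loss functions are invented for illustration and are not taken from the blog post:

import numpy as np

rng = np.random.default_rng(2)
draws = rng.gamma(2.0, 2.0, size=4000)   # stand-in for a skewed predictive distribution

candidates = np.linspace(draws.min(), draws.max(), 500)

def expected_loss(loss):
    # average loss of each candidate point estimate against every draw
    return np.array([loss(draws, c).mean() for c in candidates])

losses = {
    "squared": lambda y, c: (y - c) ** 2,
    "absolute": lambda y, c: np.abs(y - c),
    "asymmetric (under-prediction 3x as costly)": lambda y, c: np.where(y > c, 3.0 * (y - c), c - y),
}

for name, loss in losses.items():
    best = candidates[expected_loss(loss).argmin()]
    print(f"{name}: {best:.2f}")

Squared error lands on the mean, absolute error on the median, and the asymmetric loss pushes the point estimate toward an upper quantile, so the “right” single number depends entirely on the loss.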


The posterior predictive distribution is still conditioned on X, the input data. If you don’t have any input data (say your model is just y ~ N(\mu, \sigma), so the posterior predictive distribution is just p(\hat y | y), integrating over \theta), then all of your predictions for new data would just be draws from the orange curve (in the arviz plot – I’m getting lost in all the curves and colors).

On the other hand, if you have data dependencies, those don’t get integrated out by the posterior predictive. Your predictive distribution will be p(\hat y | y, X). Dependency on \theta, the model parameters, is gone, but X is still there. So if you feed in new X, you’ll end up with a new (conditional) predictive distribution, which might or might not be inside the conditional distribution the training data implied.

As a simple example, consider forecasting with a deterministic trend model:

import pandas as pd
import pytensor.tensor as pt  # for the pt.concatenate / pt.ones_like calls below

# Define model
dates = pd.date_range(start='1900-01-01', end='2022-01-01', freq='AS')
df = pd.Series(np.nan, index=dates, name='year')


coords = {'params':['intercept', 'slope']}
mutable_coords = {'year':df.index}

with pm.Model(coords=coords) as deterministic_trend:
    t = pm.MutableData('t', df.values - df.values.min(), dims=['year'])
    beta = pm.Normal('beta', mu=[100, 0], sigma=[10, 1],  dims='params')

    X = pt.concatenate([pt.ones_like(t)[:, None], t[:, None]], axis=-1)
    mu = pm.Deterministic('mu', X @ beta, dims=['year'])
    sigma = pm.HalfNormal("sigma",10)
    y_hat = pm.Normal("y_hat",mu=mu, sigma=sigma, shape=t.shape, dims=['year'])
    prior = pm.sample_prior_predictive()

# Sample a draw to serve as observed data, then recover the parameters
test_draw = prior.prior.isel(chain=0, draw=0)
df.loc[:] = test_draw.y_hat.values

with pm.observe(deterministic_trend, {'y_hat':test_draw.y_hat.values}):
    idata = pm.sample()
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

# Forecast a value 100 years into the future
forecast_date = df.index.shift(100)[-1]

with deterministic_trend:
    pm.set_data({'t':np.atleast_1d(forecast_date.year - df.index.year.min())}, 
                coords={'year':[forecast_date]})
    idata_pred = pm.sample_posterior_predictive(idata, var_names=['y_hat'])

The posterior predictive distribution for the forecasted year, mean or otherwise, is well outside the posterior predictive distribution of the training data, because of the dependency on the year:

fig, ax = plt.subplots()
az.plot_ppc(idata, ax=ax, legend=False)
az.plot_posterior(idata_pred.posterior_predictive, ax=ax)
ax.set(title='', xlabel='');

[figure: az.plot_ppc of the training data with the forecast posterior overlaid]


Wow, thanks @jessegrabowski @cluhmann and @Genesis for your insightful comments, links, and examples. Great community!

the general idea of carrying around appropriately propagated uncertainty until you absolutely must dispense with it is generally the most “Bayesian” thing you can do

Or you could look at the difference of ppcs between Player 1 and Player 2 and take a point estimate of that (i.e. the mean difference). You’d probably arrive at the same conclusion but the latter will give you more information about how much of an arbitrary decision that was.

:point_up: @cluhmann & @Genesis These are great and definitely the approach I have used in the past. In some more recent problems, I encountered the issue of having to compare many distributions, and it became less obvious how to handle that without finding some point estimate to represent each observation. Perhaps this type of approach :point_down: might be a way to help with this issue.

It might help to look at this blog post to see how a specific loss function is constructed and utilized (though the details may differ from your use-case)

Great reference. Thanks for suggesting.

The mean is a choice of estimator, but it is a) not the only choice and b) has some interesting properties (as you have discovered). I think that sometimes people think taking means is a neutral, innocuous thing to do, but you are making a very specific choice and there are implications associated with that choice.

This is definitely what started to make me explore this question a bit more. I think the idea of using the mean seems so straightforward, but as with all things it is easy to overlook the assumptions and consequences of a particular decision (estimator in this case).

Thanks @jessegrabowski for the detailed explanation of the posterior predictive distribution, and for clarifying the conditioning in each case. That definitely helps. The forecasting model makes perfect sense. It would be highly suspect if the forecasted value were within the distribution of the observed data, so the predicted values falling outside of the PPC makes sense.

I follow your code quite well since it is very clear. However, as an aside, I tried running your code in a notebook and I encountered this error, which I have seen multiple times before:

NotImplementedError: Masked arrays or arrays with `nan` entries are not supported. Pass them directly to `observed` if you want to trigger auto-imputation

Any suggestions on how to get past that? I see that you defined df to have nan values. It occurs early on at

 t = pm.MutableData('t', df.values - df.values.min(), dims=['year'])

EDIT: I am using pymc version 5.12.0

Ah yes sorry, this should be df.index.year - df.index.year.min(). I had the years as data at one point while I was working on the example.
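For anyone copy-pasting, the corrected line in the example above becomes:

t = pm.MutableData('t', df.index.year - df.index.year.min(), dims=['year'])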
