Posterior predictive sampling with data variance


#1

Dear all,

I have seen that there are examples (here, here and here) and discussions (here, here) about posterior predictive sampling. However, I now realized I did mistakes when copying some example and would like to ensure that my humble understanding of the prediction process is correct.

My goal ultimately is to sample from a linear, hierarchical model for in-sample and out-of-sample prediction. It is relevant for my case that the prediction has the same variability as the data.

Here is a toy data set that I played around with: prediction_trouble.py (6.7 KB)

My questions are:
I. How to do prediction properly, i.e. at what point in the modeling process (see below)?
II. Is it right to randomly sample from the predicted values, along the extra axis that represents the number of observations?
III. How does the variability in the prediction relate to data variability and model residual?

Thank you for taking the time!
Best,

Falk

I. Prediction Timing

So my first “aha!” moment was that, instead of using the “trace” in sample_posterior_predictive, I could also use MAP estimates. Yet this raised the question of when to extract the MAP (it depends on the model, obviously). The model process for me consists of (i) model construction, (ii) sampling, (iii) fixing values with theano.shared.set_value(). So there are four possible inputs to sample_posterior_predictive:
\quad b. find_MAP after model construction, so it should be the “prior-MAP”
\quad a. the trace generated by pm.sample()
\quad c. find_MAP after sampling
\quad d. find_MAP after fixing.

And, because the internal state of the model changes in the process, and the model is used for predictive sampling, there are time points at which to do prediction.
\quad 0. after model construction
\quad 1. post sampling
\quad 2. post fixing

So I’ve tried all of these (see file), and figured that (2c) is giving the desired output. This means I will use MAP from prior to fixing, but the model from after having set shared variables. Conceptually, I think this makes sense in my case. But is it correct*?

^* I am sceptical, because for example, (2d) did not work and throws

ValueError: Input dimension mis-match. (input[0].shape[0] = 1001, input[1].shape[0] = 1).

II. Capturing Variance

So far, so good. Now I noticed that the prediction always comes with an extra dimension axis, one that is of length equal to the number of observations in the data. This is independent of whether I use trace or MAP dict.
Naively copying some example, I averaged over that axis to get an output that is exactly as long as the n samples I specified in sample_posterior_predictive.
There comes my second novice revelation: averaging here gives me averages, thus I loose the variability that I am interested in.

I read up and understand that the extra dimension is necessary because the prediction depends on other data parameters, for example on the hierarchical subset in multilevel models. Good, so pymc design is perfectly reasonable, which I never doubted.
Now, to get back to a prediction of size n, without sacrificing variability, I decided to randomly choose one of the predictions along the extra axis. Can be done with

lambda prediction: np.array([ prediction[idx, sel] \
                              for idx, sel in enumerate(\
                                    np.random.randint(0, prediction.shape[1], prediction.shape[0]) \
                              ) \
                            ])

In other words, I randomly choose one of the observations of my data set for prediction on each prediction sample. This yields the shape I desire, retains the variability from the model, and given enough samples, my data set will be well reflected.

Does that make sense?

III. Model Residual

Now the linear model contains an epsilon, which is the model residual. Nicely, in the toy example, it matches the residual structure that was simulated, and it also reflects in the variability of the posterior prediction.
Let’s assume this residual is independent of other predictors (as designed here for simplicity), i.e. homogeneous across the data set. In that case, I cannot tell whether the prediction variability arises from the data variability, or from the model result for epsilon.

How do they relate?
Obviously, in more complex cases such as scaling variance, I would have to adjust the model.


#2

Regarding question 1: my understanding is that using find_MAP is not recommended. I remember reading Thomas Wiecki saying that find_MAP is only there for historical reasons. You do prediction with sample_ppc (or sample_posterior_predictive in the future). If you need the expected value of the prediction, take a mean of your generated samples. If you need to look at the variance/uncertainty of the prediction, take the variance or take quantiles from your samples, etc. This relates to your question 2.


#3

Thank you @simon_o for the comment!

Taking the mean is not useful: I’m generating temporal traces from Fourier Series components, so (temporal) averaging will lead to amplitude loss (as in interference). Hence, reconstructions from averages will be wrong. I actually want to see variable traces, not the variability. Thinking about it, the MAP value would also be only a particular value, so maybe it is not better than the mean.
However, the mean of a distribution can be rather meaningless, if the distribution is skewed (possible in my case). So the peak of the density curve would be much better. So I think find_MAP might still be useful, keeping in mind that it might fail on multimodal cases (not the case for me).

I can imagine @twiecki mentioned find_MAP deprecation in the context of NUTS initialization? that’s where I read it. It may be useful with other samplers or, as here, for prediction. If it is only there for historical reasons, wouldn’t it be good to throw a warning?

How about the relation to epsilon? If I take the MAP, is the variability I see only from the model, and does it also come from the data if I pass a trace? Quite confusing, still.


#4

That’s an interesting way to think of it… maybe it’s my finance background, but if I were offered to make a bet where the expected payoff is 5$ (i.e. the expected value), even though the most probable value (peak of the probability density) is -5$, I would take the bet. If instead the expected value were negative, but the peak of the density curve were positive, I would not. So I still think that even in the example you discuss, expected value is a more meaningful value than where that peak probability is.

In any case, I understand that for your specific problem you need a more complex representation of the predictions. I’m not clear, however, what is that more complex representation you’re looking for. A prediction cone from quantiles on the sample_ppc output is not enough? Visualizing the samples themselves is not precise enough? Is it that you need to calculate some measure of goodness of fit to tweak your model?

Regarding @twiecki’s comment about find_MAP, perhaps you’re right and I misinterpreted, so I’ll abstain from commenting further about this: I don’t want to be putting words in his mouth.


#5

In fact, both would be enough. Problem is, if they were all somehow attracted to or biased towards the mean (because samples are taken with either the covariate values, or the slope/intercept distributions being fixed to a point), then the samples will vary less than the model.

No, neither is the case. I am well able to produce any representation from my data. Goodness of fit is okay, except that the predictions always vary less than the actual data.
This might be due to shrinkage in my actual, hierarchical model. So to exclude that, I played with the test case in the file linked above. This is how the questions emerged.


#6

Ah, I see. You’re referring to the code you posted in prediction_trouble.py? Have a look at the following: prediction_troubleV2.py (1.5 KB) which is an abridged version of your code. I sample from the posterior of your model using sample_ppc (in my version of PyMC3 sample_posterior_predictive isn’t valid syntax yet), and show the standard deviation of my predictions, which is the same as the standard deviation of the observations. Are we talking about the same thing, or you meant something else?


#7

Yes, thanks @simon_o for checking. In fact, the standard deviation match was already confirmed in the original file, and I was glad that the test is positive.
This brings us back to my original questions:

  • How to reduce the extra dimension in the sampling output? (averaging is no option)
  • How about the use of find_MAP(), when to take it, comparison to using trace?
  • How does the prediction variability relate to model residual?

:wink: No offense, I really appreciate your time and help!


#8

Oh, sorry. My first reaction was to aggressively cut down on the big wall of code there, to focus on a very specific thing… I didn’t notice that variance comparison. So what do you mean by “except that the predictions always vary less than the actual data”?

Regarding find_MAP, my thoughts on it are to just forget it exists. To quote the PyMC3 documentation: “In summary, while PyMC3 provides the function find_MAP() , at this point mostly for historical reasons, this function is of little use in most scenarios. If you want a point estimate you should get it from the posterior.”

That extra dimension in the sampling output is the number of posterior samples for each input data point. So, the answer to your question depends on what is the kind of answer your are looking for. There isn’t one specific answer to that. If you care about confidence intervals, for example, you can try taking quantiles (or percentiles) from them.


#9

I referred to my hierarchical data, sorry if that was not clear. I want to exclude any reasons besides shrinkage.
For example, if I feed a point estimate (MAP or mean or median or quantile or any single point) to the sample_posterior_predictive, as opposed to the trace, I guessed that the sampling will be affected. If not, why?

find_MAP() is still used in the documentation ( here and here , as linked initially). Thus, if you write “of little use in most scenarios”, what is “little” and what “most”? I would like to understand, not just follow blindly.

As written in the original post, I need an array that has the extra dimension somehow collapsed, but without reducing variability. Randomly choosing one element along that axis, which I suggested, seems to work, but I wanted to make sure if I understood everything correctly.

And to understand that, I thought it’s good to know how the model residual affects the prediction.


#10

I did not follow the discussion, but here are some thoughts of your main post:

I.

I think you have some miss-conception about sampling. When you write down your model (i.e., put all the information under a with pm.Model() as ... context), your parameter space is fixed. Inference is then extracting information from this space, for example, MAP that represents the (hopefully global) maximum of the space, MCMC samples represents the IID samples from the space that you can plug into another function and compute expectation.

In that regard, there is no distinction between MAP before or after samples. In another word, there is no prior or posterior MAP - as sampling does not change the find_MAP result.

Now combine this with theano.shared variables in a linear model. In this case if you set different values to the theano.shared variable your MAP result would change, because you are in a different parameter space now. However, this has little to do with posterior prediction. Fixing to a specific value to get posterior prediction is equivalent to sampling from a conditional probability distribution - you conditioned on the posterior and the new inputs

So to recap: internal state of the model does not change before or after sampling, but it does change after you fixed the input X, y value post sampling (i.e., post fixing).

II.

Depending on what is your aim, you can of course compute the average along the axis of n-samples, which gives you the expectation of the posterior prediction, or you can work with the raw sample directly. In the later case, it means for each new observation (X_i', y_i'), you have samples from y_i' \sim \pi(X_i' \mid posterior)

III.

There is a way to distinguish the residual from the parameter and linear prediction. You can get the posterior of \beta and dot multiply it with the new X_i', which would gives you the linear prediction with uncertainty. And plug that in along with sd into a Normal distribution and generate samples (essentially what sample_posterior_prediction is doing) you get the posterior prediction with model residual. See also discussion here: Uncertainty of Model Predictions


#11

Thank you, @junpenglao!
I think I understand this now. Sorry for missing the discussion of uncertainty you linked.