Dear all,
I have seen that there are examples (here, here and here) and discussions (here, here) about posterior predictive sampling. However, I now realize that I made mistakes when copying from some of these examples and would like to make sure that my humble understanding of the prediction process is correct.
Ultimately, my goal is to sample from a linear, hierarchical model for in-sample and out-of-sample prediction. It is important for my case that the predictions have the same variability as the data.
Here is a toy data set that I played around with: prediction_trouble.py (6.7 KB)
My questions are:
I. How do I do prediction properly, i.e. at which point in the modeling process (see below)?
II. Is it right to randomly sample from the predicted values, along the extra axis that represents the number of observations?
III. How does the variability in the prediction relate to data variability and model residual?
Thank you for taking the time!
Best,
Falk
I. Prediction Timing
So my first “aha!” moment was that, instead of using the trace in sample_posterior_predictive, I could also use MAP estimates. Yet this raised the question of when to extract the MAP (it depends on the model, obviously). The modeling process for me consists of (i) model construction, (ii) sampling, (iii) fixing values with theano.shared.set_value(). So there are four possible inputs to sample_posterior_predictive:
\quad a. the trace generated by pm.sample()
\quad b. find_MAP after model construction, so it should be the “prior MAP”
\quad c. find_MAP after sampling
\quad d. find_MAP after fixing.
And because the internal state of the model changes along the way, and the model itself is used for predictive sampling, there are three possible time points at which to do the prediction:
\quad 0. after model construction
\quad 1. post sampling
\quad 2. post fixing
So I’ve tried all of these combinations (see the attached file) and figured that (2c) gives the desired output. This means I use the MAP found before fixing (i.e. after sampling), but the model state from after having set the shared variables. Conceptually, I think this makes sense in my case. But is it correct*?
^* I am sceptical because, for example, (2d) did not work and throws:
ValueError: Input dimension mis-match. (input[0].shape[0] = 1001, input[1].shape[0] = 1).
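For reference, here is a minimal sketch of the workflow I mean, with a made-up regression standing in for the actual toy model in prediction_trouble.py (names like x_train, y_train, x_shared are just placeholders, and as far as I understand the API, a MAP dict can be passed in place of the trace when wrapped in a one-element list):

```python
import numpy as np
import pymc3 as pm
import theano

# placeholder data, standing in for the toy data set
x_train = np.random.normal(size=100)
y_train = 1.0 + 2.0 * x_train + np.random.normal(scale=0.5, size=100)

x_shared = theano.shared(x_train)                    # (i) model construction

with pm.Model() as model:
    intercept = pm.Normal("intercept", mu=0., sd=10.)
    slope = pm.Normal("slope", mu=0., sd=10.)
    epsilon = pm.HalfCauchy("epsilon", beta=5.)
    pm.Normal("obs", mu=intercept + slope * x_shared, sd=epsilon,
              observed=y_train)

    trace = pm.sample(1000, tune=1000)               # (ii) sampling, gives (a)
    map_post = pm.find_MAP()                         # (c) MAP after sampling

# (iii) fixing: swap in the predictor values to predict for
# (same length as x_train here, to stay clear of the shape mismatch above)
x_shared.set_value(np.linspace(-2., 2., len(x_train)))

with model:
    # (2c): predict after fixing, using the post-sampling MAP as a single point;
    # (2a) would pass `trace` here instead of `[map_post]`
    ppc = pm.sample_posterior_predictive([map_post], samples=500)

# ppc["obs"] has shape (500, 100): samples x observations (the extra axis from II)
```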
II. Capturing Variance
So far, so good. Now I noticed that the prediction always comes with an extra axis, whose length equals the number of observations in the data. This is independent of whether I use the trace or the MAP dict.
Naively copying some example, I averaged over that axis to get an output that is exactly as long as the n samples I specified in sample_posterior_predictive.
Here comes my second novice revelation: averaging here gives me averages, so I lose the variability that I am interested in.
I read up on this and understand that the extra dimension is necessary because the prediction depends on other aspects of the data, for example on the hierarchical subset in multilevel models. Good, so the PyMC design is perfectly reasonable, which I never doubted.
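With the sketch from under I, this looks roughly as follows (the numbers are of course made up):

```python
# one row per predictive sample, one column per observation
print(ppc["obs"].shape)              # (500, 100)

# averaging over the observation axis gives one value per sample ...
flat = ppc["obs"].mean(axis=1)
print(flat.shape)                    # (500,)

# ... but its spread is much narrower than that of the raw draws,
# which is exactly the variability I lose
print(flat.std(), ppc["obs"].std())
```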
Now, to get back to a prediction of size n without sacrificing variability, I decided to randomly choose one of the predictions along the extra axis. This can be done with:
# for each predictive sample (row), keep the value at one randomly
# chosen observation position (column)
lambda prediction: np.array([
    prediction[idx, sel]
    for idx, sel in enumerate(
        np.random.randint(0, prediction.shape[1], prediction.shape[0])
    )
])
In other words, for each prediction sample I randomly choose one of the observation positions of my data set. This yields the shape I desire, retains the variability from the model, and, given enough samples, my data set will be well reflected.
Does that make sense?
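As an aside, I think the same selection can also be written with numpy fancy indexing (again using the ppc dict from the sketch under I):

```python
import numpy as np

draws = ppc["obs"]                                      # shape (samples, n_obs)
cols = np.random.randint(0, draws.shape[1], draws.shape[0])
prediction = draws[np.arange(draws.shape[0]), cols]     # shape (samples,)
```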
III. Model Residual
Now, the linear model contains an epsilon, which is the model residual term. Nicely, in the toy example it matches the residual structure that was simulated, and it is also reflected in the variability of the posterior prediction.
Let’s assume this residual is independent of other predictors (as designed here for simplicity), i.e. homogeneous across the data set. In that case, I cannot tell whether the prediction variability arises from the data variability, or from the model result for epsilon.
How do they relate?
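To make the question concrete, here is the kind of check I have in mind, again as a sketch on top of the model and trace from the sketch under I:

```python
with model:
    # predictive draws from the full trace, so parameter uncertainty is included
    ppc_trace = pm.sample_posterior_predictive(trace, samples=1000)

# spread of the predictive draws at one fixed observation position
predictive_sd = ppc_trace["obs"][:, 0].std()

# posterior mean of the residual scale
epsilon_hat = trace["epsilon"].mean()

# in the homogeneous case I would expect roughly
#   predictive_sd^2 ~ epsilon^2 + variance of the predicted mean across the posterior,
# i.e. predictive_sd somewhat larger than epsilon_hat
print(predictive_sd, epsilon_hat)
```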
Obviously, in more complex cases such as scaling variance, I would have to adjust the model.