Out of sample predictions with missing data

Dear friends,

I am looking for a way to impute missing values when making out-of-sample predictions. I am working through the tutorial GLM-missing-values-in-covariates, and I have a question about Section 2.5.

When we call pm.sample_posterior_predictive in Section 2.5.2, it also randomly samples the observed entries of xk. Shouldn’t we keep the observed holdout entries fixed and sample only the unobserved entries?

For example, if we check an observed row from the training data:

ida['posterior']['xk'][1,:,1,:]

we get:

array([[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ]...

which are indeed the observed values in the second row of dfx_train. But if we check in Section 2.5.2:

ida_h['posterior_predictive']['xk'][1,:,0,:]

we get:

array([[-1.05857949e-01, 1.04430642e-01],
[-1.73465653e+00, -1.12492631e-01],
[ 8.09007561e-01, -6.40969850e-01],
[-1.40680610e+00, 1.00091439e+00]...

However, this row of the holdout data is observed. If I understand the rest of the code correctly, these random samples are then used to sample y_hat, so the observed holdout values are never used to estimate y_hat.

Is my observation correct? If so, is there another way to do missing value imputation on OOS data?

I am currently thinking of a workaround that uses three models in total, one training model and two predictor models:

  1. Create a training model to get posterior samples of the coefficients from the training data.
  2. For OOS predictions, create a predictor model to get posterior samples of the OOS variables with missing data.
  3. Take out the imputed variables and replace the random samples of the non-missing entries with the observed data.
  4. Use another model to predict your response variable with the posterior samples from Steps 1 and 3.
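
To make Step 3 concrete, here is a minimal NumPy sketch of the replacement. It is only an illustration under assumptions: x_holdout, xk_draws, and xk_fixed are hypothetical names, the draws are random stand-ins rather than real output of pm.sample_posterior_predictive, and I assume the draws have shape (chain, draw, row, feature) with NaN marking missing holdout entries.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical holdout covariates: 3 rows, 2 features, NaN marks a missing entry.
x_holdout = np.array([
    [-0.74, 0.33],
    [np.nan, 1.20],
    [0.50, np.nan],
])

# Stand-in for sampled xk values with shape (chain, draw, row, feature).
# In practice these would be extracted from the posterior predictive group.
xk_draws = rng.normal(size=(2, 4, 3, 2))

# Boolean mask of observed entries; it broadcasts over the chain and draw axes.
observed = ~np.isnan(x_holdout)

# Keep the observed value where one exists, otherwise keep the sampled value.
xk_fixed = np.where(observed, x_holdout, xk_draws)

print(xk_fixed[0, :, 0, :])  # first holdout row is constant across draws
```

With this, every draw of a fully observed row equals the observed values, and only the missing entries vary from draw to draw.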

First off, I’d recommend finding a tutorial that doesn’t lead with the following disclaimer!

"The theory and math may be incorrect, incorrectly notated, or incorrectly used."

If the goal is to do posterior predictive checks, then you want to resample all the y values based on the observed covariates x. It doesn’t matter how you’re doing the imputation.

If the goal is to do posterior predictive inference for new covariates, then you just want to sample over the posterior, then plug in those values of the parameters downstream to do inference on new outcomes from new covariates.
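
As a rough sketch of what "plug in those values of the parameters" can look like, assuming a simple linear regression with Gaussian noise (the tutorial's model may differ): alpha, beta, and sigma below are hypothetical stand-ins for posterior draws, not output from a fitted model, and x_new is made-up, fully observed new data.

```python
import numpy as np

rng = np.random.default_rng(1)

n_draws, n_features = 500, 2

# Stand-ins for posterior draws of the parameters (hypothetical values).
alpha = rng.normal(0.5, 0.1, size=n_draws)                        # intercept
beta = rng.normal([1.0, -2.0], 0.1, size=(n_draws, n_features))   # coefficients
sigma = np.abs(rng.normal(0.3, 0.05, size=n_draws))               # noise scale

# New, fully observed covariates: 3 rows, 2 features.
x_new = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Linear predictor for every (draw, row) pair.
mu = alpha[:, None] + beta @ x_new.T            # shape (n_draws, 3)

# One predictive draw of the outcome per posterior draw and row.
y_hat = rng.normal(mu, sigma[:, None])          # shape (n_draws, 3)

print(y_hat.mean(axis=0))  # posterior predictive mean for each new row
```

The covariates enter only as fixed inputs here; all the randomness in y_hat comes from the parameter draws and the observation noise.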

If you want to evaluate held-out behavior using a hold-out set (rather than, say, by cross-validation), you just take the held-out covariates as the ones to use for prediction. In this case, you can impute new values for the existing data while you’re at it; it won’t affect inference for the held-out cases.

Thanks for the response. I want to evaluate the model on the hold-out set, with the intention of making predictions when some covariates may be missing. Missing covariates need to be imputed, and observed covariates should be used as they are.
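
To illustrate what I mean, here is a small NumPy sketch, again under assumptions: all names and numbers are hypothetical, the imputation draws and parameter draws are independent random stand-ins (a real model would give joint draws), and the model is assumed linear. Observed entries stay fixed, missing entries vary per draw, and draw d of the coefficients is paired with draw d of the covariates.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 200

# Hypothetical holdout covariates; NaN marks a missing entry.
x_holdout = np.array([[1.0, np.nan], [np.nan, 2.0], [0.5, -0.5]])
observed = ~np.isnan(x_holdout)

# Stand-in imputation draws: one full covariate matrix per posterior draw.
x_imputed = rng.normal(size=(n_draws, *x_holdout.shape))

# Observed entries are held constant; only missing entries use the draws.
x_draws = np.where(observed, x_holdout, x_imputed)       # (n_draws, 3, 2)

# Stand-in posterior draws of the regression parameters.
alpha = rng.normal(0.0, 0.1, size=n_draws)
beta = rng.normal([1.0, 1.0], 0.1, size=(n_draws, 2))

# Pair draw d of the coefficients with draw d of the covariates.
mu = alpha[:, None] + np.einsum("drf,df->dr", x_draws, beta)   # (n_draws, 3)

# Spread of the linear predictor across draws, per holdout row.
spread = mu.std(axis=0)
print(spread)
```

In this toy setup the fully observed third row has a much smaller spread than the two rows with a missing entry, because the imputation uncertainty only propagates where values are actually missing.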