Dear friends,
I am looking for a way to impute missing values when making out-of-sample (OOS) predictions. I am working through the tutorial GLM-missing-values-in-covariates, and I have a question about Section 2.5.
When we call pm.sample_posterior_predictive in Section 2.5.2, it also randomly samples the observed entries of xk. Shouldn't we keep the observed holdout entries constant and randomly sample only the unobserved entries?
For example, if we check an observed sample from the training data:
ida['posterior']['xk'][1,:,1,:]
we get:
array([[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ],
[-0.74075617, 0.3284096 ]...
which are indeed the observed values in the second row of dfx_train. But if we check in Section 2.5.2:
ida_h['posterior_predictive']['xk'][1,:,0,:]
we get:
array([[-1.05857949e-01, 1.04430642e-01],
[-1.73465653e+00, -1.12492631e-01],
[ 8.09007561e-01, -6.40969850e-01],
[-1.40680610e+00, 1.00091439e+00]...
But this row of the holdout data is observed. If I understand the rest of the code correctly, these random samples are then used to sample y_hat, so the observed holdout values are not used to estimate y_hat.
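To make the check above systematic rather than eyeballing one row, here is a minimal sketch. It assumes the posterior predictive draws of xk have shape (chain, draw, row, feature) and that missing holdout entries are marked with NaN. The arrays below are synthetic stand-ins, not the tutorial's data, and the names xk_ppc and x_holdout are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for ida_h['posterior_predictive']['xk'].values,
# assumed shape: (chain, draw, row, feature).
n_chain, n_draw, n_row, n_feat = 2, 50, 3, 2
xk_ppc = rng.normal(size=(n_chain, n_draw, n_row, n_feat))

# Hypothetical holdout covariates; NaN marks a missing entry.
x_holdout = np.array([
    [0.5, -1.2],     # fully observed row
    [np.nan, 0.3],   # first feature missing
    [1.1, np.nan],   # second feature missing
])
observed = ~np.isnan(x_holdout)

# Spread of each (row, feature) entry across all chains and draws.
spread = xk_ppc.std(axis=(0, 1))

# True where an entry is observed in the holdout data but still
# varies from draw to draw, i.e. it was resampled instead of held fixed.
resampled_observed = observed & (spread > 1e-12)
print(resampled_observed)
```

If any entry of resampled_observed is True, the observed holdout values at those positions are being replaced by random draws.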
Is my observation correct? Is there another way to impute missing values in OOS data?
I am currently thinking of a workaround that uses three models in total, one training model and two predictor models:
1. Create a training model to get posterior samples of the coefficients from the training data.
2. For OOS predictions, create a predictor model to get posterior samples of the OOS variables with missing data.
3. Take out the imputed variables and replace the random samples at non-missing entries with the observed data.
4. Use another model to predict the response variable with the posterior samples from Step 1 and the corrected variables from Step 3.
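The replacement step (the third item above) could be sketched roughly as follows. This is only an illustration with plain NumPy and synthetic data: clamp_observed is a hypothetical helper, not a PyMC function, and it assumes the samples have shape (chain, draw, row, feature) with missing holdout entries marked as NaN.

```python
import numpy as np


def clamp_observed(xk_samples, x_observed):
    """Overwrite sampled values at observed positions with the observed data.

    xk_samples: array of shape (chain, draw, row, feature) with imputed draws.
    x_observed: array of shape (row, feature), NaN where the entry is missing.
    """
    mask = ~np.isnan(x_observed)
    # mask and x_observed broadcast over the leading chain and draw axes;
    # draws are kept only where the entry is actually missing.
    return np.where(mask, x_observed, xk_samples)


rng = np.random.default_rng(1)

# Synthetic stand-ins for the imputed draws and the holdout covariates.
xk_samples = rng.normal(size=(2, 4, 3, 2))
x_holdout = np.array([
    [0.5, -1.2],
    [np.nan, 0.3],
    [1.1, np.nan],
])

xk_fixed = clamp_observed(xk_samples, x_holdout)

# Every draw of the fully observed first row now equals its observed values,
# while the missing entries keep their sampled values.
print(xk_fixed[0, :, 0, :])
```

The resulting array can then be passed on as fixed data to the final prediction step, so that only the genuinely missing entries contribute imputation uncertainty to y_hat.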