Given a trace of a fitted model, I would like to make predictions on new data using the posterior predictive function. It's a regression problem using a GAM. However, in all the examples I've seen [spline example] [classification example], sampling from the posterior predictive requires providing values for the target variable Y, which I don't have.
Is there a way to get predictions for arbitrary explanatory variable values without having to construct any target values? I just want to predict Y given X. Simply multiplying the posterior estimates of the parameters with the data doesn't seem right, or is that the way to do it?
You don't need targets for out-of-sample prediction. You just need to use pm.set_data to change the input data, then call pm.sample_posterior_predictive. The tutorial on pm.Data might be helpful.
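For instance, a minimal sketch of that workflow (model1, trace, X_new, and the "X" container name are placeholders for your own, not from the tutorial):

with model1:
    # replace the shared input data with the new explanatory values
    pm.set_data({"X": X_new})
    # Y is sampled forward from the model; no targets needed
    post_pred = pm.sample_posterior_predictive(trace)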
You're overanalyzing the Y requirement: it has nothing to do with needing targets at prediction time; it's simply an artifact of how the model graph was constructed.
In PyMC, posterior predictive sampling only requires the observed variable to be present in the graph; it does not use the observed Y values when predicting on fresh data. It may seem counterintuitive, but the examples still define Y because the likelihood node has to exist in the graph; when you swap in new X, that node is evaluated forward using the sampled parameters rather than conditioned on Y.
The proper procedure is (sketched in code after this list):
1. Define X as shared data (pm.Data).
2. Fit the GAM once.
3. Swap in the new X values with pm.set_data.
4. Call pm.sample_posterior_predictive.
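A minimal sketch of those four steps, with a simplified linear mean standing in for the GAM's spline terms (all data and variable names here are illustrative):

import numpy as np
import pymc as pm

# toy stand-in data for the real GAM inputs
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.3, size=100)

with pm.Model() as model1:
    X = pm.Data("X", X_train)                       # 1. shared input container
    beta = pm.Normal("beta", 0, 1, shape=3)
    sigma = pm.HalfNormal("sigma", 1)
    mu = pm.Deterministic("mu", pm.math.dot(X, beta))
    # shape=mu.shape (recent PyMC versions) ties the predictive shape to X,
    # so new X can have a different number of rows and no new Y is needed
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_train, shape=mu.shape)
    trace = pm.sample()                             # 2. fit once

with model1:
    pm.set_data({"X": rng.normal(size=(50, 3))})    # 3. swap in new X values
    post_pred = pm.sample_posterior_predictive(trace)  # 4. sample predictions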
No dummy Y array, no fake targets: PyMC will generate Y automatically from the posterior predictive distribution.
Indeed, manually multiplying posterior means by X is the wrong approach here: it ignores the GAM's nonlinear spline structure and collapses all the uncertainty to a point. The whole point of posterior predictive sampling is to propagate both parameter uncertainty and observation noise.
The short answer is that you can definitely predict Y given just X, which is precisely what the posterior predictive is for. Code that "requires" Y reflects how the model graph is built, not a logical necessity.
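To see that uncertainty is preserved, you can summarize the predictive draws with an interval rather than a single point, e.g. with ArviZ (assuming the likelihood variable is named "y_obs", as in the sketch above):

import arviz as az

# 94% highest-density interval per observation, computed over all
# chains and draws of the posterior predictive samples
hdi = az.hdi(post_pred.posterior_predictive["y_obs"])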
with model1:
    # 'mu' is the deterministic GAM mean; use the likelihood variable
    # instead if you want predictions that include observation noise
    post_pred = pm.sample_posterior_predictive(trace, var_names=["mu"])

# average over chains and draws for one point prediction per row
y_pred = post_pred.posterior_predictive["mu"].mean(dim=("chain", "draw")).to_dataframe()["mu"]
You can then make any new dataframe you want for X, as long as it has the same columns in the same order. (I use pandas DataFrames; I'm not sure whether the names have to be exactly the same, but it's probably best that they are.) In my case I am using a Latin hypercube sample called X_lhs:
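Something along these lines (a sketch; it assumes the shared container was registered under the name "X"):

with model1:
    # swap in the Latin hypercube design; pm.set_data expects array values
    pm.set_data({"X": X_lhs.to_numpy()})
    post_pred = pm.sample_posterior_predictive(trace, var_names=["mu"])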