How do I predict on new, unseen real data using pm.sample_posterior_predictive?

Hello! I am trying to do a simple multivariate regression using Bayesian modeling. I am using real data from a CSV table. I am able to set up the model and sample from the posterior, but I am confused about how to actually generate new predictions from new Xi data.

My training data have one Y (output) and 10 Xi inputs (i = 1 to 10). All X predictors are standardized.

I specified the parameters:

dY : Y output data

dX1 : 1st X column data
dX2 : 2nd X column data

dX10 : 10th X column data

My model:

    with pm.Model() as model:
        a = pm.Normal('a', mu=dY.mean(), sd=10)
        B = pm.Normal('B', mu=0, sd=10, shape=10)
        sigma = pm.Uniform('sigma', lower=0, upper=10)
        mu = pm.Deterministic('mu', a + B[0] * dX1 + B[1] * dX2 + B[2] * dX3 + B[3] * dX4 +
                              B[4] * dX5 + B[5] * dX6 + B[6] * dX7 + B[7] * dX8 + B[8] * dX9 + B[9] * dX10)

        Y = pm.Normal('Y', mu=mu, sd=sigma, observed=dY)
        trace = pm.sample(1000, tune=1000)

When I use:

> Y_pred = pm.sample_posterior_predictive(trace, samples=1000, model=model)['Y']

I get Y_pred values generated by the model from the original Xi data.

What if I want to predict new Y values from new Xi values? How should I use pm.sample_posterior_predictive for that?

You can use set_data() to swap out the data you used for inference for something new (e.g., out-of-sample test data) before running sample_posterior_predictive. That will allow you to use your estimated model parameters to generate predictions about your outcome (i.e., Y in your case) in a new scenario (i.e., for new values of dX1, dX2, etc.).

This notebook may be of additional use to you.

[Edit: documentation links updated]

Thanks for this. I am working on preposterior analysis and was trying to figure out dealing with hypothetical data for value of information analysis using pymc3.

Is it always good practice to use pm.Data to make the model data-aware? I don't see it being done much, though.

Hi Colin,
Yep, it’s usually the first thing to try. There definitely are limitations to the Data container, but being able to use it makes everything easier.
You can take a look at this notebook for an introduction, and at this one for many examples.
Hope this helps :vulcan_salute:

Fantastic - thanks @AlexAndorra, et joyeux noël! :christmas_tree: :tada:

Ha ha thanks, you too – and thanks for the support on the podcast :wink:

My own use of the data container strongly depends on the model, the data, and the overall scenario. For simple models/data set-ups and/or when I am generating posterior predictions for quick diagnostic purposes, I often just plug samples into a “new” instantiation of the model. But as things get more complex, the data container starts to be much more convenient because you can re-use the model you already implemented.
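
For a linear model like the one in the original question, the prediction step can also be written out by hand, which may help show what a posterior predictive draw on new data amounts to. The following is only a NumPy sketch, not PyMC3 API: `a_draws`, `B_draws`, and `sigma_draws` are simulated stand-ins for what would normally come from `trace['a']`, `trace['B']`, and `trace['sigma']`, and `X_new` is a hypothetical array of new, already standardized predictor rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for posterior draws. With a real fit these would come from the
# trace (e.g. trace['a'], trace['B'], trace['sigma']); simulated here.
n_draws, n_predictors = 1000, 10
a_draws = rng.normal(5.0, 0.1, size=n_draws)                  # intercept, shape (n_draws,)
B_draws = rng.normal(0.0, 0.1, size=(n_draws, n_predictors))  # slopes, shape (n_draws, 10)
sigma_draws = np.abs(rng.normal(1.0, 0.05, size=n_draws))     # noise sd, shape (n_draws,)

# Hypothetical new predictor rows, already standardized: shape (n_new, 10).
X_new = rng.normal(size=(3, n_predictors))

# Linear predictor for every posterior draw and every new row: (n_draws, n_new).
mu_new = a_draws[:, None] + B_draws @ X_new.T

# One simulated outcome per draw and per new row, adding observation noise.
Y_pred_new = rng.normal(mu_new, sigma_draws[:, None])

print(Y_pred_new.shape)         # (n_draws, n_new)
print(Y_pred_new.mean(axis=0))  # predictive mean for each new row
```

This only works so directly because the model is a plain linear regression with a Normal likelihood; for anything more involved, re-using the model through the data container is likely the safer route.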

Thanks for your answer cluhmann!!
But when I ran your example in Spyder, I got this error:

AttributeError: module 'pymc3' has no attribute 'set_data'

What happened? Is it a PyMC3 version problem?
Thank you very much!!

Possible, though it seems unlikely. What version of pymc3 are you using? And can you provide a snippet of code where set_data() fails for you?

Hello cluhmann, sorry for the delay! I am using the same example that appears at this link: https://docs.pymc.io/api/model.html#pymc3.model.set_data

import pymc3 as pm
print(f"Running on PyMC3 v{pm.__version__}")

with pm.Model() as model:
    x = pm.Data('x', [1., 2., 3.])
    y = pm.Data('y', [1., 2., 3.])
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)
    trace = pm.sample(1000, tune=1000)
    
        
with model:
    pm.set_data({'x': [5., 6., 9.]})
    y_test = pm.sample_posterior_predictive(trace)
    y_test['obs'].mean(axis=0)

The output shows:

Running on PyMC3 v3.6

File "C:/Backup_Fernando/DeepLearning/Spyder/teste_set_data2.py", line 14, in <module>
    x = pm.Data('x', [1., 2., 3.])

AttributeError: module 'pymc3' has no attribute 'Data'

How can I fix this?
Thanks a lot for any help!!

Hm. 3.6 is now 2 years old, so it might actually be old enough to not include the pm.Data/pm.set_data() functionality. Unless you have some particular reason not to, I would update (or install a fresh copy of 3.10 in a new virtual environment).

Thank you very much Christian!!
I will update my environment!

Dear Christian, do you have any idea how to use pm.set_data() with a data frame?
Thank you!

A pandas dataframe? Something like this should work:

    pm.set_data( {'my_observed_variable': df['my_column'].to_numpy()} )
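
With several predictor columns, the dictionary can be built in one go. Here is a rough sketch, assuming one data container per column named 'dX1', 'dX2', and so on (those names, the column names, and the values are all made up for illustration). Since the predictors in the original question were standardized, the sketch also rescales the new rows with the training mean and standard deviation before handing them over.

```python
import numpy as np
import pandas as pd

# Hypothetical training and new data; column names are illustrative only.
train_df = pd.DataFrame({"X1": [1.0, 2.0, 3.0, 4.0], "X2": [10.0, 20.0, 30.0, 40.0]})
new_df = pd.DataFrame({"X1": [2.5, 5.0], "X2": [25.0, 50.0]})

# Standardize the new rows with the TRAINING mean and standard deviation,
# so they are on the same scale the model was fitted on.
train_mean = train_df.mean()
train_std = train_df.std()
new_std_df = (new_df - train_mean) / train_std

# One entry per data container, keyed by the name it was created with
# (the names "dX1" and "dX2" are assumptions here).
new_data = {f"d{col}": new_std_df[col].to_numpy() for col in new_std_df.columns}

# Inside the model context, this dict could then be passed on:
# with model:
#     pm.set_data(new_data)
```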