# How do I predict on new, unseen real data using pm.sample_posterior_predictive?

Hello! I am trying to fit a simple multiple regression using Bayesian modeling. I am using real data from a CSV table. I am able to set up the model and sample from the posterior, but I am confused about how to actually generate predictions from new Xi data.

My training data have one Y (output) and 10 Xi inputs (i = 1 to 10). All X predictors are standardized.

I defined the following data variables:

dY : Y output data

dX1 : 1st X column data
dX2 : 2nd X column data

dX10 : 10th X column data

My model:

``````
import pymc3 as pm

# dY and dX1 ... dX10 are the data columns described above
with pm.Model() as model:
    a = pm.Normal('a', mu=dY.mean(), sd=10)
    B = pm.Normal('B', mu=0, sd=10, shape=10)
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    mu = pm.Deterministic('mu', a + B[0] * dX1 + B[1] * dX2 + B[2] * dX3 + B[3] * dX4 +
                          B[4] * dX5 + B[5] * dX6 + B[6] * dX7 + B[7] * dX8 + B[8] * dX9 + B[9] * dX10)

    Y = pm.Normal('Y', mu=mu, sd=sigma, observed=dY)
    trace = pm.sample(1000, tune=1000)
``````
When I use:

``````
Y_pred = pm.sample_posterior_predictive(trace, samples=1000, model=model)['Y']
``````

I get Y_pred values generated by the model from the original Xi data.

What if I want to predict new Y values from new Xi values? How should I use pm.sample_posterior_predictive?


You can use `set_data()` to swap out the data you used for inference for something new (e.g., out-of-sample test data) before running `sample_posterior_predictive`. That will allow you to use your estimated model parameters to generate predictions about your outcome (i.e., `Y` in your case) in a new scenario (i.e., for new values of `dX1`, `dX2`, etc.).

This notebook may be of additional use to you.

[Edit: documentation links updated]


Thanks for this. I am working on preposterior analysis and was trying to figure out dealing with hypothetical data for value of information analysis using pymc3.

Is it always good practice to use `pm.Data` to make the model data-aware? I don’t see it being done much, though.


Hi Colin,
Yep, it’s usually the first thing to try. There definitely are limitations to the `Data` container, but being able to use it makes everything easier.
You can take a look at this notebook for an introduction, and at this one for many examples.
Hope this helps


Fantastic - thanks @AlexAndorra, et joyeux noël!


Ha ha thanks, you too – and thanks for the support on the podcast


My own use of the data container strongly depends on the model, the data, and the overall scenario. For simple models/data set-ups and/or when I am generating posterior predictions for quick diagnostic purposes, I often just plug samples into a “new” instantiation of the model. But as things get more complex, the data container starts to be much more convenient because you can re-use the model you already implemented.
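
For a plain linear regression, one hand-rolled variant of the "plug samples in" idea can be sketched in NumPy alone: push each posterior draw through the regression equation for the new predictors and add observation noise. This is not PyMC3's own machinery, just the same arithmetic written out. The function name `predict_from_draws` and its argument names are hypothetical.

```python
import numpy as np


def predict_from_draws(a_draws, B_draws, sigma_draws, X_new, seed=None):
    """Posterior predictive draws for a linear regression, computed by hand.

    a_draws:     (n_draws,) posterior draws of the intercept
    B_draws:     (n_draws, n_predictors) posterior draws of the coefficients
    sigma_draws: (n_draws,) posterior draws of the noise scale
    X_new:       (n_new, n_predictors) new, already standardized predictors
    Returns an array of shape (n_draws, n_new).
    """
    rng = np.random.default_rng(seed)
    a_draws = np.asarray(a_draws, dtype=float)
    B_draws = np.asarray(B_draws, dtype=float)
    sigma_draws = np.asarray(sigma_draws, dtype=float)
    X_new = np.asarray(X_new, dtype=float)

    # Linear predictor for every (draw, new row) pair
    mu = a_draws[:, None] + B_draws @ X_new.T

    # Observation noise, using each draw's own sigma
    return rng.normal(loc=mu, scale=sigma_draws[:, None])
```

With a PyMC3 `MultiTrace` from the model in the question, the inputs would be `trace['a']`, `trace['B']`, and `trace['sigma']`.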


Thanks for your answer cluhmann!!
But when I tested your example in Spyder, I got this error:

AttributeError: module 'pymc3' has no attribute 'set_data'

What happened? Is it a PyMC3 version problem?
Thank you very much!!

Possible, though it seems unlikely. What version of pymc3 are you using? And can you provide a snippet of code where `set_data()` fails for you?

Hello cluhmann, sorry for the delay! I am using the same example that appears at https://docs.pymc.io/api/model.html#pymc3.model.set_data:

``````
import pymc3 as pm
print(f"Running on PyMC3 v{pm.__version__}")

with pm.Model() as model:
    x = pm.Data('x', [1., 2., 3.])
    y = pm.Data('y', [1., 2., 3.])
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', x * beta, 1, observed=y)
    trace = pm.sample(1000, tune=1000)

with model:
    pm.set_data({'x': [5., 6., 9.]})
    y_test = pm.sample_posterior_predictive(trace)
y_test['obs'].mean(axis=0)
``````

The output shows:

``````
Running on PyMC3 v3.6

File "C:/Backup_Fernando/DeepLearning/Spyder/teste_set_data2.py", line 14, in <module>
    x = pm.Data('x', [1., 2., 3.])

AttributeError: module 'pymc3' has no attribute 'Data'
``````

How can I fix this?
Thanks a lot for any help!!

Hm. 3.6 is now 2 years old, so it might actually be old enough to not include the `pm.Data`/`pm.set_data()` functionality. Unless you have some particular reason not to, I would update (or install a fresh copy of 3.10 in a new virtual environment).

Thank you very much Christian!!
I will update my environment!


Dear Christian, do you have any idea how to use `pm.set_data()` with a data frame?
Thank you!

A pandas dataframe? Something like this should work:

``````
with model:
    pm.set_data({'my_observed_variable': df['my_column'].to_numpy()})
``````
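
For a model with several predictor columns, the same idea might look like the sketch below, which builds the dictionary to hand to `pm.set_data()` from a DataFrame of new rows. Since the predictors in this thread were standardized, the new rows are standardized with the training means and standard deviations rather than their own. The helper name `build_set_data_payload` and the container names `"X"` and `"y"` are hypothetical; they assume the model registered its predictor matrix and outcome with `pm.Data` under those names.

```python
import numpy as np
import pandas as pd


def build_set_data_payload(df_new, predictor_cols, train_mean, train_std,
                           x_name="X", y_name="y"):
    """Build a dict for pm.set_data from a DataFrame of new predictor rows.

    train_mean / train_std: pandas Series indexed by column name, computed on
    the training data (e.g. df_train.mean() and df_train.std()).
    """
    # Standardize with the TRAINING statistics, not those of the new data
    X_new = (df_new[predictor_cols] - train_mean[predictor_cols]) / train_std[predictor_cols]

    return {
        x_name: X_new.to_numpy(dtype=float),
        # Placeholder outcome values; only their length matters for prediction
        y_name: np.zeros(len(df_new)),
    }
```

The result would then be passed inside the model context, as in `with model: pm.set_data(payload)`, before calling `pm.sample_posterior_predictive(trace)`.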