Trouble understanding sample_ppc

ckrapu · June 6, 2018, 11:32pm

Hello,

I’m having trouble understanding what’s going when pm.sample_ppc is called. I made some data generated by a simple linear regression and then drew samples from the posterior predictive distribution with this code:

with pm.Model() as model1:
    x     = pm.Uniform('x',lower = 0.0, upper = 10.0, observed = observed_x,shape = n)
    sigma = pm.HalfCauchy('sigma',beta = 1.0)
    beta  = pm.Normal('beta',sd = 5.0)
    alpha = pm.Normal('alpha',sd = 5.0)
    y_hat = beta * x + alpha
    y = pm.Normal('y',mu = y_hat,sd = sigma, shape = n,observed = observed_y)
    trace = pm.sample()
    ppc = pm.sample_ppc(trace, samples=1, model=model1)

Then, I plot the samples of Y versus the samples of X
plt.scatter(ppc['x'],ppc['y']) and I find that they are uncorrelated:

Now, I was expecting it to instead look like this:
plt.scatter(observed_x,ppc['y'])

In short, my question is this: why are the sampled X and Y values not consistent with regard to the model logic?

Nadheesh · June 7, 2018, 4:44am

I did not execute your code, but by looking at it there are some changes that you should make.

First you don’t have to define a probability distribution for x. I’m no expert, but it seems redundant.
In last line set the sample size to samples=10000. If you collect single sample, it will only draw a singles sample from the predictive distribution, which can be either closes to the mode of the predictive distribution or a point far away from that. If you collect sufficient number of samples then you can estimate the predictive distribution.
As I said, once you collect many samples for “y”, you have to consider the mode or mean of those samples for a single prediction.

The corrected code:

with pm.Model() as model1:
    sigma = pm.HalfCauchy('sigma',beta = 1.0)
    beta  = pm.Normal('beta', sd = 5.0)
    alpha = pm.Normal('alpha', sd = 5.0)
    y_hat = beta * x + alpha
    y = pm.Normal('y',mu = y_hat,sd = sigma, observed = observed_y)
    trace = pm.sample(1000)
    ppc = pm.sample_ppc(trace, samples=10000, model=model1)

Then plot the predictions using:

# here x is the generated dataset (should be one dimensional array)
plt.plot(x, ppc["y"].mean(axis=0))

You can also use the GLM module for most of the regression problems. And please follow this tutorial if you have further questions : https://docs.pymc.io/notebooks/GLM-linear.html

ckrapu · June 7, 2018, 2:28pm

Perhaps I should clarify my question a little - my goal with all of this is to generate simulated datasets of both the x and the y variable, and that’s why x has a prior distribution. You can see that the distribution of x from sample_ppc is fine, but the correlation with y isn’t carried through the model.

junpenglao · June 7, 2018, 3:05pm

The conditional in sample_ppc is not set up properly in this case - as a result the new y_hat is not conditioned on the newly generated x. The reason here is that, in sample_ppc the RVs only updated according to the inputted point (a dictionary containing one posterior sample), but not the newly generated values.

For now, the only way to achive what you want is writting your own ppc function to do posterior generation. I think this is something we should improve, could you raise an issue on Github?

ckrapu · June 7, 2018, 4:06pm

Done. Thanks for the help!

Topic		Replies	Views
Posterior Predictive Checks Questions	1	604	May 19, 2019
Sample_ppc shape Questions	5	2669	July 18, 2018
Model Fits with no issues but getting an error when sampling ppc v5 modeling	4	376	July 11, 2023
Prediction using sample_ppc in Hierarchical model Questions from_github	6	4973	December 14, 2017
How to use the posterior predictive distribution for checking a model from PyMC version agnostic arviz , model-checking	10	4183	March 14, 2023

Trouble understanding sample_ppc

Related topics