Trouble understanding sample_ppc


#1

Hello,

I’m having trouble understanding what’s going when pm.sample_ppc is called. I made some data generated by a simple linear regression and then drew samples from the posterior predictive distribution with this code:

with pm.Model() as model1:
    x     = pm.Uniform('x',lower = 0.0, upper = 10.0, observed = observed_x,shape = n)
    sigma = pm.HalfCauchy('sigma',beta = 1.0)
    beta  = pm.Normal('beta',sd = 5.0)
    alpha = pm.Normal('alpha',sd = 5.0)
    y_hat = beta * x + alpha
    y = pm.Normal('y',mu = y_hat,sd = sigma, shape = n,observed = observed_y)
    trace = pm.sample()
    ppc = pm.sample_ppc(trace, samples=1, model=model1)

Then, I plot the samples of Y versus the samples of X
plt.scatter(ppc['x'],ppc['y']) and I find that they are uncorrelated:
image1

Now, I was expecting it to instead look like this:
plt.scatter(observed_x,ppc['y'])
image2

In short, my question is this: why are the sampled X and Y values not consistent with regard to the model logic?


#2

I did not execute your code, but by looking at it there are some changes that you should make.

  • First you don’t have to define a probability distribution for x. I’m no expert, but it seems redundant.
  • In last line set the sample size to samples=10000. If you collect single sample, it will only draw a singles sample from the predictive distribution, which can be either closes to the mode of the predictive distribution or a point far away from that. If you collect sufficient number of samples then you can estimate the predictive distribution.
  • As I said, once you collect many samples for “y”, you have to consider the mode or mean of those samples for a single prediction.

The corrected code:

with pm.Model() as model1:
    sigma = pm.HalfCauchy('sigma',beta = 1.0)
    beta  = pm.Normal('beta', sd = 5.0)
    alpha = pm.Normal('alpha', sd = 5.0)
    y_hat = beta * x + alpha
    y = pm.Normal('y',mu = y_hat,sd = sigma, observed = observed_y)
    trace = pm.sample(1000)
    ppc = pm.sample_ppc(trace, samples=10000, model=model1)

Then plot the predictions using:

# here x is the generated dataset (should be one dimensional array)
plt.plot(x, ppc["y"].mean(axis=0))

You can also use the GLM module for most of the regression problems. And please follow this tutorial if you have further questions : https://docs.pymc.io/notebooks/GLM-linear.html


#3

Perhaps I should clarify my question a little - my goal with all of this is to generate simulated datasets of both the x and the y variable, and that’s why x has a prior distribution. You can see that the distribution of x from sample_ppc is fine, but the correlation with y isn’t carried through the model.


#4

The conditional in sample_ppc is not set up properly in this case - as a result the new y_hat is not conditioned on the newly generated x. The reason here is that, in sample_ppc the RVs only updated according to the inputted point (a dictionary containing one posterior sample), but not the newly generated values.

For now, the only way to achive what you want is writting your own ppc function to do posterior generation. I think this is something we should improve, could you raise an issue on Github?


#5

Done. Thanks for the help!