# `sample_posterior_predictive` dependent variable vs deterministic dependent variable

Hello,

I am following this tutorial: https://github.com/PacktPublishing/Bayesian-Analysis-with-Python-Second-Edition/blob/master/Chapter03/03_Modeling%20with%20Linear%20Regressions.ipynb , and I would like to understand the difference between sampling your dependent variable using `sample_posterior_predictive` and simply extracting the deterministic variable (`µ` below) from the trace (`trace_g` below).

You can use this code block to set up your environment (taken from the link above):

```python
import pymc3 as pm
import numpy as np
import pandas as pd
from theano import shared
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)

x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real

with pm.Model() as model_g:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1)
    ϵ = pm.HalfCauchy('ϵ', 5)
    μ = pm.Deterministic('μ', α + β * x)
    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
    trace_g = pm.sample(2000, tune=1000, chains=2)
```

Now we can sample the posterior predictive like this:

```python
ppc = pm.sample_posterior_predictive(trace_g, samples=4000, model=model_g)
sample_df = pd.DataFrame(ppc['y_pred'])
```

And get the `µ` values like this:

```python
trace_df = pm.trace_to_dataframe(trace_g)
# it so happens only the µ columns contain '_'; this wouldn't always work
trace_df = trace_df.loc[:, [col for col in list(trace_df.columns) if '_' in col]]
```
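As an aside, filtering on any `'_'` is fragile. Assuming PyMC3's `var__index` naming convention for flattened vector variables (note the double underscore), a safer filter matches that prefix. A stand-alone sketch with a made-up dataframe in place of the real trace:

```python
import pandas as pd

# Toy stand-in for pm.trace_to_dataframe(trace_g): scalar columns plus
# the vector µ, which gets flattened into 'μ__0', 'μ__1', ... columns.
trace_df = pd.DataFrame({
    'α': [2.4, 2.6], 'β': [0.90, 0.91], 'ϵ': [0.50, 0.48],
    'μ__0': [11.4, 11.6], 'μ__1': [12.3, 12.5],
})

# Filter on the 'μ__' prefix instead of any '_', which in other models
# would also match unrelated columns such as 'y_pred'.
mu_df = trace_df.loc[:, [c for c in trace_df.columns if c.startswith('μ__')]]
print(mu_df.shape)  # (2, 2)
```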

Now `trace_df.shape == sample_df.shape == (4000, 100)`, and we can define a function to calculate the HPD for each of the 100 data points.

```python
def calc_hpds(df, true_y):
    pred_hpds = []
    for i in list(df.columns):
        # pm.hpd returns a [lower, upper] pair
        hpd = pm.hpd(df.loc[:, i].values)
        avg = df.loc[:, i].mean()
        pred_hpds.append({"lower": hpd[0], "upper": hpd[1], "mean": avg})

    hpds_df = pd.DataFrame(pred_hpds)
    hpds_df["true_y"] = true_y

    return hpds_df
```

Using `calc_hpds`, we can see the `lower`, `upper`, and `mean` values are not the same.

```python
# outputs a dataframe; I was just looking at it in Jupyter
calc_hpds(trace_df, y)

# outputs a second dataframe
calc_hpds(sample_df, y)
```

Obviously I am not expecting the values to be literally identical, but with 4000 replicates I was expecting them to line up much better. Why don't they? Should the two methods produce the same result and I have made a mistake in my calculations, or is using the trace the wrong approach? Hopefully someone can explain why and set me straight.

-Kelley

You are not getting `µ` from `sample_posterior_predictive`; instead you get `y_pred`, which is `µ` convolved with a Gaussian with `sigma=ϵ`. That is why `calc_hpds(sample_df, y)` gives a larger range.
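This extra layer of noise can be seen with a purely synthetic sketch (plain NumPy, no model; the posterior numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are 4000 posterior draws of µ for one data point,
# centred on 11.5 with a (made-up) posterior sd of 0.1.
mu_draws = rng.normal(11.5, 0.1, size=4000)
# Posterior draws of the noise scale ϵ, concentrated near the true 0.5.
eps_draws = np.abs(rng.normal(0.5, 0.02, size=4000))

# Posterior predictive draws: µ convolved with Normal(0, ϵ).
y_pred_draws = rng.normal(mu_draws, eps_draws)

# The predictive draws add ϵ² on top of µ's variance,
# so their spread is necessarily wider.
print(mu_draws.std())      # ≈ 0.1
print(y_pred_draws.std())  # ≈ sqrt(0.1² + 0.5²) ≈ 0.51
```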


Hi!
Yeah, as Junpeng said: `y_pred` is your likelihood, while `µ` is the expected value of that likelihood. As a result, `y_pred` carries another layer of uncertainty, brought in by its `sigma`.
Also, note that you can use `arviz.hpd` for your plots and diagnostics.


Interesting, thank you. Is this still the case in a logistic model, where there is no `sigma=ϵ`? Something like:

```python
with pm.Model() as ret_risk_long:
    ppv = 13 / 164
    sens = 13 / 21
    a = pm.Normal('a', mu=0.0, sigma=2)
    b = pm.Normal('b', mu=0.0, sigma=1, shape=3)
    y_hat = a + b[0] * indep_1 + b[1] * indep_2 + b[2] * indep_3
    theta = pm.Deterministic('theta', pm.math.sigmoid(y_hat))
    y_like = pm.Bernoulli('y_like', p=theta, observed=dependent)
    prev = pm.Deterministic('prevalence', ppv * theta / sens)
    long_gt1177_trace = pm.sample(2000, tune=1000, chains=2)
```

Here `theta` is analogous to `µ`. Can anything interesting be done with `theta`? For example, something similar to the prediction interval you get from `sample_posterior_predictive`, rather than the regression mean and HPD one can get/plot from the coefficient estimates in the trace. This would be analogous to the difference between `` and `` in the example notebook.

Thank you

That would essentially be what you find in https://docs.pymc.io/notebooks/posterior_predictive.html
In general it is still not the same: the uncertainty in the latent variable is narrower than the uncertainty in the posterior predictive samples of the observed variable.
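The same point holds in the Bernoulli case, where sampling the likelihood adds discreteness noise on top of `theta`'s posterior spread. A synthetic sketch (plain NumPy, made-up posterior numbers, no model):

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend posterior draws of theta for one observation, around 0.3.
theta_draws = np.clip(rng.normal(0.3, 0.05, size=4000), 0, 1)

# Posterior predictive draws: one Bernoulli flip per theta draw.
y_like_draws = rng.binomial(1, theta_draws)

# theta's uncertainty is small, but each predictive draw is a hard 0/1,
# so its variance is dominated by the Bernoulli term theta*(1-theta).
print(theta_draws.var())   # ≈ 0.05² = 0.0025
print(y_like_draws.var())  # ≈ 0.3 * 0.7 ≈ 0.21
```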
