Hello,

I am following this tutorial: https://github.com/PacktPublishing/Bayesian-Analysis-with-Python-Second-Edition/blob/master/Chapter03/03_Modeling%20with%20Linear%20Regressions.ipynb , and I would like to understand the difference between sampling your dependent variables using `sample_posterior_predictive`

and simply extracting your dependent variable (`µ`

below) from the trace (`trace_g`

below)

**You can use this code block to set your environment (taken from the link above)**

```
import pymc3 as pm
import numpy as np
import pandas as pd
from theano import shared
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az
np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real
with pm.Model() as model_g:
α = pm.Normal('α', mu=0, sd=10)
β = pm.Normal('β', mu=0, sd=1)
ϵ = pm.HalfCauchy('ϵ', 5)
μ = pm.Deterministic('μ', α + β * x)
y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
trace_g = pm.sample(2000, tune=1000, chains = 2)
```

Now we can sample the posterior predictive like this

```
ppc = pm.sample_posterior_predictive(trace_g, samples=4000, model=model_g)
sample_df = pd.DataFrame(ppc['y_pred'])
```

And get the `µ`

's like this

```
trace_df = pm.trace_to_dataframe(trace_g)
#it so happens only µ columns have '_', this wouldn't always work
trace_df = df_test.loc[:, [col for col in list(trace_df.columns) if '_' in col]]
```

Now `trace_df.shape = sample_df.shape = (4000,100)`

, and we can define a function to calculate the hpd for each of 100 samples.

```
def calc_hpds(df, true_y):
pred_hpds = []
for i in list(df.columns):
hpd = pm.hpd(df.loc[:, i])
avg = df.loc[:, i].mean()
pred_hpds.append({"lower": hpd[0], "upper": hpd[1], "mean": avg})
hpds_df = pd.DataFrame(pred_hpds)
hpds_df["true_y"] = true_y
return hpds_df
```

Using `calc_hpds`

, we can see the `lower`

, `upper`

, and `mean`

values are not the same.

```
#outputs a dataframe, I was just looking at it in jupyter
calc_hpds(trace_df, y)
#outputs a 2nd dataframe
calc_hpds(sample_df, y)
```

Obviously I am not expecting the values to literally be the exact same, but with 4000 replicates I was thinking the values would line up much better. Why don’t the values line up? Should the two methods produce the same result and I have made a mistake in my calculations, or is using the trace wrong approach, and hopefully someone can explain why and set me straight.

Thank you for your help,

-Kelley