Hello,
I am following this tutorial: https://github.com/PacktPublishing/Bayesian-Analysis-with-Python-Second-Edition/blob/master/Chapter03/03_Modeling%20with%20Linear%20Regressions.ipynb , and I would like to understand the difference between sampling your dependent variables using sample_posterior_predictive
and simply extracting your dependent variable (µ
below) from the trace (trace_g
below)
You can use this code block to set your environment (taken from the link above)
import pymc3 as pm
import numpy as np
import pandas as pd
from theano import shared
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az
np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)
x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real
with pm.Model() as model_g:
α = pm.Normal('α', mu=0, sd=10)
β = pm.Normal('β', mu=0, sd=1)
ϵ = pm.HalfCauchy('ϵ', 5)
μ = pm.Deterministic('μ', α + β * x)
y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
trace_g = pm.sample(2000, tune=1000, chains = 2)
Now we can sample the posterior predictive like this
ppc = pm.sample_posterior_predictive(trace_g, samples=4000, model=model_g)
sample_df = pd.DataFrame(ppc['y_pred'])
And get the µ
's like this
trace_df = pm.trace_to_dataframe(trace_g)
#it so happens only µ columns have '_', this wouldn't always work
trace_df = df_test.loc[:, [col for col in list(trace_df.columns) if '_' in col]]
Now trace_df.shape = sample_df.shape = (4000,100)
, and we can define a function to calculate the hpd for each of 100 samples.
def calc_hpds(df, true_y):
pred_hpds = []
for i in list(df.columns):
hpd = pm.hpd(df.loc[:, i])
avg = df.loc[:, i].mean()
pred_hpds.append({"lower": hpd[0], "upper": hpd[1], "mean": avg})
hpds_df = pd.DataFrame(pred_hpds)
hpds_df["true_y"] = true_y
return hpds_df
Using calc_hpds
, we can see the lower
, upper
, and mean
values are not the same.
#outputs a dataframe, I was just looking at it in jupyter
calc_hpds(trace_df, y)
#outputs a 2nd dataframe
calc_hpds(sample_df, y)
Obviously I am not expecting the values to literally be the exact same, but with 4000 replicates I was thinking the values would line up much better. Why don’t the values line up? Should the two methods produce the same result and I have made a mistake in my calculations, or is using the trace wrong approach, and hopefully someone can explain why and set me straight.
Thank you for your help,
-Kelley