Sample_posterior_predictive dependent variable vs deterministic dependent variable

Hello,

I am following this tutorial: https://github.com/PacktPublishing/Bayesian-Analysis-with-Python-Second-Edition/blob/master/Chapter03/03_Modeling%20with%20Linear%20Regressions.ipynb , and I would like to understand the difference between sampling your dependent variable using sample_posterior_predictive and simply extracting your dependent variable (µ below) from the trace (trace_g below).

You can use this code block to set up your environment (taken from the link above):

import pymc3 as pm
import numpy as np
import pandas as pd
from theano import shared
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)

x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real

with pm.Model() as model_g:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1)
    ϵ = pm.HalfCauchy('ϵ', 5)
    μ = pm.Deterministic('μ', α + β * x)
    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
    trace_g = pm.sample(2000, tune=1000, chains = 2)

Now we can sample the posterior predictive like this:

ppc = pm.sample_posterior_predictive(trace_g, samples=4000, model=model_g)
sample_df = pd.DataFrame(ppc['y_pred'])

And get the µ's like this:

trace_df = pm.trace_to_dataframe(trace_g)
# it so happens that only the µ columns contain '_', so this wouldn't always work
trace_df = trace_df.loc[:, [col for col in trace_df.columns if '_' in col]]
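
(A less fragile alternative, just as a sketch: since μ was registered as a pm.Deterministic, you can index the trace by its name and skip the column filtering; mu_samples is only an illustrative name.)

mu_samples = trace_g['μ']            # shape (4000, 100): draws (2 chains x 2000) by observations
trace_df = pd.DataFrame(mu_samples)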

Now trace_df.shape = sample_df.shape = (4000, 100), and we can define a function to calculate the HPD interval for each of the 100 columns (one column per observation).

def calc_hpds(df, true_y):
    # compute the HPD interval and posterior mean for each column (one column per observation)
    pred_hpds = []
    for col in df.columns:
        hpd = pm.hpd(df[col].values)
        avg = df[col].mean()
        pred_hpds.append({"lower": hpd[0], "upper": hpd[1], "mean": avg})

    hpds_df = pd.DataFrame(pred_hpds)
    hpds_df["true_y"] = true_y

    return hpds_df

Using calc_hpds, we can see that the lower, upper, and mean values from the two approaches are not the same.

# outputs a DataFrame; I was just inspecting it in Jupyter
calc_hpds(trace_df, y)

# outputs a second DataFrame
calc_hpds(sample_df, y)

Obviously I am not expecting the values to be literally identical, but with 4000 replicates I expected them to line up much better. Why don't the values line up? Should the two methods produce the same result, meaning I have made a mistake in my calculations, or is using the trace the wrong approach? Hopefully someone can explain why and set me straight.

Thank you for your help,

-Kelley

You are not getting µ from sample_posterior_predictive; instead you get y_pred, which is µ convolved with a Gaussian of sigma=ϵ. So you would expect the intervals from calc_hpds(sample_df, y) to have a larger range.
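
To see this concretely, here is a rough sketch reusing trace_g, ppc, calc_hpds, trace_df and sample_df from above (manual_ppc, mu_hpds and ppc_hpds are just illustrative names): a posterior predictive draw is a μ draw plus Gaussian noise with the matching ϵ draw, so the y_pred intervals come out systematically wider than the μ intervals.

# reconstruct posterior predictive draws by hand from the trace
mu_draws = trace_g['μ']                     # (4000, 100)
eps_draws = trace_g['ϵ'][:, None]           # (4000, 1), broadcast over observations
manual_ppc = np.random.normal(mu_draws, eps_draws)   # distributionally ≈ ppc['y_pred']

# compare average interval widths
mu_hpds = calc_hpds(trace_df, y)
ppc_hpds = calc_hpds(sample_df, y)
print((mu_hpds['upper'] - mu_hpds['lower']).mean())    # narrower: uncertainty in the mean only
print((ppc_hpds['upper'] - ppc_hpds['lower']).mean())  # wider: also includes observation noise ϵ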

Hi!
Yeah, as Junpeng said: y_pred is your likelihood, while µ is the expected value of that likelihood. As a result, y_pred carries another layer of uncertainty, brought in by its sigma.
Also, note that you can use arviz.hpd for your plots and diagnostics.
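
For instance, a minimal sketch of the two bands on one plot (az.plot_hpd as in the ArviZ versions of that time; later releases rename it az.plot_hdi):

plt.scatter(x, y, alpha=0.5)
az.plot_hpd(x, trace_g['μ'], color='C1')      # band for the expected value μ
az.plot_hpd(x, ppc['y_pred'], color='C2')     # wider band for new observations
plt.show()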

Interesting, thank you. Is this still the case in a logistic model, where there is no sigma=ϵ? Something like:

with pm.Model() as ret_risk_long:
    ppv = (13/164)
    sens = (13/21)
    a = pm.Normal('a', mu=0.0, sigma=2)
    b = pm.Normal('b', mu=0.0, sigma=1, shape=3)
    y_hat = a + b[0] * indep_1 + b[1] * indep2 + b[2] * indep_3
    theta = pm.Deterministic('theta', pm.math.sigmoid(y_hat))
    y_like = pm.Bernoulli('y_like', p=theta, observed=dependent)
    prev = pm.Deterministic('prevalence', ppv * theta / sens)
    long_gt1177_trace = pm.sample(2000, tune=1000, chains=2)

Where theta is analogous to µ. Can anything interesting be done with theta? For example, something similar to the prediction interval you can get from sample_posterior_predictive, rather than the regression mean and HPD that one can get/plot from the coefficient estimates in the trace. This would be analogous to the difference between cells [9] and [11] in the example notebook.

Thank you

That would essentially be what you find in https://docs.pymc.io/notebooks/posterior_predictive.html
In general, it is still not the same: the uncertainty in the latent variable (theta) is not the uncertainty in the posterior predictive samples of the observed variable.
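
A minimal sketch of that comparison for the Bernoulli model above (assuming it has been sampled; ppc_log, theta_draws and y_like_draws are just illustrative names): theta draws capture the uncertainty about the event probability, while posterior predictive draws are 0/1 outcomes whose spread additionally includes the Bernoulli noise.

with ret_risk_long:
    ppc_log = pm.sample_posterior_predictive(long_gt1177_trace, samples=4000)

theta_draws = long_gt1177_trace['theta']   # (4000, n_obs) probabilities: uncertainty about p
y_like_draws = ppc_log['y_like']           # (4000, n_obs) 0/1 draws: adds Bernoulli noise p*(1-p)

# the spread of the latent theta is narrower than the spread of the 0/1 predictive draws
print(theta_draws.std(axis=0).mean())
print(y_like_draws.std(axis=0).mean())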
