Sample_posterior_predictive dependent variable vs deterministic dependent variable

kelleyjbrady · February 11, 2020, 1:45am

Hello,

I am following this tutorial: https://github.com/PacktPublishing/Bayesian-Analysis-with-Python-Second-Edition/blob/master/Chapter03/03_Modeling%20with%20Linear%20Regressions.ipynb , and I would like to understand the difference between sampling your dependent variables using sample_posterior_predictive and simply extracting your dependent variable (µ below) from the trace (trace_g below)

You can use this code block to set your environment (taken from the link above)

import pymc3 as pm
import numpy as np
import pandas as pd
from theano import shared
import scipy.stats as stats
import matplotlib.pyplot as plt
import arviz as az

np.random.seed(1)
N = 100
alpha_real = 2.5
beta_real = 0.9
eps_real = np.random.normal(0, 0.5, size=N)

x = np.random.normal(10, 1, N)
y_real = alpha_real + beta_real * x
y = y_real + eps_real

with pm.Model() as model_g:
    α = pm.Normal('α', mu=0, sd=10)
    β = pm.Normal('β', mu=0, sd=1)
    ϵ = pm.HalfCauchy('ϵ', 5)
    μ = pm.Deterministic('μ', α + β * x)
    y_pred = pm.Normal('y_pred', mu=μ, sd=ϵ, observed=y)
    trace_g = pm.sample(2000, tune=1000, chains = 2)

Now we can sample the posterior predictive like this

ppc = pm.sample_posterior_predictive(trace_g, samples=4000, model=model_g)
sample_df = pd.DataFrame(ppc['y_pred'])

And get the µ's like this

trace_df = pm.trace_to_dataframe(trace_g)
#it so happens only µ columns have '_', this wouldn't always work
trace_df = df_test.loc[:, [col for col in list(trace_df.columns) if '_' in col]]

Now trace_df.shape = sample_df.shape = (4000,100), and we can define a function to calculate the hpd for each of 100 samples.

def calc_hpds(df, true_y):
    pred_hpds = []
    for i in list(df.columns):
        hpd = pm.hpd(df.loc[:, i])
        avg = df.loc[:, i].mean()
        pred_hpds.append({"lower": hpd[0], "upper": hpd[1], "mean": avg})

    hpds_df = pd.DataFrame(pred_hpds)
    hpds_df["true_y"] = true_y
   
    return hpds_df

Using calc_hpds, we can see the lower, upper, and mean values are not the same.

#outputs a dataframe, I was just looking at it in jupyter
calc_hpds(trace_df, y)

#outputs a 2nd dataframe
calc_hpds(sample_df, y)

Obviously I am not expecting the values to literally be the exact same, but with 4000 replicates I was thinking the values would line up much better. Why don’t the values line up? Should the two methods produce the same result and I have made a mistake in my calculations, or is using the trace wrong approach, and hopefully someone can explain why and set me straight.

Thank you for your help,

-Kelley

junpenglao · February 11, 2020, 6:43am

You are not getting µ from the sample_posterior_predictive, instead it’s y_pred - which is µ convoluted with a Gaussian with sigma=ϵ. You would get calc_hpds(sample_df, y) being larger range.

AlexAndorra · February 11, 2020, 10:41am

Hi!
Yeah as Junpeng said: y_pred is your likelihood, while µ is the expected value of this likelihood. As a result, y_pred has another layer of uncertainty, brought by its sigma.
Also, note that you can use arviz.hpd for your plots and diagnostics.

kelleyjbrady · February 11, 2020, 4:30pm

Interesting, thank you. Is this still the case in a logistic model where there is no sigma=ϵ, something like:

with pm.Model() as ret_risk_long: 
    ppv = (13/164)
    sens = (13/21)
    a = Normal('a', mu = 0.0, sigma = 2)
    b = Normal("b", mu=0.0, sigma=1, shape=3)
    y_hat = a + b[0] * indep_1 + b[1] * indep2 + b[2] * indep_3
    theta = pm.math.sigmoid(y_hat)
    theta = pm.Deterministic('theta', theta)
    y_like = pm.Bernoulli('y_like', p = theta, observed = dependent)
    prev = pm.Deterministic('prevalence', ppv*theta/sens)
    long_gt1177_trace = pm.sample(2000, tune=1000, chains=2)

Where theta is analogous to µ. Can anything interesting be done with the theta? For example something similar to a prediction interval that you can get using sample_posterior_predictive rather than the regression mean and hpd that one can get/plot from the coefficient estimates in the trace. This would be analogous to the difference between [9] and [11] in the example notebook.

Thank you

junpenglao · February 11, 2020, 6:12pm

That would essentially be what you find in https://docs.pymc.io/notebooks/posterior_predictive.html
It is in general still not the same comparing the uncertainty in the latent variable and the uncertainty in the posterior predictive samples of the observed.

Topic		Replies	Views
Deterministic and observed RV behaviour when using sample_posterior_predictive Questions	5	1085	January 24, 2019
Are samples from the trace equivalent to samples from pm.sample_posterior_predictive? Questions	6	637	October 16, 2020
Mean of mu and mean of predictive posterior distribution Questions linear_model	2	1466	December 11, 2021
Deterministic posterior predictive?	5	542	November 14, 2023
Posterior predictive sampling with data variance Questions	10	2396	September 14, 2018

Sample_posterior_predictive dependent variable vs deterministic dependent variable

Related topics