Posterior Predictive check_ limited to certain sample sizes?

Nn_Nnn · March 30, 2021, 7:30pm

I am new to pymc3 and python.
To illustrate my question please consider the following. A study has being completed and the posterior distribution derived. To predict a the probability of a single new future value we using samples: :ppc[“y”].ravel() from the Posterior predictive check
ppc .=pm.sample_posterior_predictive(trace,2000,model). In a future study we can even predict what the future mean (of a future study with the same sample size as original study) from PYMC3 with samples: ppc[“y”].mean(axis=1).Indeed we can even extend this to multiples M of the original sample size using samples:ppc[“y”].mean(axis=(1,2)) from the ppc with size argument included:pm.sample_posterior_predictive(trace,2000,model,size=M). However my colleague wants extend my study : in other words he wants me to predict the ppc the mean predicted for an extended study of 71 subjects using pymc3. This is not a multiple of 37 the original sample size. we could use the scipy library but would prefer if PYMC3 could do it in a simple way for a newbee.

#To predict the mean the mean of a future sample size N using Scipy_library and pymc3 trace posterior values is easy:
mu =trace[“mu”]
sd = trace[“sd”]
N= 37
yy =stats.norm(mu,sd/np.sqrt(New sample size)).rvs(4000)
ax1.hist(yy, histtype = “step”, density = True)
az.plot_dist(yy,color=“y”, label = “Sci_py predicted value of future mean of size N”,ax =ax1 )

Hope there is a simple way to adjust the positive check to give predictions for any sample size,
Does anybody have a simple way of doing this in PYMC3.
Thank you for your expertise.

twiecki · March 31, 2021, 11:51am

Can you just change the size of your model using a pm.Data container?

Nn_Nnn · March 31, 2021, 12:26pm

Thanks so much for suggestion, I will research the pm.DataContainer. I hope that it does not involve an understanding of Theano , if so I will probably be lost. I would be surprised if pymc3 was limited to predicting only limited values since the posterior distribution is available one would expect we could predict future values with any sample size. Thanks again for your help and I will follow up on the pm.Datacontainer… thanks again.

cluhmann · March 31, 2021, 1:24pm

Can we see what your model looks like? Is it hierarchical?

I hope that it does not involve an understanding of Theano , if so I will probably be lost.

It will not.

I would be surprised if pymc3 was limited to predicting only limited values…

It is not. But how exactly to best approach this depends on the structure of your model and the structure of your data.

Nn_Nnn · March 31, 2021, 2:25pm

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import arviz as az
import scipy.stats as stats
plt.style.use('seaborn-darkgrid')
%matplotlib inline
plt.rcParams['axes.labelsize'] = 16
plt.rcParams['axes.titlesize'] = 20

#-----------------------------------------------

data_sample_size=37 #for original study.
data=  np.random.normal(2, 1, data_sample_size)

#-------------------------------------------------
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=3,testval=0)
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)   
   
    trace = pm.sample(2000,tune=1000)
    ppc =pm.sample_posterior_predictive(trace,2000,model)
#-----------------------------------------------------------------------------
fig=plt.figure(figsize=(20,5))
ax1=fig.add_subplot(121,title='Posterior prediction for a future single value\n and also the mean of N future values', xlabel='y', ylabel='Probability Densit')
ax1.set_xlim(0,4)
ax1.legend(loc="upper left")


#To predict a single future value from posterior or the  mean of a 
#sample with same sample original sample size (n = 37):
with model:
    ppc =pm.sample_posterior_predictive(trace,2000,model)
   
    az.plot_dist(ppc["y"].ravel(),ax =ax1,color ="g",label ="kde for predicting value single value" , kind="kde")
  
    az.plot_dist(ppc["y"].mean(axis=1) ,ax=ax1,color = "r",label= "kde of mean value\nwith original sample size(n=37)", kind="kde")
    
#Predicting the mean the mean of a future sample size N_new 
#using Scipy_library and pymc3 trace posterior values is easy:

mu =trace["mu"]
sd = trace["sd"]
N= 37 #original samples size to test procedure.
yy =stats.norm(mu,sd/np.sqrt(N)).rvs(4000)
ax1.hist(yy, histtype = "step", density = True)
az.plot_dist(yy,color="y", label = "Sci_py predicted value of future mean of size N",ax =ax1 )
fig.savefig("Original sample size")
#plt.show()

#----------------------------------------------------------
ppc["y"].shape,ppc["y"].mean(axis=1).shape,trace["mu"].shape
-------------------------------------------------------------
#To predict a future mean of a sample with a multiple (integer M) of original sample size (n = 37):
fig=plt.figure(figsize=(20,5))
ax1=fig.add_subplot(121,title='Posterior prediction for a future mean with sample size (M x N)', xlabel='y', ylabel='Probability Densit')
ax1.set_xlim(0,4)
ax1.legend(loc="upper left")


with model:
    N= 37 # Original study sample size.
    M=3   # Multiple of original sample size, new sample size = 3 x 37
    
    ppc =pm.sample_posterior_predictive(trace,2000,model,size=M)
    az.plot_dist(ppc["y"].mean(axis=(1,2)) ,ax=ax1,color = "r",label= "kde of mean predicted value with\n multiple of original sample size(n=37)", kind="kde")

yy =stats.norm(mu,sd/np.sqrt(N*M)).rvs(4000)
az.plot_dist(yy,color="y", label = "Sci_py predicted value of future mean with sample size\n a multiple of original sample size.",ax =ax1 )  
fig.savefig("Multiple M of Original sample size") 
#--------------------------------------------            
ppc["y"].mean(axis=(1,2)).shape
#-------------------------------------------

? how to predict the mean for any sample size from posterior distribution of original study using ppc.
#Thanks for looking at the code and my attempt at predicting for different sample sizes in future draws.

cluhmann · March 31, 2021, 5:35pm

For a model this simple, my tendency is to not use pymc’s built-in ppc functionality. Something like this:

# we'll generate a new ppc data set for each of the
# MCMC samples we asked for when we called pm.sample()
# each data set will include new_N observations
new_N = 71
ppcs = stats.norm(trace['mu'],
                  trace['sd']).rvs([new_N,
                                   len(trace['mu'])])
# ppcs.shape = (new_N, len(trace['mu']))

# calculate mean of each ppc data set
ppc_means = np.mean(ppcs, axis=0)

#ppc_means.shape = (len(trace['mu']), )

# now plot, do prediction, etc.

I also think that evaluating posterior predictive behavior “by hand” like this is useful because it forces you to be explicit and thus transparent. This explicitness can hurt (a lot) when your model grows in complexity, but it can help when things are simpler. Using the trace object and doing things by hand also opens up the possibility of much more flexible posterior investigations than a simple posterior prediction “check”.

cluhmann · March 31, 2021, 5:58pm

If you wish to skip doing things by hand and use pm.sample_posterior_predictive() directly, then you can use pm.Data as @twiecki suggested:

First, you must add this line to your model:

with pm.Model() as model:
    data = pm.Data('data', data)
...

Then, you can ask for ppc samples of your original data:

with model:
    ppc=pm.sample_posterior_predictive(trace,
                                       size=100)

    #ppc.shape = (len(trace['mu']), 100, 37)

Or you can swap out your data with new, “dummy” data of the appropriate dimensionality. Here, your data is entirely exogenous (it only includes y), so its contents won’t influence the model predictions. If you had predictors (x's), then you would want to swap out the data such that the x's are those associated with your 71 new observations.

with model:
    pm.set_data({'data': np.ones(71)})
    ppc2=pm.sample_posterior_predictive(trace,
                                        size=100)

    #ppc.shape = (len(trace['mu']), 100, 71)

Now you can calculate means or whatever else you might want to do.

Nn_Nnn · April 1, 2021, 3:05pm

Thank you for the very helpful response. It was a bonus to get not only one way of solving the prediction problem for a new sample size but two different ways. The first way worked perfectly. the second way (“swapping in new dummy data”) I got an error and I am obviously mis-interpreting the method or implementing it incorrectly. It makes sense to introduce “dummy data” of the right shape to make new predictions(for sample size = 71) or for a new future study with a different sample size and then use the posterior trace from the previous study to get new predictions (ppc2). however I am missing some understanding or not following it through correctly. this is what I tried:

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import arviz as az
#------------------------------------------------------------
data_sample_size=37 #for original study.
data= np.random.normal(2, 1, data_sample_size)
#-----------------------------------------------------------
with pm.Model() as model:
mu = pm.Normal(‘mu’, mu=0, sd=3,testval=0)
sd = pm.HalfNormal(‘sd’, sd=10)
y = pm.Normal(‘y’, mu=mu, sd=sd, observed=data)
trace = pm.sample(2000,tune=1000)
ppc =pm.sample_posterior_predictive(trace,2000,model)
#-------------------------------------------------------------------------------------
ppc[“y”].shape
#Output: (2000, 37)
#------------------------------------------------------------
with pm.Model() as model:

data = pm.Data('data', data)
pm.set_data({'data': np.ones(71)})

ppc2 =pm.sample_posterior_predictive(trace,2000,model)

#--------------------------------------------------------------------------------------------------
az.plot_kde(ppc2[“y”])
ppc2[“y”].shape
#------------------------------------------------------------------------------------
I am interpreting ppc2[“y”] as the array of predictions for the data of the new shape required.
However, unfortunately the code as I implemented it just raises a pymc3 error: " KeyError: ‘y’ "

I would love to be able to “swap in new data” successfully, can you possible show me where my implementation of the method is wrong or how to follow through to get predictions with the shape of the “dummy data” by this method which allows broadening the possible predictions. Thanks again for your guidance .

cluhmann · April 1, 2021, 3:29pm

Two things. First, you should include the data = pm.Data('data', data) line in your original model specification, not when you go to grab ppc samples. Second, when you do go to grab your ppc samples, you should re-instantiate the context of your original model (i.e., with model:), not create a new model (when you write with pm.Model() as model: you are creating a brand new, empty model).

First:

with pm.Model() as model:
    data = pm.Data('data', data)
    mu = pm.Normal('mu', mu=0, sd=3,testval=0)
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)
    trace = pm.sample(2000,tune=1000)

And second:

with model:
    pm.set_data({'data': np.ones(71)})
    ppc2=pm.sample_posterior_predictive(trace, size=100)

Note that you do not need to pass the model to sample_posterior_predictive because it’s been wrapped in the with model: context.

Nn_Nnn · April 1, 2021, 6:51pm

Briallant !,
Thanks for getting me by that impass.
Now works briallant, I look forward to more flexible predictions in the future.
Thanks again for your patience and reccomendations.

Topic		Replies	Views
Posterior Predictive Checks Questions	1	599	May 19, 2019
How to use the posterior predictive distribution for checking a model from PyMC version agnostic arviz , model-checking	10	3889	March 14, 2023
Sample_posterior_predictive() works fine in PyMC 3, raises exception in v4 v5	4	571	October 10, 2022
How to get posterior predictive distribution sample data for a single prediction? v5	1	63	November 5, 2024
Interpretation of posterior predictive checks for a Gaussian Process v5 gaussian_process , modeling , arviz	11	1081	August 24, 2023

Posterior Predictive check_ limited to certain sample sizes?

Related topics