Posterior predictive check: limited to certain sample sizes?

I am new to pymc3 and Python.

To illustrate my question, please consider the following. A study has been completed and the posterior distribution derived. To predict a single new future value we use the samples ppc["y"].ravel() from the posterior predictive check ppc = pm.sample_posterior_predictive(trace, 2000, model). We can even predict the mean of a future study (with the same sample size as the original study) using the samples ppc["y"].mean(axis=1). Indeed, we can extend this to integer multiples M of the original sample size using the samples ppc["y"].mean(axis=(1,2)) from a ppc with the size argument included: pm.sample_posterior_predictive(trace, 2000, model, size=M). However, my colleague wants to extend my study: he wants me to predict the mean for an extended study of 71 subjects using pymc3, and 71 is not a multiple of 37, the original sample size. We could use the scipy library, but I would prefer it if pymc3 could do this in a simple way for a newbie.

#To predict the mean of a future sample of size N using the scipy
#library and the pymc3 trace posterior values is easy:
mu = trace["mu"]
sd = trace["sd"]
N = 37  # the new sample size
yy = stats.norm(mu, sd / np.sqrt(N)).rvs(4000)
ax1.hist(yy, histtype="step", density=True)
az.plot_dist(yy, color="y", label="scipy predicted value of future mean of size N", ax=ax1)

I hope there is a simple way to adjust the posterior predictive check to give predictions for any sample size. Does anybody have a simple way of doing this in pymc3?
Thank you for your expertise.

Can you just change the size of your model using a pm.Data container?

Thanks so much for the suggestion, I will research pm.Data. I hope that it does not involve an understanding of Theano; if so I will probably be lost. I would be surprised if pymc3 were limited to predicting only certain sample sizes: since the posterior distribution is available, one would expect we could predict future values with any sample size. Thanks again for your help, and I will follow up on pm.Data.

Can we see what your model looks like? Is it hierarchical?

I hope that it does not involve an understanding of Theano , if so I will probably be lost.

It will not.

I would be surprised if pymc3 was limited to predicting only limited values…

It is not. But how exactly to best approach this depends on the structure of your model and the structure of your data.

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import arviz as az
import scipy.stats as stats
plt.style.use('seaborn-darkgrid')
%matplotlib inline
plt.rcParams['axes.labelsize'] = 16
plt.rcParams['axes.titlesize'] = 20

#-----------------------------------------------

data_sample_size=37 #for original study.
data=  np.random.normal(2, 1, data_sample_size)

#-------------------------------------------------
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=3, testval=0)
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)

    trace = pm.sample(2000, tune=1000)
    ppc = pm.sample_posterior_predictive(trace, 2000, model)
#-----------------------------------------------------------------------------
fig=plt.figure(figsize=(20,5))
ax1 = fig.add_subplot(121, title='Posterior prediction for a future single value\n and also the mean of N future values', xlabel='y', ylabel='Probability Density')
ax1.set_xlim(0, 4)


#To predict a single future value from the posterior, or the mean of a
#sample with the same size as the original sample (n = 37):
with model:
    ppc = pm.sample_posterior_predictive(trace, 2000, model)

    az.plot_dist(ppc["y"].ravel(), ax=ax1, color="g", label="kde for predicting a single value", kind="kde")

    az.plot_dist(ppc["y"].mean(axis=1), ax=ax1, color="r", label="kde of mean value\nwith original sample size (n=37)", kind="kde")
    
#Predicting the mean of a future sample of size N using the scipy
#library and the pymc3 trace posterior values is easy:

mu = trace["mu"]
sd = trace["sd"]
N = 37  # original sample size, to test the procedure
yy = stats.norm(mu, sd / np.sqrt(N)).rvs(4000)
ax1.hist(yy, histtype="step", density=True)
az.plot_dist(yy, color="y", label="scipy predicted value of future mean of size N", ax=ax1)
ax1.legend(loc="upper left")
fig.savefig("Original sample size")
#plt.show()

#----------------------------------------------------------
ppc["y"].shape, ppc["y"].mean(axis=1).shape, trace["mu"].shape
#-------------------------------------------------------------
#To predict a future mean of a sample with a multiple (integer M) of original sample size (n = 37):
fig=plt.figure(figsize=(20,5))
ax1 = fig.add_subplot(121, title='Posterior prediction for a future mean with sample size (M x N)', xlabel='y', ylabel='Probability Density')
ax1.set_xlim(0, 4)


with model:
    N= 37 # Original study sample size.
    M=3   # Multiple of original sample size, new sample size = 3 x 37
    
    ppc = pm.sample_posterior_predictive(trace, 2000, model, size=M)
    az.plot_dist(ppc["y"].mean(axis=(1,2)), ax=ax1, color="r", label="kde of mean predicted value with\na multiple of original sample size (n=37)", kind="kde")

yy = stats.norm(mu, sd / np.sqrt(N*M)).rvs(4000)
az.plot_dist(yy, color="y", label="scipy predicted future mean with sample size\na multiple of original sample size", ax=ax1)
ax1.legend(loc="upper left")
fig.savefig("Multiple M of Original sample size")
#--------------------------------------------            
ppc["y"].mean(axis=(1,2)).shape
#-------------------------------------------


My question: how can I predict the mean for any sample size from the posterior distribution of the original study using the ppc?
#Thanks for looking at the code and my attempt at predicting for different sample sizes in future draws.

For a model this simple, my tendency is to not use pymc’s built-in ppc functionality. Something like this:

# we'll generate a new ppc data set for each of the
# MCMC samples we asked for when we called pm.sample()
# each data set will include new_N observations
new_N = 71
ppcs = stats.norm(trace['mu'],
                  trace['sd']).rvs([new_N,
                                   len(trace['mu'])])
# ppcs.shape = (new_N, len(trace['mu']))

# calculate mean of each ppc data set
ppc_means = np.mean(ppcs, axis=0)

#ppc_means.shape = (len(trace['mu']), )

# now plot, do prediction, etc.

I also think that evaluating posterior predictive behavior “by hand” like this is useful because it forces you to be explicit and thus transparent. This explicitness can hurt (a lot) when your model grows in complexity, but it can help when things are simpler. Using the trace object and doing things by hand also opens up the possibility of much more flexible posterior investigations than a simple posterior prediction “check”.
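For example, once you have ppc_means in hand, other summaries fall out directly. Below is a minimal sketch; the posterior draws (mu_draws, sd_draws) are made-up stand-ins for trace['mu'] and trace['sd'], and the threshold of 2 is purely illustrative:

```python
import numpy as np
import scipy.stats as stats

# Stand-in posterior draws; in practice use trace['mu'] and trace['sd']
rng = np.random.default_rng(0)
mu_draws = rng.normal(2.0, 0.15, size=4000)
sd_draws = np.abs(rng.normal(1.0, 0.1, size=4000))

new_N = 71
# one simulated data set of new_N observations per posterior draw
ppcs = stats.norm(mu_draws, sd_draws).rvs([new_N, len(mu_draws)], random_state=rng)
ppc_means = ppcs.mean(axis=0)

# "by hand" summaries beyond a plain predictive check:
prob_above_2 = (ppc_means > 2.0).mean()              # P(future mean > 2)
interval_95 = np.percentile(ppc_means, [2.5, 97.5])  # 95% predictive interval
```

This is the kind of flexibility I mean: any function of the simulated data sets (tail probabilities, intervals, extremes) is one line of numpy away.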

If you wish to skip doing things by hand and use pm.sample_posterior_predictive() directly, then you can use pm.Data as @twiecki suggested:

First, you must add this line to your model:

with pm.Model() as model:
    data = pm.Data('data', data)
...

Then, you can ask for ppc samples of your original data:

with model:
    ppc=pm.sample_posterior_predictive(trace,
                                       size=100)

    #ppc['y'].shape = (len(trace['mu']), 100, 37)

Or you can swap out your data with new, “dummy” data of the appropriate dimensionality. Here, your data is entirely exogenous (it only includes y), so its contents won’t influence the model predictions. If you had predictors (x's), then you would want to swap out the data such that the x's are those associated with your 71 new observations.

with model:
    pm.set_data({'data': np.ones(71)})
    ppc2=pm.sample_posterior_predictive(trace,
                                        size=100)

    #ppc2['y'].shape = (len(trace['mu']), 100, 71)

Now you can calculate means or whatever else you might want to do.
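For instance, to get the distribution of predicted study means, average over the last axis. A random stand-in array is used below in place of ppc2['y'] (with the shape described above):

```python
import numpy as np

# Stand-in for ppc2['y'], with shape (n_posterior_draws, size, 71)
rng = np.random.default_rng(2)
ppc2_y = rng.normal(2.0, 1.0, size=(4000, 100, 71))

# Mean of each simulated 71-subject study; pool draws and replicates
future_means = ppc2_y.mean(axis=-1).ravel()

# e.g. a 95% predictive interval for the future study mean
lo, hi = np.percentile(future_means, [2.5, 97.5])
```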

Thank you for the very helpful response. It was a bonus to get not one but two different ways of solving the prediction problem for a new sample size. The first way worked perfectly. With the second way ("swapping in new dummy data") I got an error, so I am obviously misinterpreting the method or implementing it incorrectly. It makes sense to introduce dummy data of the right shape to make new predictions (for sample size = 71, or for a new future study with a different sample size) and then use the posterior trace from the previous study to get new predictions (ppc2). However, I am missing some understanding or not following it through correctly. This is what I tried:

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt
import arviz as az
#------------------------------------------------------------
data_sample_size=37 #for original study.
data= np.random.normal(2, 1, data_sample_size)
#-----------------------------------------------------------
with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=3, testval=0)
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)
    trace = pm.sample(2000, tune=1000)
    ppc = pm.sample_posterior_predictive(trace, 2000, model)
#-------------------------------------------------------------------------------------
ppc["y"].shape
#Output: (2000, 37)
#------------------------------------------------------------
with pm.Model() as model:

    data = pm.Data('data', data)
    pm.set_data({'data': np.ones(71)})

    ppc2 = pm.sample_posterior_predictive(trace, 2000, model)

#--------------------------------------------------------------------------------------------------
az.plot_kde(ppc2["y"])
ppc2["y"].shape
#------------------------------------------------------------------------------------
I am interpreting ppc2["y"] as the array of predictions with the new shape required. However, unfortunately the code as I implemented it just raises a pymc3 error: "KeyError: 'y'".

I would love to be able to "swap in new data" successfully. Can you possibly show me where my implementation of the method is wrong, or how to follow through to get predictions with the shape of the "dummy data" by this method, which allows broadening the possible predictions? Thanks again for your guidance.

Two things. First, you should include the data = pm.Data('data', data) line in your original model specification, not when you go to grab ppc samples. Second, when you do go to grab your ppc samples, you should re-instantiate the context of your original model (i.e., with model:), not create a new model (when you write with pm.Model() as model: you are creating a brand new, empty model).

First:

with pm.Model() as model:
    data = pm.Data('data', data)
    mu = pm.Normal('mu', mu=0, sd=3,testval=0)
    sd = pm.HalfNormal('sd', sd=10)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)
    trace = pm.sample(2000,tune=1000)

And second:

with model:
    pm.set_data({'data': np.ones(71)})
    ppc2=pm.sample_posterior_predictive(trace, size=100)

Note that you do not need to pass the model to sample_posterior_predictive because it’s been wrapped in the with model: context.


Brilliant!
Thanks for getting me past that impasse.
It now works brilliantly, and I look forward to more flexible predictions in the future.
Thanks again for your patience and recommendations.
