Some misunderstandings about prior predictive checks

As indicated in the PyMC documentation:

Prior predictive checks […] allow you to check whether you are indeed incorporating scientific knowledge into your model – in short, they help you check how credible your assumptions before seeing the data are.

The most important point, in my opinion, is the assumption “before seeing the data”, which seems logical.

However, it gives me a modeling problem, which I explain below:

Let B be a random variable following a Bernoulli distribution with parameter, say, θ. Suppose I plan an experiment consisting of n i.i.d. realizations of B; the number x ∈ {0, 1, …, n} of successes then follows a binomial distribution X ~ Binom(n, θ), and the likelihood of x given θ is, unless I err:

f(x \mid \theta) = \binom{n}{x}\, \theta^{x} (1-\theta)^{n-x}
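(As a quick numeric sanity check with made-up values n = 10, x = 7 and θ = 0.8, this formula agrees with scipy.stats.binom.pmf:)

from math import comb
from scipy import stats

n, x, theta = 10, 7, 0.8                               # made-up values, just to check the formula
print(comb(n, x) * theta**x * (1 - theta) ** (n - x))  # 0.201326592
print(stats.binom.pmf(x, n, theta))                    # same value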

Previous knowledge about this particular B suggests that θ should be close to 0 or 1. I therefore take the Beta(1/2, 1/2) distribution as my prior and write:

import pymc as pm

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    idata = pm.sample()
    pm.sample_prior_predictive(return_inferencedata=True)
    idata.extend(pm.sample_prior_predictive())

Note that, for the moment, neither n nor x has been defined and I have no data yet…

My first misunderstanding comes from the fact that the groups idata.posterior and idata.prior both contain the same variable p, but with different values and shapes: idata.posterior["p"].shape = (4, 1000) and idata.prior["p"].shape = (1, 500).

  • How can this be explained to someone new to Bayesian inference?

My second misunderstanding comes from the sentence in the PyMC doc: “help you check how credible your assumptions before seeing the data are”.

The only assumption made so far is the inclusion of the Beta(1/2, 1/2) prior in the model, so…

  • How can someone new to Bayesian inference check if this assumption is credible, knowing that idata contains only the groups posterior, sample_stats and prior?

I know something in the logic of all this still escapes me… But could my two questions, basic as they are, be answered to help me progress?

I hope I have been understandable and not excessively tedious.


The prior group came from the prior predictive line of code. You could delay the pm.sample() call to avoid being confused by the posterior group before you are ready for it (i.e., before you are done checking your assumptions).
Then you could use ArviZ to plot the prior distribution. You will see that the density is high near zero and near one, agreeing with your subject knowledge. Once you add more of your data-generating process, culminating in data like what you plan to collect, you can check whether the results look reasonable.
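For example, something along these lines (assuming the prior draws are already in idata.prior, as in your snippet) should show the density piling up near 0 and 1:

import arviz as az

# Plot the prior draws of p; a Beta(0.5, 0.5) prior should put most of its
# density near 0 and near 1 (a U shape)
az.plot_density(idata, group="prior", var_names=["p"])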


I think it’s a bit difficult to see how these different methods are supposed to work with such a simple model. If you haven’t already, I would strongly suggest checking out the prior/posterior checking notebook.

In the case of a model with no observed variables (like the one you propose), prior predictive checks aren’t really “predictive checks”, because the prior predictive distribution is a distribution over the observed variable(s). So when you call pm.sample_prior_predictive() on a model with no observed variables, PyMC simply samples from your prior and adds those draws to your inference object in the prior group. So in your case, you have draws of p. When you call pm.sample(), you get draws from the posterior.

The prior values are those that you (the analyst) believe are reasonable before seeing the data, and the posterior values are those that you (the analyst) believe are reasonable after seeing the data. Given that you haven’t seen any data, you would expect these values to be drawn from the same distribution. The reason the idata.prior and idata.posterior groups have different shapes is simply that you have requested (by relying on the default arguments) different numbers of prior and posterior draws.
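For example, a minimal sketch of your unobserved model with explicit draw counts (the exact numbers here are arbitrary) would give matching numbers of prior and posterior draws:

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    # explicitly request 1000 prior draws and 1000 posterior draws per chain
    idata = pm.sample_prior_predictive(samples=1000)
    idata.extend(pm.sample(draws=1000))

# idata.prior["p"].shape     -> (1, 1000)
# idata.posterior["p"].shape -> (4, 1000), assuming the default 4 chains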

If you instead want to know what sorts of data your priors imply, you need to a) finish specifying the model and b) provide some information about what kind of scenario you wish to simulate. In the case of a beta-binomial model, you need to add the binomial component of the model. Then you need to specify the number of observations and the number of “flips” associated with each observation. If you had data, these values would be easy enough to specify:

obs_heads = [10, 9, 8, 7, 6]      # observed number of successes per experiment
obs_flips = [20, 20, 10, 10, 10]  # number of trials per experiment

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    # observed binomial outcomes: obs_heads successes out of obs_flips trials
    x = pm.Binomial("x", n=obs_flips, p=p, observed=obs_heads)
    idata = pm.sample_prior_predictive()  # prior and prior predictive draws
    idata.extend(pm.sample())             # add posterior draws

If you don’t have data, the binomial-distributed parameter can be unobserved. But you would still need to specify how many of these binomial-distributed values (i.e., simulated observations) you wish to generate and specify how many “flips” each observation should be associated with:

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    # five unobserved (simulated) binomial outcomes, with the number of
    # "flips" for each given by n
    x = pm.Binomial("x", n=[100, 5, 27, 3, 42], p=p, shape=5)
    idata = pm.sample_prior_predictive()
    idata.extend(pm.sample())

Hi cluhmann, your answer is rich and deserves time for me to dig into it. But, first of all, there is one point that I would like to address…

I understand that calling pm.sample() means sampling from the posterior and that I can change the default number of draws with, for example, idata = pm.sample(2500), which gives me idata.posterior["p"].shape = (4, 2500) [4 chains and 2500 draw iterations, as requested].

I also understand that calling pm.sample_prior_predictive() means sampling from the prior and that the default is samples=500. But, strangely, changing the default to 1000, for example:

pm.sample_prior_predictive(samples=1000, return_inferencedata=True)

doesn’t change the result and still gives me idata.prior["p"].shape = (1, 500).

What can be the reason for this?

It’s not clear what model you are using pm.sample_prior_predictive() on. But running it on either of the models I outlined produces the expected number of prior samples.

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    x = pm.Binomial("x", n=[100, 5, 27, 3, 42], p=p, shape=5)
    idata = pm.sample_prior_predictive(samples=1234)

print(idata.prior["x"].values.shape)
# (1, 1234, 5)

Oh yes, actually, I didn’t run it on any of the models you outlined, but I just reproduced the first model I used before:

import pymc as pm

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    idata = pm.sample(2500)
    pm.sample_prior_predictive(samples=1000, return_inferencedata=True)
    idata.extend(pm.sample_prior_predictive())

I will look into that…

In that snippet, you are calling pm.sample_prior_predictive() twice. The result of the first call (the one with samples=1000) is never saved, and only the second call, which uses the default samples=500, is added to idata via extend(). That is why you still see a prior shape of (1, 500).
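Something like this, with a single call whose result is kept, should give you the 1000 prior draws you asked for:

with pm.Model() as model:
    p = pm.Beta("p", alpha=0.5, beta=0.5)
    idata = pm.sample(2500)
    # one call to sample_prior_predictive, and its result is saved via extend()
    idata.extend(pm.sample_prior_predictive(samples=1000))

# idata.prior["p"].shape -> (1, 1000)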

How stupid of me… This is what it’s like when you want to learn on your own: you always forget something. Understood now, thank you.
