11.16 Rethinking Code

I’m going through the statistical rethinking notebooks and I’m not sure I quite understand what’s happening in cell Out[17] of chapter 11, 11.16 link

with m11_4:
    pm.set_data({"actor_id": np.repeat(range(7), 4), "treat_id": list(range(4)) * 7})
    p_post = pm.sample_posterior_predictive(
        trace_11_4, random_seed=RANDOM_SEED, var_names=["p"]
p_mu = p_post.mean(0).reshape((7, 4))

I’m not sure I understand what pm.set_data is doing in this particular case. I’m also not sure I understand why we need to sample from the posterior_predictive – don’t we have samples from p in the trace?

Thanks for the help, as a noob I’m stumbling a bit here.

1 Like

After we performed model inference using MCMC sampling, we often want to check whether the model fitting is good or not by conditioned on some hold out data or future data. pm.set_data allows you to update the model input to some new data, more information here: https://docs.pymc.io/api/model.html?highlight=set_data#pymc3.model.set_data

after you set the new data, you need to update the the sample so that they conditioned on the new data as well, that’s what sample_posterior_predictive doing here.


Hi Thad,
Thanks for getting in touch; it’s nice to see this material is useful!

As Junpeng perfectly summed it up, here we’re simulating the fact of redoing the experience with the same 7 chimpanzees, across the same 4 treatments – only this time, each chimp gets each treatment only once, and not multiples times as in the original data that served to fit the model.

In essence, we’re simulating the data you see in Code 11.15: if we tested the 7 chimps with the 4 treatments again, which proportions of pulled_left are we expecting to see, given our model and assumptions? And then we compare those predicted proportions to the empirical data of Code 11.15, to get a sense of the model’s performance – that’s what’s meant by posterior predictive checks.
Actually, I even prefer the term posterior retrodictive checks, as we’re trying to see how good the model is at spitting out the data it was fed, with uncertainty estimation around these generated data because we’re Bayesians :wink:

Hope this helps, and enjoy the notebooks :vulcan_salute:


Thank you @junpenglao and @AlexAndorra. Your clarification helps! I’m keying into this bit:

As a check of my understanding of the sample_posterior_predictive method – specifically for this particular case: it’s really being used as a convenient way of grabbing samples from the posterior trace using a replication structure specified by set_data?

Procedurally this seems a little bit different from computing p as a proportion from a predictive distribution of Bernoulli trials. Am I correct thinking that since p is a deterministic function of the parameters a and b, there should be a value of p in trace_11_4 for every value returned by sample_posterior_predictive? i.e there is no _rng computation happening here.

Thanks again!

In this case, I would say yes. More globally, it just uses the generative nature of any Bayesian model:

  • You can infer parameters from the data, as we do classically – I like to call this “going upstream” and this is done with pm.sample in PyMC3.
  • But you can also go the other way around (“downstream”) and simulate data from the parameters you inferred in the first step. That way, you’re checking the internal consistency of your model and see if the data it simulates are not completely off.
    This is done with pm.sample_posterior_predictive but you can always do it by hand, as long as you know what the model is. Here for instance, I think you can do:
p_post = scipy.special.expit(trace_11_4["a"][new_actor_id] + trace_11_4["b"][new_treat_id]))

If you do the manual computation of p_post I did above, I think your statement is true. When using sample_posterior_predictive, I think it’s not though: IIRC, this function samples from the inferred posterior instead of directly using the posterior samples in the trace – in which case there is random sampling happening.
This makes sense though: there is uncertainty in the values of a and b compatible with the data and the model, so there is uncertainty in p (and, downstream, in the Bernoulli trials), so it makes sense that the simulated p samples won’t always be the same.

Hope this is clear enough :slight_smile:

1 Like

Thanks again for your reply. I’ll need to think a little more about how sample_posterior_predictive can sample from the posterior without running a new round of MCMC (maybe it does?). In any case, the notebooks are SUPER helpful! Cheers.

Actually, talking with @OriolAbril made me realize I mixed things up a little in my last explanation, since here we use pm.sample_posterior_predictive to sample a latent variable (p), not an observed variable (pulled_left).

So there is no random sampling involved and I think you’re right when saying that there is "a value of p in trace_11_4 for every value returned by sample_posterior_predictive".

Oriol is preparing a NB about that which will clear that up for you and demonstrate the manual computation I was talking about above – with some sweet xarray magic :wink: Also, chapter 13 will show you the different kinds of possible posterior predictive sampling (which is not possible with chapter 11’s model for various reasons).

1 Like

I put together a notebook that I hope will clarify the question. It builds on top of the previous answers adding examples and tries to showcase fancy features in ArviZ

The source of the notebook is in GitHub, however, as I am using html rich representations of the multidimensional arrays, the GitHub render is very slow and close to useless, so here is a proper render of the notebook.


@AlexAndorra nice work getting chapter 13 together so quickly! Yeah I see how things would be different if the model in chapter 11 was hierarchical. I look forward to working through that problem set. :smiley:

Thanks for the clarification. First off: that notebook rendering is really cool! Second, the xarray tricks you showcase are pretty handy. To the extent these notebooks can serve as something of a rosetta stone going from stan/R to pymc3 I imagine others might be interested in getting a better handle on all the coordinate/broadcasting magic. I ended doing something for option (1) by putting the data into pandas, but it feels a bit clunky.

pm_data = az.from_pymc3(
    trace=trace, coords={"treatment": data["treatment"], "actor": data["actor"]},

samples = pd.DataFrame(np.vstack(pm_data.posterior["p"]).T)
samples["treatment"] = pm_data.constant_data.treat_id
samples["actor"] = pm_data.constant_data.actor_id

samples_long_form = samples.melt(id_vars=["treatment", "actor"])
samples_long_form.groupby(["treatment", "actor"]).mean()
1 Like

ArviZ can also be used with PyStan results if it were to be easier. One of the main motivations for InferenceData is to create a standard format for all probabilistic programming languages, results are converted from PyMC3, PyStan, (Num)Pyro, TensorFlow Probability… to InferenceData so that ArviZ code can be backend agnostic. This allows to quickly support any library by adding only a converter function.

Option 1 is bound to be clunky unless you already have the data very well sorted :sweat_smile: (which is probably not the case in most real situations).

I tried to give it a go with xarray so that the result is the posterior dataset reshaped to have the correct dimensions for p which for me would be (actor, treatment, repetition). You can then average chain, draw, repetition dimensions to get the desired probabilities. I am not really sure I actually succeeded in reducing the clunkyness, I feel like it ended up increasing but still, I think the end result is quite interesting. I have updated the same notebook so I am not reposting the link.