Model shape and best practice for sampling prior predictive on model without observed data

I’ve been following the lecture course on Statistical Rethinking (which I’m enjoying immensely). Below I’m playing with data associated with the Howell1 dataset.

It’s been a while since I used pymc, and I’ve got a reasonably basic question on what is good practice if I want to “play around with priors”.

In the example below, I’ve passed some random height data to the model, and I then sample the prior predictive. When I plot it with arviz, it seems to think I want one distribution per datapoint. If I uncomment the observed line further down, I get a single distribution, which is what I want.

How should I structure my code to play around with this in a good way?

import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(seed=42)

N = 10
H = rng.uniform(low=130, high=170, size=N)
H_norm = H - H.mean()

with pm.Model() as m:
    H_data = pm.Data("height", H_norm, mutable=True)

    a = pm.Normal(name="a", mu=60, sigma=10)
    b = pm.LogNormal(name="b", mu=0, sigma=1)
    mean = a + b * H_data
    sigma = pm.Uniform("sigma", lower=0, upper=10)

    W = pm.Normal(
        # observed= rng.uniform(low=3000, high=6000, size=N) # uncomment this line
    idata = pm.sample_prior_predictive(random_seed=42)

az.plot_posterior(idata, group="prior");

idata with observed commented out:

idata with observed uncommented:

arviz’ plot, showing one distribution per “weight” (this is undesirable):

The default option is to do what you did, just create a model with observations and one without. You can also just ignore the observed dimensions in prior_predictive (or stack them) since they are IID

Okay. It seems a little bit messy that the number of data variables and coordinates changes depending on the presence of observed data, but I guess I can live with that.

Is there no way that I can tell the output model that it should expect to have the same shape as H_norm? Through dims or shape?

You can, but you must make it 1D. you can’t change the number of dimensions of PyMC variables, only the sizes of each dimension:

Untested code:

import numpy as np
import pymc as pm

with pm.Model() as m:
  x = pm.Normal("x")
  obs_data = pm.MutableData("obs_data", [0], dims="obs")  # Some dummy value
  y = pm.Normal("y", x, 1, observed=obs_data, dims="obs")

  prior = pm.sample_prior_predictive()

  # Set real observed data
  pm.set_data({"obs_data": np.arange(100)})
  posterior = pm.sample()

I don’t think that’s easier than just creating two models though. You can also play with the shape argument directly:

Going out on a hunch that you saw some Stan code or started using PyMC with already some basic Stan knowledge. Unlike Stan, in PyMC you don’t need to redefine the model from scratch (or at all) to sample from the prior. I try to explain why below:

when you comment or uncomment this line, you are first and foremost updating the type of model (and the metadata associated with it).

With the “observed” line uncommented, you have a model with 3 parameters: a, b and sigma. And those parameters are combined with some constant data/predictors/covariates/… to make predictions on the “weight” variable. In that case, the posterior and prior groups will only have these 3 variables. prior and posterior groups hold variables in the parameter space. prior_predictive and posterior_predictive hold variables on the observed space.

When you uncomment the “observed” line you are actually saying “no, this is not a likelihood term, I want an extra parameter that is actually a vector of lenght N; also, I have no observed data at all”.

Sampling from the prior and prior predictive is done with forward sampling, so the result of sample_prior_predictive will actually be the same in both cases. We first sample a, b and sigma, then use those values to generate samples from “weight”. The observed keyword is basically ignored by sample_prior_predictive. The difference is that in the first case, that variable is stored in the prior_predictive group, as you have indicated it is on the observed space, whereas in the second it is stored in the prior group as you are indicating it is a parameter instead.

Thanks you to both @ricardoV94 and @OriolAbril for the in-depth answers! Sorry for taking some time to respond! I’ve been coming back to the following post several times, trying to formulate my thoughts.

I’ve actually never used Stan, but Oriol’s comment is a very interesting read!

I’ve tried responding several times here, but never quite managed to formulate my thoughts correctly (they’re to do with understanding the shape of model with and without observed data). As I work more with pymc, I think this topic will become clearer to me. :blush: