Trouble understanding how to use sample_posterior_predictive to generate a prediction

I’m quite new to PyMC and admitteldy much more comfortable in R and Stan than Py so I’m probably doing something rudimentary wrong and I’m not tracking all the variations between different versions of PyMC. I have a very simple model that I’ve been toying with to learn that I’ve copied from here. I had started with more complex modeling but realized that I was in way over my head.

Anyhow, here is my attempt to get some predictions:

# Predictions - generate new data
new_data = np.random.randn(3, 3)

#concat my existing features with new ones?
tmp = np.concatenate((features, new_data))

# should coords have 3 or 103? Neither seems to work confusingly.
new_coords = {
          # dim 1: len of df
          'obs_id': np.arange(0,3), #is this all coords from concat'd set or just for new?
          # dim 2: feature cats.
          'features': ['treatment', 'cov1', 'cov2']
}

with hierarchical_model:
    # set it over the feature data? Is this needed? doesn’t work anyways.
    hierarchical_model.set_data(name = "feature_data", values = tmp, coords=new_coords)
    # generate preds, seems to work
    predictions = pm.sample_posterior_predictive(hierarchical_trace, predictions=True)
    # this doesn’t exist any more, not sure how to replace it?
    preds = az.from_pymc3_predictions(predictions,
                              coords=new_coords,
                              idata_orig=lm_idata,
                              inplace=False)

# # get y-hat for new X data
preds.predictions['y'].median(dim=('chain', 'draw')).values

# # view y-hat as 94% HDI
az.plot_posterior(preds, group="predictions");

I’m guessing that there’s some trickery with how we set the co-ordinates that I can’t seem to puzzle out from the docs and sadly much of the code that I can find to reference no longer seems to work. Any advice much apprecated!

The syntax from the example you linked is somewhat out of date. coords are now deeply integrated into PyMC, so you don’t have to do all this juggling with az.from_pymc. Here’s a simplified model:

coords = {'features': ['treatment', 'cov1', 'cov2']}
coords_mutable = {'obs_id': np.arange(len(target))}

with pm.Model(coords=coords, coords_mutable=coords_mutable) as sk_lm:
    feature_data = pm.MutableData('feature_data', features, dims=('obs_id', 'features'))

    alpha = pm.Normal('alpha', mu=0, sigma=1)
    betas = pm.Normal('betas', mu=[40, 90, 50], sigma=5, dims='features')
    sigma = pm.Exponential("sigma", lam=1)

    mu = pm.Deterministic("mu", alpha + feature_data @ betas, dims='obs_id')

    y = pm.Normal("y", mu=mu, sigma=sigma, observed=target, dims='obs_id')

    idata = pm.sample(init='jitter+adapt_diag')

Since you’re planning to do out-of-sample stuff, you want to set the index coordinate with coords_mutable. That lets you change it down the road. Down the road is here:

new_data = np.random.randn(3, 3)

with sk_lm:
    pm.set_data({"feature_data": new_data}, 
                coords={'obs_id':np.arange(3) + len(target)})
    idata = pm.sample_posterior_predictive(idata, predictions=True, extend_inferencedata=True)

Since I used coords_mutable, I can pass new coords to pm.set_data. This is a common gotcha with out-of-sample prediction – you have to make sure the shapes of the targets somehow get updated to match the new data, even though you’re not using targets. This is discussed a bit more in-depth here.

1 Like