Using sample posterior predictive on new data

Hello! I am very new to PyMC and Bayesian modeling in general. I am following the example code from chapter 8 of Statistical Rethinking and trying to replicate it in PyMC, but I run into dimensionality errors when I try to extend my model to a new data set. I saw a post that already addresses this topic ("Setting new data for predictions, conflicting size with dims" on the PyMC Discourse), but I am still confused about how to fix the issue. Any clarification would be appreciated. Here is my model setup and the code used to run the posterior predictive:

import numpy as np
import pandas as pd
import pymc as pm

continent_labels, continent = pd.factorize(df_standard.cont_africa)

coord = {
    "features": ["rugged_std"],
    "obs_id": np.arange(df_standard.shape[0]),
    "continent": continent.values
}
with pm.Model(coords=coord) as m8_3:
    rugged_std = pm.Data("rugged_std", df_standard.rugged_std.values, dims="obs_id")
    continent_indx = pm.Data("continent_indx", continent_labels, dims="obs_id")

    # priors
    alpha = pm.Normal("alpha", mu=1, sigma=0.1, dims="continent")
    beta = pm.Normal("beta", mu=0, sigma=0.3, dims="features")
    sigma = pm.Exponential("sigma", 1)

    # Deterministic
    mu = pm.Deterministic("mu", alpha[continent_indx] + (rugged_std - rugged_std.mean()) * beta[0], dims="obs_id")

    # Likelihood
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=df_standard.log_gdp_std, dims="obs_id")

with m8_3:
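    # Sample the posterior first, then add the prior and posterior predictive
    # draws to the same InferenceData so that nothing gets overwritten
    idata3 = pm.sample(idata_kwargs={"log_likelihood": True})
    idata3.extend(pm.sample_prior_predictive(draws=100))
    idata3.extend(pm.sample_posterior_predictive(idata3))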

I am getting the error at this code chunk:

rugged_seq = np.linspace(-0.1, 1.1, 30)
continent_pred = np.repeat(0, len(rugged_seq))

with m8_3:
    pm.set_data({
        "rugged_std": rugged_seq,
        "continent_indx": continent_pred
    }, coords = {"obs_id": np.arange(rugged_seq.shape[0])})

    mu_pred = pm.sample_posterior_predictive(idata3, var_names=["mu"])

With this error message:

ValueError: conflicting sizes for dimension 'obs_id': length 170 on the data but length 30 on coordinate 'obs_id'

Thanks for any help!

Hi @Alexander_Grunewald,

I believe the mismatch is on your target variable y. You can either pass zeros of the new shape to pm.set_data() (these aren’t used when computing your posterior predictive):

pm.set_data({
    "rugged_std": rugged_seq,
    "continent_indx": continent_pred,
    "y": np.zeros_like(rugged_seq)
}, coords={"obs_id": np.arange(rugged_seq.shape[0])})

or you can pass the argument predictions=True into pm.sample_posterior_predictive():

mu_pred = pm.sample_posterior_predictive(idata3, var_names=["mu"], predictions=True)

y isn’t a pm.Data container in the provided code, so you won’t be able to call set_data on it. Instead, you can create y_data = pm.Data('y_data', df_standard.log_gdp_std, dims=['obs_id']), pass it to the likelihood as observed=y_data, and then include y_data in set_data.

But in general this is a “sharp edge” of PyMC. When doing out-of-sample prediction, you need to pass in dummy data to update the static shape of the targets, even though the predictions won’t be conditioned on the observed values.

Ah yes, I totally missed that the target was not made into a pm.Data() object. Thank you for catching that @jessegrabowski!

I also want to add one thing I have run into in the past: if you have missing data in your target variable and you want automatic imputation, you won’t be able to turn your target into a pm.Data() object. In that case, I believe you need to pass the target directly to your likelihood and specify a separate model specifically for out-of-sample predictions. Here is a resource that I go back to time and time again for out-of-sample predictions.

Thank you @jessegrabowski and @Dekermanjian. Setting the dummy data for y worked!