PyMC 5 out of sample

Since you just have a linear model, there's no need to loop over the features. It's best practice to vectorize wherever you can, and in this case vectorizing also makes it more apparent how to use the data containers. Here is your model using broadcast multiplication across the data containers:

import pymc as pm

coords = {'feature': ['feature_1', 'feature_2']}

basic_model = pm.Model(coords=coords)

with basic_model:
    X = pm.MutableData('X', X_train.values)
    y = pm.MutableData('y', y_train.values)

    betas = pm.HalfNormal('coeff', sigma=2, dims=['feature'])
    decay = pm.Beta('decay', alpha=3, beta=3, dims=['feature'])

    # Broadcast multiplication across the feature dimension replaces the loop
    contribution = pm.Deterministic('contribution', X * decay * betas, dims=[None, 'feature'])

    mu = contribution.sum(axis=-1)
    sigma = pm.HalfNormal('sigma', sigma=1)

    obs = pm.Normal('y_hat', mu=mu, sigma=sigma, observed=y, size=X.shape[0])

with basic_model:
    idata = pm.sample(init='jitter+adapt_diag_grad')
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

Note the size=X.shape[0] argument in the likelihood term. Now you can get predictions for your out-of-sample values:

with basic_model:
    pm.set_data({'X': X_test, 'y': y_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)

And make a KDE plot. The PPD looks good in this case, since the test values come from the same distribution as the training values (I just generated some samples and split them with sklearn.model_selection.train_test_split). I had to hack the plot together by hand, because ArviZ doesn't support using the predictions group in az.plot_ppc.

import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()

# KDE of the held-out data, drawn on top of KDEs of posterior predictive samples
sns.kdeplot(y_test.values, ax=ax, color='k', lw=2, zorder=10000)
ppd_test = idata.predictions.y_hat.stack(sample=['chain', 'draw']).values

for sample in ppd_test.T[:250]:
    sns.kdeplot(sample, ax=ax, color='tab:blue', alpha=0.15)
ax.set_title('Out of Sample PPC')
plt.show()

(Figure: "Out of Sample PPC", the test-set KDE in black over posterior predictive KDEs in blue.)
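If you'd rather use az.plot_ppc directly, one possible workaround (an untested sketch, which assumes plot_ppc only needs matching variable names between the two groups) is to repackage the predictions group and the held-out targets into a fresh InferenceData:

# Untested sketch: wrap the predictions group and the held-out targets in a
# new InferenceData so az.plot_ppc can pair them up by variable name.
import arviz as az
import xarray as xr

idata_test = az.InferenceData(
    posterior_predictive=idata.predictions,
    observed_data=xr.Dataset({'y_hat': ('obs_idx', y_test.values)}),
)
az.plot_ppc(idata_test)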

Another application is to sample the space of observed data more densely and use it to draw nice HDIs. Here I sweep each feature over a grid in turn while holding the other at its mean value (a rough way to "marginalize" over it), use set_data to sample from the model, and plot the resulting HDI along with the training and test data.

import arviz as az
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(1, 2, figsize=(14, 4), sharey=True)

for i, axis in enumerate(fig.axes):
    x = X_train.values[:, i]
    axis.scatter(x, y_train, label='Training')
    axis.scatter(X_test.values[:, i], y_test.values, color='tab:orange', label='Test')

    # Hold the other feature at its mean and sweep feature i over a dense grid
    x_mean = np.ones((1000, 2)) * X_train.values.mean(axis=0)
    x_grid = np.linspace(x.min(), x.max(), 1000)
    x_mean[:, i] = x_grid

    with basic_model:
        pm.set_data({'X': x_mean})
        temp_idata = pm.sample_posterior_predictive(idata)
        hdi = az.hdi(temp_idata.posterior_predictive, hdi_prob=0.69).y_hat

    axis.fill_between(x_grid, *hdi.values.T, alpha=0.25, color='tab:blue', label='69% HDI')
    axis.set(xlabel=f'Feature {i}')

ax[0].legend()
plt.show()

Note that I didn't need to supply a value for y when I called pm.set_data in this code block. That's because I passed size=X.shape[0] when I built the model, so PyMC is smart enough to resize y_hat dynamically. If you omit that, you have to pass placeholder values of the correct size for y, as sketched below.
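Concretely, that would look something like this (a hypothetical sketch, reusing x_mean from the loop above):

# Hypothetical: if y_hat had been declared without size=X.shape[0], you would
# need to supply a dummy y of the right length before sampling predictions.
with basic_model:
    pm.set_data({'X': x_mean, 'y': np.zeros(x_mean.shape[0])})
    temp_idata = pm.sample_posterior_predictive(idata)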

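For reference, here is a rough sketch of how data like the above could be generated; the "true" parameter values here are invented for illustration:

# Hypothetical data-generating process consistent with the model above.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
true_betas = np.array([1.5, 0.8])   # positive, to match the HalfNormal prior
true_decay = np.array([0.6, 0.3])   # in (0, 1), to match the Beta prior

X = pd.DataFrame(rng.exponential(1.0, size=(n, 2)),
                 columns=['feature_1', 'feature_2'])
noise = rng.normal(0, 0.5, size=n)
y = pd.Series((X.values * true_decay * true_betas).sum(axis=-1) + noise)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)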