Since you just have a linear model, there’s no need to have a loop over the features. It’s best practice to vectorize everywhere you can. In this case, it also has the advantage of making it more apparent how to use the data containers. Here is your model using broadcast multiplication across data containers:
coords = {'feature': ['feature_1', 'feature_2']}
basic_model = pm.Model(coords=coords)

with basic_model:
    X = pm.MutableData('X', X_train.values)
    y = pm.MutableData('y', y_train.values)

    betas = pm.HalfNormal('coeff', sigma=2, dims=['feature'])
    decay = pm.Beta('decay', alpha=3, beta=3, dims=['feature'])

    contribution = pm.Deterministic('contribution', X * decay * betas, dims=[None, 'feature'])
    mu = contribution.sum(axis=-1)

    sigma = pm.HalfNormal('sigma', sigma=1)
    obs = pm.Normal("y_hat", mu=mu, sigma=sigma, observed=y, size=X.shape[0])
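For comparison, here is a sketch of what the loop-over-features version would look like (hypothetical code, shown only to make the contrast concrete; the broadcast model above is what I'd actually use):

# Hypothetical loop version, for contrast only -- prefer the broadcast model above
with pm.Model() as loop_model:
    X = pm.MutableData('X', X_train.values)
    y = pm.MutableData('y', y_train.values)

    contributions = []
    for i, name in enumerate(['feature_1', 'feature_2']):
        beta_i = pm.HalfNormal(f'coeff_{name}', sigma=2)
        decay_i = pm.Beta(f'decay_{name}', alpha=3, beta=3)
        contributions.append(X[:, i] * decay_i * beta_i)

    mu = sum(contributions)
    sigma = pm.HalfNormal('sigma', sigma=1)
    obs = pm.Normal('y_hat', mu=mu, sigma=sigma, observed=y, size=X.shape[0])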
with basic_model:
    idata = pm.sample(init='jitter+adapt_diag_grad')
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)
Note the size=X.shape[0] argument in the likelihood term. Now you can get predictions for your out-of-sample values:
with basic_model:
    pm.set_data({'X': X_test, 'y': y_test})
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True, predictions=True)
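If you just want point predictions from the predictions group, one way (a sketch, assuming the posterior mean is an acceptable point estimate) is to average over chains and draws:

# Posterior-mean point predictions for the test set
y_pred = idata.predictions['y_hat'].mean(dim=['chain', 'draw']).values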
Then I made a KDE plot. The PPD looks good in this case, since the test values come from the same distribution as the training values (I just generated some samples and split them with sklearn.model_selection.train_test_split). I had to hack the PPD plot together because arviz doesn't support using the predictions group in az.plot_ppc.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.kdeplot(y_test.values, ax=ax, color='k', lw=2, zorder=10000)

ppd_test = idata.predictions.y_hat.stack(sample=['chain', 'draw']).values
for sample in ppd_test.T[:250]:
    sns.kdeplot(sample, ax=ax, color='tab:blue', alpha=0.15)

ax.set_title('Out of Sample PPC')
plt.show()
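As a quick numeric companion to the plot, you can check how many test points land inside a given HDI of their posterior predictive. A sketch (the 94% level here is just arviz's default, nothing special):

# Fraction of test points inside the 94% HDI of the posterior predictive
hdi_test = az.hdi(idata.predictions, hdi_prob=0.94)['y_hat']
inside = ((y_test.values >= hdi_test.sel(hdi='lower').values)
          & (y_test.values <= hdi_test.sel(hdi='higher').values))
print(f'{inside.mean():.1%} of test points inside the 94% HDI')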
Another application is to sample the space of observed data more densely, and use it to draw nice HDIs. Here I sweep each feature in turn over a grid while holding the other at its mean value (a cheap way of marginalizing over it), use set_data to sample from the model, and plot the resulting HDI along with the training and test data.
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(14, 4), sharey=True)
for i, axis in enumerate(fig.axes):
    x = X_train.values[:, i]
    axis.scatter(x, y_train, label='Training')
    axis.scatter(X_test.values[:, i], y_test.values, color='tab:orange', label='Test')

    # Hold the other feature at its mean, sweep this one over a grid
    x_mean = np.ones((1000, 2)) * X_train.values.mean(axis=0)
    x_grid = np.linspace(x.min(), x.max(), 1000)
    x_mean[:, i] = x_grid

    with basic_model:
        pm.set_data({'X': x_mean})
        temp_idata = pm.sample_posterior_predictive(idata)

    hdi = az.hdi(temp_idata.posterior_predictive, hdi_prob=0.69).y_hat
    axis.fill_between(x_grid, *hdi.values.T, alpha=0.25, color='tab:blue', label='69% HDI')
    axis.set(xlabel=f'Feature {i}')

ax[0].legend()
plt.show()
Note that I didn’t need to supply a value of y when I did pm.set_data in this code block. That’s because I passed size=X.shape[0] when I built the model, so PyMC is smart enough to dynamically resize y_hat. If you omitted that, you would have to pass in placeholder values of the correct size for y.
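For completeness, a sketch of what that placeholder would look like if the model had been built without size=X.shape[0] (np.zeros is arbitrary here; only the shape matters, the values are never used):

with basic_model:
    # Placeholder y of matching length so PyMC can infer the new shape
    pm.set_data({'X': x_mean, 'y': np.zeros(x_mean.shape[0])})
    temp_idata = pm.sample_posterior_predictive(idata)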