How do I predict on new, unseen data using GLM?

The results in predictions will not represent predictions for the y values but for its mean. In general the quantity of interest is y and not its mean which would require generating random samples from a gaussian with mean equal to pt['Intercept'] + pt['x[0]']*X[:,0] + pt['x[1]']*X[:,1] and stardard deviation equal to pt['sd']. If you were only interested in the mean as a point estimate, then both approaches would return the same result, their uncertainties however, can be radically different (see plot at the bottom of the answer).

Moreover, I want to note that these kind of operations can be greatly simplified using ArviZ (with xarray under the hood). I am assuming that the loop in range(n_samples) is to avoid broadcasting issues. With xarray dimensions are named and can therefore be aligned and broadcasted automatically. The complete code example is available in this notebook. The first step (skipped here) is to convert PyMC3 trace to ArviZ InferenceData, then initialize the new_data as an xarray object and finally apply the same formula used in the loop above and let xarray broadcast.

n = 5
new_data_0 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"]
)
new_data_1 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"]
)
pred_mean = (
    idata.posterior["Intercept"] +
    idata.posterior["x[0]"] * new_data_0 +
    idata.posterior["x[1]"] * new_data_1
)

We now have the means of the predicted y values. To get the actual predictions we need to get draws from Normal(\text{pred_mean}, sd). We can do this combining numpy and xarray.

predictions = xr.apply_ufunc(lambda mu, sd: rng.normal(mu, sd), pred_mean, idata.posterior["sd"])

To illustrate the difference between the means and the y values, we can compare the distributions of pred_mean and predictions:

2 Likes