How do I predict on new, unseen data using GLM?

OriolAbril · July 11, 2020, 5:03pm

The results in predictions will not represent predictions for the y values but for its mean. In general the quantity of interest is y and not its mean which would require generating random samples from a gaussian with mean equal to pt['Intercept'] + pt['x[0]']*X[:,0] + pt['x[1]']*X[:,1] and stardard deviation equal to pt['sd']. If you were only interested in the mean as a point estimate, then both approaches would return the same result, their uncertainties however, can be radically different (see plot at the bottom of the answer).

Moreover, I want to note that these kind of operations can be greatly simplified using ArviZ (with xarray under the hood). I am assuming that the loop in range(n_samples) is to avoid broadcasting issues. With xarray dimensions are named and can therefore be aligned and broadcasted automatically. The complete code example is available in this notebook. The first step (skipped here) is to convert PyMC3 trace to ArviZ InferenceData, then initialize the new_data as an xarray object and finally apply the same formula used in the loop above and let xarray broadcast.

n = 5
new_data_0 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"]
)
new_data_1 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"]
)
pred_mean = (
    idata.posterior["Intercept"] +
    idata.posterior["x[0]"] * new_data_0 +
    idata.posterior["x[1]"] * new_data_1
)

We now have the means of the predicted y values. To get the actual predictions we need to get draws from Normal(\text{pred_mean}, sd). We can do this combining numpy and xarray.

predictions = xr.apply_ufunc(lambda mu, sd: rng.normal(mu, sd), pred_mean, idata.posterior["sd"])

To illustrate the difference between the means and the y values, we can compare the distributions of pred_mean and predictions:

Topic		Replies	Views
How do I predict on new, unseen real data using pm.sample_posterior_predictive? Questions	13	8291	January 7, 2021
Sampling from glm posterior Questions	1	438	July 10, 2018
Is this the correct way to do multivariate regression without using GLM? Questions	2	477	July 20, 2020
Predicting on new data with gp.conditional Questions	7	1309	November 10, 2020
GP Predict Point Questions	9	986	March 6, 2020

How do I predict on new, unseen data using GLM?

Related topics