# How do I predict on new, unseen data using GLM?

Hi, I’m trying to do a simple multivariate regression using GLM. I’m able to set up the model and sample from the posterior, but I’m confused about how to actually generate new predictions.

Here’s an example…

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'var1': np.random.random_sample(1000),
                     'var2': np.random.random_sample(1000),
                     'y': np.random.random_sample(1000)})
x = data[['var1', 'var2']]
y = data['y']
```

```python
import pymc3 as pm

with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    trace = pm.sample(3000, cores=2)

samples = pm.sample_posterior_predictive(trace, 200, model)
```



All good up to there. But now say I have new data that I want to create predictions for:

```python
new_data = np.random.random_sample(100)
```


How would I go about doing this?

Thanks.

You can extract the coefficient samples from the trace and use those on new data points. That might look something like this:

```python
n = 100
X = np.ones([n, 2])  # placeholder: fill with your new predictor values
n_samples = len(trace)
predictions = np.empty([n, n_samples])
for i in range(n_samples):
    pt = trace[i]
    predictions[:, i] = pt['Intercept'] + pt['x[0]'] * X[:, 0] + pt['x[1]'] * X[:, 1]
```
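The loop over posterior samples can also be vectorized with a single matrix product. A minimal sketch, using made-up arrays in place of the draws you would extract from `trace` (the names and values here are illustrative, not from the model above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for trace['Intercept'], trace['x[0]'], trace['x[1]']
n_samples = 500
intercept = rng.normal(0.5, 0.05, size=n_samples)
beta = rng.normal(0.0, 0.1, size=(n_samples, 2))

# New design matrix: 100 points, 2 predictors
n = 100
X = rng.random((n, 2))

# One matrix product replaces the Python loop:
# (n, 2) @ (2, n_samples) + (n_samples,) -> (n, n_samples)
predictions = X @ beta.T + intercept

print(predictions.shape)  # (100, 500)
```

Each column of `predictions` is then the mean prediction for all new points under one posterior draw, the same quantity the loop computes.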


Alternatively, you can write the model using the non-GLM syntax, which lets you more easily sample predictions on new data. There are some examples here


The results in predictions will not represent predictions for the y values but for their mean. In general the quantity of interest is y itself and not its mean; predicting y would require generating random samples from a Gaussian with mean equal to pt['Intercept'] + pt['x[0]']*X[:,0] + pt['x[1]']*X[:,1] and standard deviation equal to pt['sd']. If you were only interested in the mean as a point estimate, then both approaches would return the same result; their uncertainties, however, can be radically different (see plot at the bottom of the answer).
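In code, the extra step is one `rng.normal` call per posterior draw. A small self-contained sketch, with hypothetical draws standing in for `trace['Intercept']`, `trace['x[0]']`, `trace['x[1]']`, and `trace['sd']`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws (tight posteriors, noticeable noise sd)
n_samples = 1000
intercept = rng.normal(0.5, 0.02, size=n_samples)
b0 = rng.normal(0.1, 0.02, size=n_samples)
b1 = rng.normal(-0.2, 0.02, size=n_samples)
sd = np.abs(rng.normal(0.3, 0.02, size=n_samples))

x_new = np.array([0.4, 0.7])  # a single new data point

# Mean of y for each posterior draw (what the loop above computes) ...
mean_pred = intercept + b0 * x_new[0] + b1 * x_new[1]
# ... versus draws of y itself, which add the observation noise sd
y_pred = rng.normal(mean_pred, sd)

print(mean_pred.std(), y_pred.std())
```

With these numbers the spread of `y_pred` is dominated by `sd`, so it comes out much wider than the spread of `mean_pred`, which is the point of the distinction above.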

Moreover, I want to note that these kinds of operations can be greatly simplified using ArviZ (with xarray under the hood). I am assuming that the loop over range(n_samples) is there to avoid broadcasting issues. With xarray, dimensions are named and can therefore be aligned and broadcast automatically. The complete code example is available in this notebook. The first step (skipped here) is to convert the PyMC3 trace to ArviZ InferenceData, then initialize the new data as xarray objects, and finally apply the same formula used in the loop above and let xarray broadcast.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng()

n = 5
new_data_0 = xr.DataArray(rng.uniform(1, 1.5, size=n), dims=["pred_id"])
new_data_1 = xr.DataArray(rng.uniform(1, 1.5, size=n), dims=["pred_id"])
pred_mean = (
    idata.posterior["Intercept"]
    + idata.posterior["x[0]"] * new_data_0
    + idata.posterior["x[1]"] * new_data_1
)
```


We now have the means of the predicted y values. To get the actual predictions we need draws from Normal(\text{pred\_mean}, sd). We can do this by combining NumPy and xarray.

```python
predictions = xr.apply_ufunc(
    lambda mu, sd: rng.normal(mu, sd),
    pred_mean,
    idata.posterior["sd"],
)
```


To illustrate the difference between the means and the y values, we can compare the distributions of pred_mean and predictions:
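As a rough numeric stand-in for that comparison, here is a self-contained toy version (the posterior arrays are built by hand here rather than taken from the model above, purely for illustration):

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Toy posterior: 2 chains x 500 draws, made-up values
dims = ("chain", "draw")
intercept = xr.DataArray(rng.normal(0.5, 0.02, (2, 500)), dims=dims)
sd = xr.DataArray(np.abs(rng.normal(0.3, 0.02, (2, 500))), dims=dims)

new_data_0 = xr.DataArray(rng.uniform(1, 1.5, 5), dims=["pred_id"])

pred_mean = intercept + 0.1 * new_data_0  # toy slope of 0.1
predictions = xr.apply_ufunc(lambda mu, s: rng.normal(mu, s), pred_mean, sd)

# Spread over all chains/draws/points: the y draws are much wider
print(float(pred_mean.std()), float(predictions.std()))
```

The standard deviation of `predictions` absorbs the observation noise `sd` on top of the posterior uncertainty in the mean, which is exactly the gap the plot illustrates.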


Thanks so much for your replies!