How do I predict on new, unseen data using GLM?

Hi, I’m trying to do a simple multivariate regression using GLM. I’m able to set up the model and sample from the posterior, but I’m very confused about how to actually generate new predictions.

Here’s an example…

import numpy as np
import pandas as pd
import pymc3 as pm

data = pd.DataFrame({
    'var1': np.random.random_sample(1000),
    'var2': np.random.random_sample(1000),
    'y': np.random.random_sample(1000),
})
x = data[['var1', 'var2']]
y = data['y']

with pm.Model() as model:
    pm.glm.GLM.from_formula('y ~ x', data)
    trace = pm.sample(3000, cores=2)

samples = pm.sample_posterior_predictive(trace, 200, model)

All good till there. But now say I have new data that I want to create predictions for.

new_data = np.random.random_sample(100)

How would I go about doing this?


You can extract the coefficient samples from the trace and use those on new data points. That might look something like this:

n = 100
X = np.ones([n, 2])  # placeholder for the new data: n points, 2 predictors
n_samples = len(trace)
predictions = np.empty([n, n_samples])
for i in range(n_samples):
    pt = trace[i]  # parameter values for posterior draw i
    predictions[:, i] = pt['Intercept'] + pt['x[0]']*X[:, 0] + pt['x[1]']*X[:, 1]

Alternatively, you can write the model using the non-GLM syntax, which lets you sample predictions on new data more easily. There are some examples here.


Note that the values in predictions are not predictions for y but for its mean. In general the quantity of interest is y itself, not its mean, and obtaining it requires drawing random samples from a Gaussian with mean equal to pt['Intercept'] + pt['x[0]']*X[:,0] + pt['x[1]']*X[:,1] and standard deviation equal to pt['sd']. If you are only interested in the mean as a point estimate, both approaches return the same result; their uncertainties, however, can be radically different (see plot at the bottom of the answer).
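
As a rough sketch of that extra step in plain NumPy: the arrays below are made-up stand-ins for posterior draws (with a PyMC3 trace they would presumably come from trace['Intercept'], trace['x[0]'], trace['x[1]'] and trace['sd']), and X is hypothetical new data, so the numbers only illustrate the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for posterior draws (assumed values, not output of the model above).
n_samples = 2000
intercept = rng.normal(0.5, 0.02, size=n_samples)
coef_0 = rng.normal(0.0, 0.03, size=n_samples)
coef_1 = rng.normal(0.0, 0.03, size=n_samples)
sd = np.abs(rng.normal(0.29, 0.01, size=n_samples))

# Hypothetical new data: n points with two predictors each.
n = 100
X = rng.random((n, 2))

# Mean of y for every (new point, posterior draw) pair, shape (n, n_samples).
pred_mean = intercept + np.outer(X[:, 0], coef_0) + np.outer(X[:, 1], coef_1)

# Predictions for y itself: Gaussian noise around each mean, using that draw's sd.
predictions = rng.normal(pred_mean, sd)

# The averages are close, but the spread across draws is not.
print(pred_mean.mean(), predictions.mean())
print(pred_mean.std(axis=1).mean(), predictions.std(axis=1).mean())
```

Both arrays have one row per new data point and one column per posterior draw, so summaries such as intervals can be taken along axis 1.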

Moreover, I want to note that these kinds of operations can be greatly simplified using ArviZ (with xarray under the hood). I am assuming that the loop over range(n_samples) is there to avoid broadcasting issues. With xarray, dimensions are named and can therefore be aligned and broadcast automatically. The complete code example is available in this notebook. The first step (skipped here) is to convert the PyMC3 trace to ArviZ InferenceData, then initialize the new data as xarray objects, and finally apply the same formula used in the loop above and let xarray broadcast.

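n = 5
# rng is a NumPy random Generator, e.g. np.random.default_rng()
# "pred_id" is an assumed name for the new-observation dimension; any label works
new_data_0 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"],
)
new_data_1 = xr.DataArray(
    rng.uniform(1, 1.5, size=n),
    dims=["pred_id"],
)
pred_mean = (
    idata.posterior["Intercept"] +
    idata.posterior["x[0]"] * new_data_0 +
    idata.posterior["x[1]"] * new_data_1
)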

We now have the means of the predicted y values. To get the actual predictions we need to draw from Normal(pred_mean, sd). We can do this by combining numpy and xarray.

predictions = xr.apply_ufunc(lambda mu, sd: rng.normal(mu, sd), pred_mean, idata.posterior["sd"])

To illustrate the difference between the means and the y values, we can compare the distributions of pred_mean and predictions:
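
A numerical version of that comparison can be sketched as follows. This is only an illustration: the posterior here is a synthetic xarray Dataset with made-up values standing in for idata.posterior, and the "pred_id" dimension name is an assumption.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Synthetic stand-in for idata.posterior: 2 chains, 500 draws each (made-up values).
shape = (2, 500)
dims = ("chain", "draw")
posterior = xr.Dataset(
    {
        "Intercept": (dims, rng.normal(0.5, 0.02, size=shape)),
        "x[0]": (dims, rng.normal(0.0, 0.03, size=shape)),
        "x[1]": (dims, rng.normal(0.0, 0.03, size=shape)),
        "sd": (dims, np.abs(rng.normal(0.29, 0.01, size=shape))),
    }
)

new_data_0 = xr.DataArray(rng.uniform(1, 1.5, size=5), dims=["pred_id"])
new_data_1 = xr.DataArray(rng.uniform(1, 1.5, size=5), dims=["pred_id"])

pred_mean = (
    posterior["Intercept"]
    + posterior["x[0]"] * new_data_0
    + posterior["x[1]"] * new_data_1
)
predictions = xr.apply_ufunc(
    lambda mu, sd: rng.normal(mu, sd), pred_mean, posterior["sd"]
)

# Spread across all posterior draws, one value per new observation.
mean_spread = pred_mean.std(dim=("chain", "draw"))
y_spread = predictions.std(dim=("chain", "draw"))
print(mean_spread.values)
print(y_spread.values)
```

With values like these, y_spread should come out several times larger than mean_spread, because the predictions include the observation noise sd on top of the uncertainty in the coefficients.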


Thanks so much for your replies!