# Linear regression: prediction on holdout dataset

Hi there,

I’ve built this simple linear regression model:

# X_train is my data matrix, shape (n_samples, n_predictors)
# I'm using the QR decomposition to decorrelate the predictors & make HMC's job easier
Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
Q_train = Q_train_ * np.sqrt(n_samples - 1)
R_train = R_train_ / np.sqrt(n_samples - 1)
R_train_inv = np.linalg.inv(R_train)

with pm.Model() as linear_model:
# std for likelihood
s = pm.HalfNormal('s', 2.)
# slopes (one per predictor)
beta = pm.Normal('beta', 0., 2., shape=n_predictors)
t_beta = pm.Deterministic('t_beta', pm.math.dot(R_train_inv, beta))
# single intercept
alpha = pm.Normal('alpha', mu=0., sd=2.)
# expected value
y_hat = alpha + pm.math.dot(Q_train, beta)
# data likelihood
y = pm.StudentT('y', nu=2, mu=y_hat, sd=s, observed=y_train)
# prior predictive
prior_predictive = pm.sample_prior_predictive(samples=1000)
# sample
trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
# PPC
posterior_predictive = pm.sample_posterior_predictive(trace)

What I want is to predict the regression target y_test for a set of “holdout” samples X_test. I’m looking for an estimate of the uncertainty in y_test. What’s the simplest way to achieve this? Do I have to use shared Theano variables? My test data are not available at train time so what I’ll be doing is (1) fit & save the model (pickle, or is there a better option?) and (2) restore the model each time I receive new data to forecast on. The model takes a considerable time to fit so re-doing the fitting at forecast time is not an option.

Thank you!

PS: I’ve scanned this forum for similar questions, e.g.

but none of these seem to be quite what I’m looking for.

1 Like

Hi,
I think you should be able to wrap X_train (well, actually Q_train and R_train_inv, which you use in the model) in the pm.Data container, as shown here.
Then, ArviZ’s to/from net_cdf functions should come in handy, and you set the new X data after having computed the new QR decomposition to predict y_test with pm.sample_posterior_predictive.

To sum up, at train time:

with pm.Model() as linear_model:
...
R_inv = pm.Data('R_inv', R_train_inv)
t_beta = pm.Deterministic('t_beta', pm.math.dot(R_inv, beta)
...
Q = pm.Data('Q', Q_train)
y_hat = alpha + pm.math.dot(Q, beta)
...
trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
idata = az.from_pymc3(trace)
# out of model context
az.to_netcdf(idata, "path/to/idata")

Then, at forecast time:

idata = az.from_netcdf("path/to/idata")
# extract model object from idata to define context:
with linear_model:
pm.set_data({'R_inv': R_test_inv, "Q": Q_test})
y_test = pm.sample_posterior_predictive(idata)

The one doubt I have with this workflow, is whether Arviz InferenceData retains the model object information, so that you can reopen the context for posterior predictive sampling when reloading the idata. @OriolAbril, do you confirm it’s possible?

2 Likes

Interesting, thank you! I wonder if your approach would also work with a hierarchical model (as described in this thread)?

# std for likelihood
s = pm.HalfNormal('sigma', 2.)
# covariance matrix
packed_L_Omega = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=1, sd_dist=pm.HalfNormal.dist(1.))
L_Omega = pm.expand_packed_triangular(n_predictors, packed_L_Omega)
# group mean
mu = pm.Normal('mu', 0., 5., shape=(n_predictors, n_groups))
# group slopes (non-centered parameterization)
beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
beta = pm.Deterministic('beta', mu + pm.math.dot(L_Omega, beta_base))
t_beta = pm.Deterministic('t_beta', pm.math.dot(R_train_inv, beta))
# group intercepts
alpha = pm.Normal('alpha', 0., 5., shape=(n_samples, n_groups))
# likelihood:
# group_idx is a `(n_samples,)` array of integer group indices
y_hat = alpha[group_idx] + (Q_train * beta[:, group_idx].T).sum(axis=-1)
y = StudentT('y', nu=2, mu=y_hat, sd=s, observed=y_train)

The n_groups doesn’t change between train and test time. group_idx does change, though - in accordance with the new X_test. Thoughts?

Yeah I do stuff like that a lot, so I think it should work here – you could have some shape issues to take care of, but I’m not even sure since n_groups doesn’t change.

Note that if you’re running PyMC3 <= 3.8 you’ll have to cast group_idx as int: pm.intX(pm.Data('group_idx', group_idx)). This is taken care of automatically on master and on future 3.9

Also on master and future releases, you can parametrize Cholesky Cov more easily, getting back the expanded Cholesky factor, the correlation matrix and the standard deviations:

L_Omega, Rho, sigmas = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=1, sd_dist=pm.HalfNormal.dist(1.), compute_corr=True)
2 Likes

The model cannot be stored inside the netcdf file, to be able to do this workflow it should be saved separately with pickle (I am quite sure I have seen examples of this, but I don’t remember where right now)

Yeah, the classical way is to do something like that I think:

with open("file", "wb") as f:
pickle.dump({"model": model, "trace": trace}, f)

with open("file", "rb") as f:
model, trace = data["model"], data["trace"]

The workflow with netcdf is more robust and cross-platform though. Do you think it would be possible to add this functuionality to ArviZ (i.e saving the model in netcdf)?

2 Likes

Moreover, in development PyMC3 or PyMC3>=3.9, there would be no need to save the trace if storing the inference data object, the posterior group can be passed to sample_posterior_predictive. If possible, I’d recommend this approach.

I think the model object cannot be stored in a NetCDF file, but we can try to dig a little deeper to confirm.

2 Likes

Very useful information - thank you both!

When do you expect pymc3 3.9 to be released?

We’re in the final stretch now, but I can’t really say exactly – probably by the end of June? But it’s a soft deadline