Linear regression: prediction on holdout dataset

Hi there,

I’ve built this simple linear regression model:

# X_train is my data matrix, shape (n_samples, n_predictors)
# I'm using the QR decomposition to decorrelate the predictors & make HMC's job easier
Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
Q_train = Q_train_ * np.sqrt(n_samples - 1)
R_train = R_train_ / np.sqrt(n_samples - 1)
R_train_inv = np.linalg.inv(R_train)

with pm.Model() as linear_model:
    # std for likelihood
    s = pm.HalfNormal('s', 2.)
    # slopes (one per predictor)
    beta = pm.Normal('beta', 0., 2., shape=n_predictors)
    t_beta = pm.Deterministic('t_beta',, beta))
    # single intercept
    alpha = pm.Normal('alpha', mu=0., sd=2.)
    # expected value
    y_hat = alpha +, beta)
    # data likelihood
    y = pm.StudentT('y', nu=2, mu=y_hat, sd=s, observed=y_train)
    # prior predictive
    prior_predictive = pm.sample_prior_predictive(samples=1000)
    # sample
    trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
    # PPC
    posterior_predictive = pm.sample_posterior_predictive(trace)

What I want is to predict the regression target y_test for a set of “holdout” samples X_test. I’m looking for an estimate of the uncertainty in y_test. What’s the simplest way to achieve this? Do I have to use shared Theano variables? My test data are not available at train time so what I’ll be doing is (1) fit & save the model (pickle, or is there a better option?) and (2) restore the model each time I receive new data to forecast on. The model takes a considerable time to fit so re-doing the fitting at forecast time is not an option.

Thank you!

PS: I’ve scanned this forum for similar questions, e.g.

but none of these seem to be quite what I’m looking for.

1 Like

I think you should be able to wrap X_train (well, actually Q_train and R_train_inv, which you use in the model) in the pm.Data container, as shown here.
Then, ArviZ’s to/from net_cdf functions should come in handy, and you set the new X data after having computed the new QR decomposition to predict y_test with pm.sample_posterior_predictive.

To sum up, at train time:

with pm.Model() as linear_model:
    R_inv = pm.Data('R_inv', R_train_inv)
    t_beta = pm.Deterministic('t_beta',, beta)
    Q = pm.Data('Q', Q_train)
    y_hat = alpha +, beta)
    trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
    idata = az.from_pymc3(trace)
# out of model context
az.to_netcdf(idata, "path/to/idata")

Then, at forecast time:

idata = az.from_netcdf("path/to/idata")
# extract model object from idata to define context:
with linear_model:
    pm.set_data({'R_inv': R_test_inv, "Q": Q_test})
    y_test = pm.sample_posterior_predictive(idata)

The one doubt I have with this workflow, is whether Arviz InferenceData retains the model object information, so that you can reopen the context for posterior predictive sampling when reloading the idata. @OriolAbril, do you confirm it’s possible?


Interesting, thank you! I wonder if your approach would also work with a hierarchical model (as described in this thread)?

    # std for likelihood
    s = pm.HalfNormal('sigma', 2.)
    # covariance matrix
    packed_L_Omega = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=1, sd_dist=pm.HalfNormal.dist(1.))
    L_Omega = pm.expand_packed_triangular(n_predictors, packed_L_Omega)
    # group mean
    mu = pm.Normal('mu', 0., 5., shape=(n_predictors, n_groups))
    # group slopes (non-centered parameterization)
    beta_base = pm.Normal('beta_base', 0., 1., shape=(n_predictors, n_groups))
    beta = pm.Deterministic('beta', mu +, beta_base))
    t_beta = pm.Deterministic('t_beta',, beta))
    # group intercepts
    alpha = pm.Normal('alpha', 0., 5., shape=(n_samples, n_groups))
    # likelihood:
    # group_idx is a `(n_samples,)` array of integer group indices
    y_hat = alpha[group_idx] + (Q_train * beta[:, group_idx].T).sum(axis=-1)
    y = StudentT('y', nu=2, mu=y_hat, sd=s, observed=y_train)

The n_groups doesn’t change between train and test time. group_idx does change, though - in accordance with the new X_test. Thoughts?

Yeah I do stuff like that a lot, so I think it should work here – you could have some shape issues to take care of, but I’m not even sure since n_groups doesn’t change.

Note that if you’re running PyMC3 <= 3.8 you’ll have to cast group_idx as int: pm.intX(pm.Data('group_idx', group_idx)). This is taken care of automatically on master and on future 3.9 :wink:

Also on master and future releases, you can parametrize Cholesky Cov more easily, getting back the expanded Cholesky factor, the correlation matrix and the standard deviations:

L_Omega, Rho, sigmas = pm.LKJCholeskyCov('packed_L_Omega', n=n_predictors, eta=1, sd_dist=pm.HalfNormal.dist(1.), compute_corr=True)

The model cannot be stored inside the netcdf file, to be able to do this workflow it should be saved separately with pickle (I am quite sure I have seen examples of this, but I don’t remember where right now)

Yeah, the classical way is to do something like that I think:

with open("file", "wb") as f:
    pickle.dump({"model": model, "trace": trace}, f)

with open("file", "rb") as f:
    data = pickle.load(f)
model, trace = data["model"], data["trace"]

The workflow with netcdf is more robust and cross-platform though. Do you think it would be possible to add this functuionality to ArviZ (i.e saving the model in netcdf)?


Moreover, in development PyMC3 or PyMC3>=3.9, there would be no need to save the trace if storing the inference data object, the posterior group can be passed to sample_posterior_predictive. If possible, I’d recommend this approach.

I think the model object cannot be stored in a NetCDF file, but we can try to dig a little deeper to confirm.


Very useful information - thank you both!

When do you expect pymc3 3.9 to be released?

We’re in the final stretch now, but I can’t really say exactly – probably by the end of June? But it’s a soft deadline :man_shrugging: