import numpy as np
import pymc3 as pm

# X_train is my data matrix, shape (n_samples, n_predictors)
# I'm using the QR decomposition to decorrelate the predictors & make HMC's job easier
Q_train_, R_train_ = np.linalg.qr(X_train, mode='reduced')
Q_train = Q_train_ * np.sqrt(n_samples - 1)
R_train = R_train_ / np.sqrt(n_samples - 1)
R_train_inv = np.linalg.inv(R_train)

with pm.Model() as linear_model:
    # std for likelihood
    s = pm.HalfNormal('s', 2.)
    # slopes (one per predictor), defined on the decorrelated Q basis
    beta = pm.Normal('beta', 0., 2., shape=n_predictors)
    # back-transform to the original predictor scale: X_train @ t_beta == Q_train @ beta
    t_beta = pm.Deterministic('t_beta', pm.math.dot(R_train_inv, beta))
    # single intercept
    alpha = pm.Normal('alpha', mu=0., sd=2.)
    # expected value
    y_hat = alpha + pm.math.dot(Q_train, beta)
    # data likelihood
    y = pm.StudentT('y', nu=2, mu=y_hat, sd=s, observed=y_train)
    # prior predictive
    prior_predictive = pm.sample_prior_predictive(samples=1000)
    # sample
    trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
    # PPC
    posterior_predictive = pm.sample_posterior_predictive(trace)
What I want is to predict the regression target y_test for a set of “holdout” samples X_test. I’m looking for an estimate of the uncertainty in y_test. What’s the simplest way to achieve this? Do I have to use shared Theano variables? My test data are not available at train time, so what I’ll be doing is (1) fit & save the model (pickle, or is there a better option?) and (2) restore the model each time I receive new data to forecast on. The model takes considerable time to fit, so re-doing the fit at forecast time is not an option.
Thank you!
PS: I’ve scanned this forum for similar questions, but none of them seem to be quite what I’m looking for.
Hi,
I think you should be able to wrap X_train (well, actually Q_train and R_train_inv, which you use in the model) in the pm.Data container, as shown here.
Then, ArviZ’s to_netcdf/from_netcdf functions should come in handy, and you set the new X data, after having computed the new QR decomposition, to predict y_test with pm.sample_posterior_predictive.
To sum up, at train time:
import arviz as az

with pm.Model() as linear_model:
    ...
    R_inv = pm.Data('R_inv', R_train_inv)
    t_beta = pm.Deterministic('t_beta', pm.math.dot(R_inv, beta))
    ...
    Q = pm.Data('Q', Q_train)
    y_hat = alpha + pm.math.dot(Q, beta)
    ...
    trace = pm.sample(draws=1000, chains=2, cores=2, tune=1000)
    idata = az.from_pymc3(trace)
# out of model context
az.to_netcdf(idata, "path/to/idata")
Then, at forecast time:
idata = az.from_netcdf("path/to/idata")
# extract model object from idata to define context:
with linear_model:
    pm.set_data({"R_inv": R_test_inv, "Q": Q_test})
    y_test = pm.sample_posterior_predictive(idata)
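One caveat worth double-checking here, based on the algebra of the model above rather than anything stated in the thread: beta was learned in the training Q basis, so a fresh QR of X_test won’t line up with it. To get predictions back in the original predictor space, it seems safer to rotate the test inputs with the training R, something like:

# minimal sketch (my assumption about the intended prediction step, not from the thread):
# X_test @ t_beta == (X_test @ R_train_inv) @ beta, so reuse the training R_train_inv
Q_test = np.dot(X_test, R_train_inv)
with linear_model:
    pm.set_data({"Q": Q_test})  # R_inv can stay at its training value
    y_test_ppc = pm.sample_posterior_predictive(idata)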
The one doubt I have with this workflow is whether the ArviZ InferenceData retains the model object information, so that you can reopen the context for posterior predictive sampling when reloading the idata. @OriolAbril, do you confirm it’s possible?
Yeah, I do stuff like that a lot, so I think it should work here. You could have some shape issues to take care of, but I’m not even sure, since n_groups doesn’t change.
Note that if you’re running PyMC3 <= 3.8 you’ll have to cast group_idx as int: pm.intX(pm.Data('group_idx', group_idx)). This is taken care of automatically on master and in the upcoming 3.9.
Also, on master and in future releases, you can parametrize the Cholesky covariance more easily, getting back the expanded Cholesky factor, the correlation matrix and the standard deviations.
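For reference, a minimal sketch of what that might look like on master / the 3.9 line (the specific dimensions and priors here are my illustration, not from the thread):

with pm.Model():
    # compute_corr=True returns the unpacked Cholesky factor, the correlation matrix
    # and the standard deviations instead of the packed vector
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol_cov", n=3, eta=2.0,
        sd_dist=pm.Exponential.dist(1.0),
        compute_corr=True,
    )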
The model cannot be stored inside the NetCDF file; to be able to do this workflow, it should be saved separately with pickle (I am quite sure I have seen examples of this, but I don’t remember where right now).
Yeah, the classic way is to do something like this, I think:
with open("file", "wb") as f:
pickle.dump({"model": model, "trace": trace}, f)
with open("file", "rb") as f:
data = pickle.load(f)
model, trace = data["model"], data["trace"]
The workflow with NetCDF is more robust and cross-platform though. Do you think it would be possible to add this functionality to ArviZ (i.e. saving the model in the NetCDF file)?
Moreover, in development PyMC3 or PyMC3 >= 3.9, there would be no need to save the trace if you store the InferenceData object; the posterior group can be passed to sample_posterior_predictive. If possible, I’d recommend this approach.
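As a rough sketch of that (assuming a PyMC3 >= 3.9 sample_posterior_predictive that accepts the posterior group, as described above):

with linear_model:
    pm.set_data({"Q": Q_test})
    # feed the posterior group from the reloaded InferenceData directly
    ppc = pm.sample_posterior_predictive(idata.posterior)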
I think the model object cannot be stored in a NetCDF file, but we can try to dig a little deeper to confirm.
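To make the split concrete, here is a hedged sketch of the persistence side of the workflow discussed above: the InferenceData goes to NetCDF and the model object is pickled separately (file names are placeholders):

import pickle
import arviz as az

# at train time
az.to_netcdf(idata, "idata.nc")
with open("linear_model.pkl", "wb") as f:
    pickle.dump(linear_model, f)

# at forecast time
idata = az.from_netcdf("idata.nc")
with open("linear_model.pkl", "rb") as f:
    linear_model = pickle.load(f)
# then reopen the model context, set the test data, and sample the posterior predictive as above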