Out of sample predictions from a pickled model

After obtaining a trace from my model, I can change the Theano predictors to generate out-of-sample predictions, as described in the docs:
https://docs.pymc.io/notebooks/posterior_predictive.html

# Changing values here will also change values in the model
predictors_shared.set_value(predictors_out_of_sample)
# Simply running PPC will use the updated values and do prediction
ppc = pm.sample_ppc(trace, model=model, samples=100)
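For context, this assumes the predictors were wrapped in a theano shared variable when the model was built. A minimal sketch of that setup (hypothetical names and toy data):

import numpy as np
import theano
import pymc3 as pm

# Wrap the predictors in a shared variable so they can be swapped later.
predictors = np.random.randn(100)
predictors_shared = theano.shared(predictors)

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0., sd=10.)
    beta = pm.Normal('beta', mu=0., sd=10.)
    mu = alpha + beta * predictors_shared
    pm.Normal('y', mu=mu, sd=1., observed=np.random.randn(100))
    trace = pm.sample(1000)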

I want to save my model today and use it for out-of-sample predictions next week. How can I achieve this?

Here's what I've tried:

  1. Pickle the model and trace so I can load them later, as described here: https://stackoverflow.com/a/44768217 (see the sketch after this list).
  2. Update the predictors in the loaded model. How? Theano's set_value() method doesn't make sense for a loaded model and trace, since I no longer hold references to the shared variables. I tried making a second copy of my model using the out-of-sample predictors as input.
  3. Run sample_ppc() with the out-of-sample predictors. I tried running sample_ppc() on the second model (specified exactly the same, but with different predictor samples) using the trace from the previously trained model. This fails with broadcasting errors because the sample lengths differ.
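For reference, the pickling step from the linked answer looks roughly like this (a minimal sketch; the file name and bundle keys are arbitrary):

import pickle

# Save the model and trace together so both can be reloaded later.
with open('model.pkl', 'wb') as f:
    pickle.dump({'model': model, 'trace': trace}, f)

# In a later session:
with open('model.pkl', 'rb') as f:
    saved = pickle.load(f)
model, trace = saved['model'], saved['trace']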

This seems like a common use case for anyone doing predictive modeling. How can I make out-of-sample predictions from a model and trace that I saved previously? It's very inefficient to retrain my model from scratch every time I want to make out-of-sample predictions.

I don't think you can do set_value() after you load a pickled model.

The easiest way I can think of: just save the trace, and rebuild the model for prediction using sample_ppc.

Yeah, definitely can't set_value().

Not sure what you mean about rebuilding the model using sample_ppc on the trace. Doesn't sample_ppc require a model? I am already able to load the model and trace and use sample_ppc for in-sample posterior testing; just can't do out-of-sample prediction.

By "rebuild the model" I meant you run again:

with pm.Model() as model:
    ...

And everything else follows as before. So it is as if you rerun everything, but instead of actually sampling you load the trace from before.
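Concretely, a minimal sketch of this rebuild-and-reload pattern (hypothetical file name and toy model; the point is only that the old trace is loaded instead of re-sampled):

import pickle
import numpy as np
import pymc3 as pm

# Load the trace saved after the original fit.
with open('model.pkl', 'rb') as f:
    trace = pickle.load(f)['trace']

# Rebuild the identical model specification, but with the
# out-of-sample predictors in place of the training data.
predictors_oos = np.random.randn(5)
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0., sd=10.)
    beta = pm.Normal('beta', mu=0., sd=10.)
    mu = alpha + beta * predictors_oos
    # Dummy observed values of the right shape; sample_ppc ignores them.
    pm.Normal('y', mu=mu, sd=1., observed=np.zeros(5))

with model:
    ppc = pm.sample_ppc(trace, samples=100)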

Oh, that's exactly what I tried. When using sample_ppc with the old trace and a new instance of the model (with different predictor data), I get broadcast errors unless my predictor data is exactly the same size as the original data. Isn't this why the theano set_value() method is usually recommended?

It looks like the problem is using a trace from one instance of my model for posterior sampling with another instance. I made a simple model where this works fine, and a slightly more complex model where it does not work.

The ā€œnoisyā€ model randomizes a noise distribution for each observation, which can cause shape issues when changing observations.

The example below works when using the simple_model() spec, but not the noisy_model() spec.

import numpy as np
import pymc3 as pm3

def simple_model(observed, predictor, shape_override=None):
    # Generate simple model from given sample of observed data and a predictor variable
    with pm3.Model() as model:
        mu_alpha = pm3.Normal('mu_alpha', mu=0., sd=50.)
        mu_beta = pm3.Normal('mu_beta', mu=0., sd=50.)
        mu = pm3.Deterministic(
            'mu', mu_alpha + mu_beta * predictor)
        sigma = pm3.HalfNormal('sigma', sd=50.)
        
        pm3.Normal('target', mu=mu, sd=sigma, observed=observed)
        
    return model

def noisy_model(observed, predictor, shape_override=None):
    # Generate noisy model from given sample of observed data and a predictor variable
    with pm3.Model() as model:
        mu_alpha_driver = pm3.Normal('mu_alpha_driver', mu=0., sd=50.)
        mu_alpha = pm3.Normal('mu_alpha', mu=0., sd=mu_alpha_driver, shape=len(observed))
        mu_beta = pm3.Normal('mu_beta', mu=0., sd=50.)
        mu = pm3.Deterministic(
            'mu', mu_alpha + mu_beta * predictor)
        sigma = pm3.HalfNormal('sigma', sd=50.)
        
        pm3.Normal('target', mu=mu, sd=sigma, observed=observed)
        
    return model

# Fit model
# model_insample = simple_model(observed=np.random.randn(10), predictor=np.random.randn(10))
# model_outofsample = simple_model(observed=np.random.randn(5), predictor=np.random.randn(5))
model_insample = noisy_model(observed=np.random.randn(10), predictor=np.random.randn(10))
model_outofsample = noisy_model(observed=np.random.randn(5), predictor=np.random.randn(5))
with model_insample:
    trace_insample = pm3.sample(4000, tune=500, chains=1, cores=1)

# Works
with model_insample:
    post_pred_insample = pm3.sample_ppc(trace_insample, samples=500)

# Fails for noisy_model with a broadcast error (works for simple_model)
with model_outofsample:
    post_pred_outofsample = pm3.sample_ppc(trace_insample, samples=500)

In your noisy model, one of the nodes has a fixed shape:
mu_alpha = pm3.Normal('mu_alpha', mu=0., sd=mu_alpha_driver, shape=len(observed))

So if you change the shape between the training set and the test set, this is not going to work.

In the model I'm trying to use, this shape argument is critical to the specification; it changes the model dynamics significantly. But I still need to perform out-of-sample posterior predictions with this model.

Is there a natural way to do this in the pymc3 framework?

Hmm, if you have nodes whose shapes depend on the data and change between inference and prediction, I don't think it is still valid to do OOS prediction this way, as the shape change would also modify the model logp.

Thanks, very good point.

@junpenglao, what approach would you recommend for out-of-sample prediction with time series distributions? From the examples I saw, the shape must be set when constructing the RV from a time series distribution during model specification. I would like to test a model that I have trained on one stream of data against a different stream that could have a different duration.

After you fit the model, you can do something like:

with fitted_model:
    new_timeserie = ...  # define the new time series here, with a different shape
    ppc = pm.sample_ppc(..., vars=[new_timeserie])

This is essentially what the GP module does in .conditional: http://docs.pymc.io/notebooks/GP-Marginal.html#Using-.conditional
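A minimal sketch of this pattern (a toy model with a plain Normal standing in for the time-series RV; the mechanism is the same):

import numpy as np
import pymc3 as pm

observed = np.random.randn(100)

with pm.Model() as fitted_model:
    mu = pm.Normal('mu', mu=0., sd=10.)
    sigma = pm.HalfNormal('sigma', sd=10.)
    pm.Normal('y', mu=mu, sd=sigma, observed=observed)
    trace = pm.sample(1000)

# Add a prediction RV with the new length inside the same model,
# then sample only that variable from the posterior.
n_new = 20
with fitted_model:
    y_new = pm.Normal('y_new', mu=mu, sd=sigma, shape=n_new)
    ppc = pm.sample_ppc(trace, samples=200, vars=[y_new])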

Thanks!

I was able to call .set_value() on pickled models if I saved the shared variables in a dictionary. For example, I created a dict with keys "model", "trace", "X" and "Y".

If I call dict['X'].set_value(...), it works for me. At least, when I do that on a Binomial model, I'm able to change the n parameter of the observed variable to whatever I like that way.
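Roughly what I did, as a toy sketch (made-up numbers; for data of a different length the shape caveats discussed above still apply):

import pickle
import numpy as np
import theano
import pymc3 as pm

# Keep a handle to the shared variable alongside the model; pickling
# them in one dict preserves object identity, so set_value() on the
# unpickled shared variable also updates the unpickled model.
n_shared = theano.shared(10)

with pm.Model() as model:
    p = pm.Beta('p', alpha=1., beta=1.)
    pm.Binomial('obs', n=n_shared, p=p, observed=np.array([3, 5, 7]))
    trace = pm.sample(1000)

with open('bundle.pkl', 'wb') as f:
    pickle.dump({'model': model, 'trace': trace, 'n': n_shared}, f)

# Later session: change n, then predict counts out of the new n.
with open('bundle.pkl', 'rb') as f:
    bundle = pickle.load(f)
bundle['n'].set_value(25)
ppc = pm.sample_ppc(bundle['trace'], model=bundle['model'], samples=100)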

@junpenglao
I'm still struggling to get it to work for the X and Y variables though… for some reason, no matter what I do to them, the prediction stays the same…

Turns out I was having the same pm.Deterministic issue mentioned above. Upgrading pymc3 to the master dev branch and reloading fixed it.
