How to predict new values on hold-out data


#1

Based on the quickstart, one has to build a model with theano shared variables as one’s inputs, and then change those variables to your hold-out data after you have your trace, putting it in pm.sample_posterior_predictive(), to make predictions.

But is it possible to do this with a pre-existing trace that wasn’t set up in this way? I’m asking because my model took a very long time to fit, and redoing it again seems redundant.


How do we predict on new unseen groups in a hierarchical model in PyMC3?
#2

I depends very much on how your particular model was defined, and if the parameters that were traced get their shape from broadcasting some of the original training data (this latter situation will lead to this issue).

I’ll give you an example in which you can use the “old trace” to predict out of sample observations without having to use shared variables anywhere:

import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt


def model_factory(x, y):
    x = np.asarray(x)
    y = np.asarray(y)
    with pm.Model() as model:
        beta = pm.Normal('beta', mu=0, sigma=1, shape=(2,))
        x_ = x.flatten()
        X = np.column_stack((np.ones_like(x_), x_))
        # If you enclose mu in Deterministic it will fail as #3346
        mu = tt.tensordot(X, beta, axes=[1, 0])
        pm.Normal('obs', mu=mu, sigma=1., observed=y)
    return model


BETA = np.array([1.4, -0.6])
train_x = np.random.randn(100) * 3
train_y = np.polyval(BETA[::-1], train_x)
hold_out_x = np.linspace(-10, 10, 51)
hold_out_y = np.polyval(BETA[::-1], hold_out_x)

# Perform the training to get the trace
with model_factory(train_x, train_y) as model:
    train_trace = pm.sample()
    pm.traceplot(train_trace)
    pm.plot_posterior(train_trace)
    ppc = pm.sample_posterior_predictive(train_trace)
    plt.figure()
    plt.plot(train_x, ppc['obs'].T, '.b', alpha=0.01)
    plt.plot(train_x, train_y, '-or')
    proxy_arts = [plt.Line2D([0], [0], marker='.', linestyle='', color='b'),
                  plt.Line2D([0], [0], marker='o', linestyle='-', color='r')]
    plt.legend(handles=proxy_arts, labels=['prediction', 'hold out data'])
    plt.title('Posterior predictive on the training set')

# Construct a new model with the held out data
with model_factory(hold_out_x, hold_out_y) as test_model:
    ppc = pm.sample_posterior_predictive(train_trace)
    plt.figure()
    plt.plot(hold_out_x, ppc['obs'].T, '.b', alpha=0.01)
    plt.plot(hold_out_x, hold_out_y, '-or')
    proxy_arts = [plt.Line2D([0], [0], marker='.', linestyle='', color='b'),
                  plt.Line2D([0], [0], marker='o', linestyle='-', color='r')]
    plt.legend(handles=proxy_arts, labels=['prediction', 'hold out data'])
    plt.title('Posterior predictive on the test set')

The trace plot looks like:


The plot posterior looks like:

The posterior predictive plot on the training data looks like this:
Figure_3
Finally, the posterior predictive plot on the held out data looks like this:
Figure_4

The only reason that this is possible is because none of the traced parameters (the variables that get sampled with pm.sample) have their shape determined by the training dataset. If for instance, you enclose mu in a deterministic as mu = pm.Deterministic('mu', tt.tensordot(X, beta, axes=[1, 0])), then mu.shape will depend on the original X's shape (independently of whether X is a shared tensor or not). This leads to the trace to hold mu values as a numpy array with a shape (and also values!) given by the training data, so when you try to sample the posterior predictive later on, you get a shape mismatch error.

If your model defines some parameters whose shape does not depend on the training data, except of course the observed variable, then you should be able to use the old train_trace as input to the sample_posterior_predictive of the model that was built with the held out data.


#3

So, I am guessing that a basic linear regression set up like this (with n=1460 and 270 variables/columns in the dataframe), with priors given to all the variables, is doomed because the shape was specified?

with pm.Model() as model:
    
    # Priors 
    beta = pm.Normal('beta', mu=0, sd=5, shape=(270, 1))
    intercept = pm.Normal('intercept', mu=0, sd=3)
    
    std = pm.HalfNormal('std', sd=100)

    # Likelihood
    price = intercept + pm.math.dot(train_norm, beta)
    
    y_lik = pm.Normal('y_lik', mu=price, sd=std, observed=np.log(SalePrice))

    trace = pm.sample(cores=1)

#4

No, it will work fine. Sorry I was not precise enough. Your beta variable will always be constrained to have a shape=(270, 1). That means that your held out data must have a shape=(..., 270). Your model’s shape constraint is rather light. If we call the last dimension of the data array the features, your model will only be able to work if you always supply it with data that has the same number of features. However, you are completely free to specify the number of rows, or observations, the model gives to y_lik.

If you had wrapped price with a Deterministic, then you would have run into errors because its entire shape and values would have been determined by the training data.


#5

FYI, the issue regarding deterministics of shared values was resolved by this PR. Now, all the deterministics stored in the trace are ignored in sample posterior predictive, so you don’t have to remove them by hand from the trace.


#6

(Sorry for not replying a long time, I was getting my model to just run)

So, the crux of the deal seems to me that, for prediction, models have to be reinitialized with the previous settings, but without pm.Sample() being called. So would something like this work to get the predictions?

with model_factory(x_data, None) as predictive_model:
    ppc = pm.sample_posterior_predictive(train_trace)

Assuming ppc is what holds the predictions, how would I proceed if I do not have any hold_out_y data to input, only hold_out_x?


#7

You can not set the second argument of the factory function to None. That argument is given as observed to y_lik and should ideally be an array. Just supply an array with the adequate shape (which should only depend on your hold_out_x) and arbitrary values, numpy.empty could help. The values will be ignored by sample_posterior_predictive but the shape will be used.


#8

Ok, going ahead and creating fake holdout output data with

fake_holdout_y = np.empty(test_norm_retained.shape)

(test_norm_retained being my actual hold out input data), so that now they both have shapes (1459, 135), I run the code in my last post, and get the error

An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 8))
...
...
...
ttempted to generate values with incompatible shapes:
            size: 1
            dist_shape: (1459,)
            broadcast_shape: (1459, 1)

To verify this error, I have tried every combination of (1459, 135) in np.empty((shape)), but they all produce errors. What do you think is happening now?

Note: my training input and output data had 1460 rows (same columns, though), but would that affect getting predictions with less data?


#9

Is there something wrong I am doing? Or should I just consign myself forevermore to running every model with theano shared variables as inputs?


#10

I get teh same error when I try to change the n parameter in a Binomial observed RV using the “rebuild the model” method.

If I change the n parameter by .set_value() using the dict method I described above, it works…
@junpenglao, is it not possible to change n of Binomial observed RV using the rebuild model method?


#11

I got past this error by making sure my variable was of the correct type.
I looked below the text you quoted from the traceback, and at the bottom was something about cannot cast safely from int64 to int32 or something


#12

Hi @Gon_F, sorry for not replying earlier. I was a bit busy to look much at the discourse. From what I see, your problem is that price will end up having (N, 1) shape, instead of just (N,). Could you try setting beta's shape to (270, 1). If not, you can also try to tt.squeeze(price) when you pass it as y_lik's mu.

I think that the true origin of the problem is that when you set observed=np.log(SalePrice), the sale’s price shape will be used to determine the y_lik variable’s shape. Currently in pymc3, distribution parameter values are not taken into account for the determination of shape, and will likely remain that way. pymc4 will not be like this because it will work with tensorflow distribution’s which are handled as tensors themselves, with some added flavors.

Nevertheless, to tell you a more precise answer, I would need to know the full traceback of the exception, which looks like it is raised from distributions.generate_samples.


#13

It seems that, on the python stack given by azure which I was using, PyMC3 was at ver. 3.5 instead of 3.6, which may have been causing my weird errors. After updating it, the following standard method gives predictions perfectly.

with model_factory(holdout_x, fake_holdout_y) as predictive_model:
    prediction = pm.sample_posterior_predictive(train_trace, samples=5000)

and I have been able to make fake_holdout_y equal to np.empty_like(holdout_x) as well as np.empty((1,2)) (there are minor differences between each set of predictions, but that is supposed to be a side effect of MCMC being an approximation, right?)

So thank you in helping me make this work. Last question, however; after I gain the predictions from pm.sample_posterior_predictive() with the chosen samples size, which subset do I choose as my final predictions? Is the correct thing to simply average over all the samples?


#14

Well, the thing is that you have the probability distribution of the outcomes which is much more information than just a single predicted number. The answer to your question depends very much in what you are trying to do with your predictions and the formal theory behind it is called Bayesian decision making. I’ll give you three very simple examples, in all of the let’s consider that what you were predicting were probable readings of a sensor:

  1. Your distribution has a single mode and you are interested in the expected value of the distribution, so you just have to take the mean over the posterior predictive.
  2. Your distribution is bimodal and you want to predict the most likely reading. In this case you would have to do something with the kde to extract the most likely prediction.
  3. If your prediction is larger than what the sensor actually ends up saying, something horrible happens. If it is smaller there is some small nuisance, if it is right on the mark, nothing wrong happens. This in my opinion is the most important example because it introduces the notion of a return or a cost associated to your prediction. You can then take into account this cost function to make a prediction that is the most accurate with the smallest risk. I recommend that you read this blog post written by @twiecki and @canyon289 where they talk about a example based on supply chain optimization.

#15

Yes, I am already familiarized with cost functions, and how to introduce new ones, mostly from Bayesian Methods for Hackers’ 5th chapter, but in this case, I cannot think of an applicable new cost function, because I am only trying to use my data to predict housing prices, in which neither overestimates nor underestimates are worse (so squared loss seems fine, if I had to choose?).

In addition, all of the chains seem to have converged well, and the posterior checks show regular normal distributions for each variable.

So, in this situation, what do I do to gain the most accurate possible point estimates? Thanks for Wiecki’s article link, they are always good to read.


#16

If that’s the case, the mean of the posterior predictive should be fine.


#17

Last question (I promise): how many samples should be drawn from pm.sample_posterior_predictive()? I arbitrarily chose 100000, but it took around 30 minutes to sample all of that, so I was wondering if there are any rules of thumb for how many samples need to be drawn, given the complexity of a model and a certain desired accuracy (a p-value of sorts for getting one’s true predictions?


#18

The only rule I can think of is to never try to draw more samples than the number of points stored in the trace.