How to predict new values on hold-out data

(Sorry for not replying for a long time, I was busy just getting my model to run.)

So, the crux of the matter seems to be that, for prediction, the model has to be rebuilt with the previous settings, but without calling pm.sample(). Would something like this work to get the predictions?

with model_factory(x_data, None) as predictive_model:
    ppc = pm.sample_posterior_predictive(train_trace)

Assuming ppc is what holds the predictions, how would I proceed if I do not have any hold_out_y data to input, only hold_out_x?

You cannot set the second argument of the factory function to None. That argument is passed as observed to y_lik and should be an array. Just supply an array with the adequate shape (which should only depend on your hold_out_x) and arbitrary values; numpy.empty can help. The values will be ignored by sample_posterior_predictive, but the shape will be used.
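For example, something like this should work (a minimal sketch following the factory pattern above; it assumes y_lik has one observation per row of hold_out_x, so only the length of the dummy array matters):

import numpy as np

# Values are ignored by sample_posterior_predictive; only the shape is used.
fake_y = np.empty(hold_out_x.shape[0])

with model_factory(hold_out_x, fake_y) as predictive_model:
    ppc = pm.sample_posterior_predictive(train_trace)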

Ok, going ahead and creating fake holdout output data with

fake_holdout_y = np.empty(test_norm_retained.shape)

(test_norm_retained being my actual hold-out input data), so that both now have shape (1459, 135), I run the code from my last post and get the error

An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line string', (1, 8))
...
Attempted to generate values with incompatible shapes:
            size: 1
            dist_shape: (1459,)
            broadcast_shape: (1459, 1)

To troubleshoot this error, I have tried every combination of (1459, 135) in np.empty(shape), but they all produce errors. What do you think is happening now?

Note: my training input and output data had 1460 rows (same columns, though); would that affect getting predictions on fewer rows?

Is there something I am doing wrong? Or should I just resign myself forevermore to running every model with Theano shared variables as inputs?


I get the same error when I try to change the n parameter of a Binomial observed RV using the "rebuild the model" method.

If I change the n parameter with .set_value() using the dict method I described above, it works…
@junpenglao, is it not possible to change the n of a Binomial observed RV using the rebuild-the-model method?
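For reference, the .set_value() approach looks roughly like this (a sketch with hypothetical names like train_n and test_n; the dict method itself is described in an earlier post not quoted here):

import numpy as np
import pymc3 as pm
import theano

# n is a shared variable so it can be swapped without rebuilding the model.
n_shared = theano.shared(np.asarray(train_n))

with pm.Model() as model:
    p = pm.Beta("p", alpha=1.0, beta=1.0)
    pm.Binomial("obs", n=n_shared, p=p, observed=train_successes)
    trace = pm.sample()

# Swap in the hold-out n (assumed here to have the same shape as train_n)
# before sampling the posterior predictive.
n_shared.set_value(np.asarray(test_n))
with model:
    ppc = pm.sample_posterior_predictive(trace)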

I got past this error by making sure my variable was of the correct type.
Looking below the text you quoted from the traceback, at the bottom there was something about not being able to cast safely from int64 to int32, or something similar.

Hi @Gon_F, sorry for not replying earlier, I was a bit too busy to look much at the discourse. From what I see, your problem is that price will end up having shape (N, 1) instead of just (N,). Could you try setting beta's shape to (270, 1)? If not, you can also try tt.squeeze(price) when you pass it as y_lik's mu.

I think that the true origin of the problem is that when you set observed=np.log(SalePrice), the sale price's shape will be used to determine the y_lik variable's shape. Currently in pymc3, distribution parameter values are not taken into account when determining shape, and it will likely remain that way. pymc4 will not be like this, because it will work with tensorflow distributions, which are handled as tensors themselves, with some added flavors.
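To illustrate the shape issue, here is a minimal sketch (not your actual model; the regression structure and variable names are assumptions):

import pymc3 as pm
import theano.tensor as tt

def model_factory(X, log_price):
    with pm.Model() as model:
        # beta with shape (n_features,) gives price shape (N,); a
        # (n_features, 1) beta would give price shape (N, 1) instead.
        beta = pm.Normal("beta", mu=0.0, sigma=1.0, shape=X.shape[1])
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        price = tt.dot(X, beta)
        # y_lik's shape comes from the observed array, so mu must broadcast
        # cleanly against it; tt.squeeze(price) drops a stray length-1 axis.
        pm.Normal("y_lik", mu=tt.squeeze(price), sigma=sigma, observed=log_price)
    return model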

Nevertheless, to give you a more precise answer, I would need the full traceback of the exception, which looks like it is raised from distributions.generate_samples.

It seems that, on the Python stack provided by Azure which I was using, PyMC3 was at version 3.5 instead of 3.6, which may have been causing my weird errors. After updating it, the following standard method gives predictions perfectly.

with model_factory(holdout_x, fake_holdout_y) as predictive_model:
    prediction = pm.sample_posterior_predictive(train_trace, samples=5000)

and I have been able to make fake_holdout_y equal to np.empty_like(holdout_x) as well as np.empty((1, 2)) (there are minor differences between the resulting sets of predictions, but that is supposed to be a side effect of MCMC being an approximation, right?)

So thank you for helping me make this work. One last question, however: after I get the predictions from pm.sample_posterior_predictive() with the chosen number of samples, which subset do I choose as my final predictions? Is the correct thing to simply average over all the samples?


Well, the thing is that you have the probability distribution of the outcomes, which is much more information than just a single predicted number. The answer to your question depends very much on what you are trying to do with your predictions, and the formal theory behind it is called Bayesian decision making. I'll give you three very simple examples; in all of them, let's consider that what you are predicting are probable readings of a sensor:

  1. Your distribution has a single mode and you are interested in the expected value of the distribution, so you just have to take the mean over the posterior predictive.
  2. Your distribution is bimodal and you want to predict the most likely reading. In this case you would have to do something with the KDE to extract the most likely prediction.
  3. If your prediction is larger than what the sensor actually ends up reading, something horrible happens. If it is smaller, there is some small nuisance; if it is right on the mark, nothing bad happens. This, in my opinion, is the most important example because it introduces the notion of a return or a cost associated with your prediction. You can then take this cost function into account to make the prediction that is most accurate with the smallest risk. I recommend that you read this blog post written by @twiecki and @RavinKumar, where they talk about an example based on supply chain optimization. (A small code sketch of these three cases follows below.)
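A small sketch of these three cases (assuming ppc holds the posterior predictive samples for a variable named y_lik, as earlier in the thread; the cost numbers are made up):

import numpy as np
from scipy import stats

samples = ppc["y_lik"]                  # shape: (n_samples, n_observations)

# 1. Expected value: average over the posterior predictive draws.
point_mean = samples.mean(axis=0)

# 2. Most likely reading of a (possibly multimodal) distribution: take the
#    mode of a kernel density estimate, here for the first observation.
kde = stats.gaussian_kde(samples[:, 0])
grid = np.linspace(samples[:, 0].min(), samples[:, 0].max(), 500)
point_mode = grid[np.argmax(kde(grid))]

# 3. Asymmetric cost: overestimating is 10x worse than underestimating, so
#    pick the candidate prediction with the smallest expected cost.
def expected_cost(pred, draws, over=10.0, under=1.0):
    err = pred - draws
    return np.mean(np.where(err > 0, over * err, -under * err))

candidates = np.linspace(samples[:, 0].min(), samples[:, 0].max(), 200)
point_decision = candidates[np.argmin([expected_cost(c, samples[:, 0])
                                       for c in candidates])]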

Yes, I am already familiar with cost functions and how to introduce new ones, mostly from Bayesian Methods for Hackers' 5th chapter, but in this case I cannot think of an applicable new cost function, because I am only trying to use my data to predict housing prices, where neither overestimates nor underestimates are worse than the other (so squared loss seems fine, if I had to choose?).

In addition, all of the chains seem to have converged well, and the posterior checks show regular normal distributions for each variable.

So, in this situation, what do I do to get the most accurate possible point estimates? Thanks for the link to Wiecki's article, his posts are always good to read.

If that’s the case, the mean of the posterior predictive should be fine.

Last question (I promise): how many samples should be drawn with pm.sample_posterior_predictive()? I arbitrarily chose 100000, but it took around 30 minutes to sample all of that, so I was wondering if there are any rules of thumb for how many samples need to be drawn, given the complexity of a model and a certain desired accuracy (a p-value of sorts for getting one's true predictions?).

The only rule I can think of is to never try to draw more samples than the number of points stored in the trace.
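In code, that cap looks roughly like this (a sketch reusing the train_trace and predictive_model names from earlier in the thread):

# Total draws stored in the MultiTrace: draws per chain times number of chains.
total_draws = len(train_trace) * train_trace.nchains

with predictive_model:
    ppc = pm.sample_posterior_predictive(train_trace, samples=total_draws)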

Is this really correct, @lucianopaz? We have been working on a problem that is not regression, but has a similar flavor. Following @junpenglao’s Code 10 example, we have deterministic variables that select parameters from parameter matrices. We definitely get different results if we censor the deterministics from the input MultiTrace before we invoke pm.sample_posterior_predictive. It sounds like you are saying that draw_values should ignore these trace elements?

If that is so, maybe we should try to extract a problematic case.

@rpgoldman, could you post a minimal model that shows this? I will take a closer look at it tomorrow.

I apologize – this was a false positive. My colleague just told me that the problem was an error in post processing, not in the sampling.
I will check to verify that and let you know.


This is a nice workaround. How can I adapt this solution when working with pm.GLM.from_formula?

Here is my model building block that I prefer not to change:

with pm.Model() as model:
    priors = {
        "Intercept": pm.Normal.dist(mu=start_mu['Intercept'], sigma=start_sig['Intercept']),
        "a": pm.Normal.dist(mu=start_mu['a'], sigma=start_sig['a']),
        "b": pm.Normal.dist(mu=start_mu['b'], sigma=start_sig['b']),
        "c": pm.Normal.dist(mu=start_mu['c'], sigma=start_sig['c']),
        "d": pm.Normal.dist(mu=start_mu['d'], sigma=start_sig['d'])
    }
    family = pm.glm.families.StudentT()
    pm.GLM.from_formula(formula='y ~ a + b + c + d',
                        data=train_df,
                        priors=priors,
                        family=family)
    trace = pm.sample(draws=11000,
                      tune=700,
                      init='advi',
                      start=None,
                      cores=ncores,
                      chains=ncores,
                      random_seed=[123 for _ in range(ncores)],
                      discard_tuned_samples=True,
                      compute_convergence_checks=True)

Thanks!

You just have to move your model's definition into a function that receives either the training or the test data:

def model_factory(data):
    with pm.Model() as model:
        priors = {
            "Intercept": pm.Normal.dist(mu=start_mu['Intercept'], sigma=start_sig['Intercept']),
            "a": pm.Normal.dist(mu=start_mu['a'], sigma=start_sig['a']),
            "b": pm.Normal.dist(mu=start_mu['b'], sigma=start_sig['b']),
            "c": pm.Normal.dist(mu=start_mu['c'], sigma=start_sig['c']),
            "d": pm.Normal.dist(mu=start_mu['d'], sigma=start_sig['d'])
        }
        family = pm.glm.families.StudentT()
        pm.GLM.from_formula(formula='y ~ a + b + c + d',
                            data=data,
                            priors=priors,
                            family=family)
    return model

with model_factory(train_data):
    trace = pm.sample(draws=11000,
                      tune=700,
                      init='advi',
                      start=None,
                      cores=ncores,
                      chains=ncores,
                      random_seed=[123 for _ in range(ncores)],
                      discard_tuned_samples=True,
                      compute_convergence_checks=True)

with model_factory(test_data):
    ppc = pm.sample_posterior_predictive(trace) #or whatever

Thank you for this! It was super helpful.

  1. Would it be possible to show an example of a hierarchical linear model as well? I'm finding it hard to plot the predictions against the train and test data.
  2. Also, could you point to any article that helps with building and predicting on hierarchical data that's not necessarily linear?

Thank you for your time!

Hi @Sam_Anand!
I think the PyMC3 port of Rethinking_2 chapters 13 and 14 (hierarchical models) could help you with that :wink:
PyMCheers :vulcan_salute:
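
In the meantime, here is a minimal sketch (not taken from the Rethinking port) of the same factory pattern applied to a varying-intercept hierarchical linear model; all names (x, group_idx, y, n_groups, train_*, test_*) are hypothetical:

import numpy as np
import pymc3 as pm

def hierarchical_model_factory(x, group_idx, y, n_groups):
    with pm.Model() as model:
        mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0)
        sigma_a = pm.HalfNormal("sigma_a", sigma=5.0)
        a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_groups)  # group intercepts
        b = pm.Normal("b", mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal("sigma", sigma=5.0)
        pm.Normal("y_lik", mu=a[group_idx] + b * x, sigma=sigma, observed=y)
    return model

# Train on the training data, then rebuild with hold-out inputs and a dummy y
# of matching shape to sample predictions.
with hierarchical_model_factory(train_x, train_g, train_y, n_groups):
    trace = pm.sample()

with hierarchical_model_factory(test_x, test_g, np.empty(len(test_x)), n_groups):
    ppc = pm.sample_posterior_predictive(trace)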