Do we need a testing set?

I was listening to @ericmjl’s talk from PyData, An Attempt At Demystifying Bayesian Deep Learning (really loved the talk by the way!), and he mentioned at the end, during the comments, that in a Bayesian setting we don’t need a testing set. Being very new to the Bayesian world, this got me really confused. Then I read his blog post about it here, where he further explains it and quotes Radford M. Neal’s thesis.

I just wanted to start a discussion to try to better understand this. If I want to use my model to predict data, how could I evaluate how good my model is if I don’t test its performance on unseen data?


Sincerely, a confused bayesian newbie


My impression of what Eric means is that a Bayesian model does not overfit, so you don’t need a testing set to evaluate the overfitting problem. However, I think this also depends on context: you can certainly have an overfitting problem if your model is wrongly specified. It might be more meaningful to say that, at similar model complexity, a Bayesian model is more robust than a frequentist model.
(my two cents…)


Ok, so traditionally in ML you would use either a validation set (or cross-validation) plus a final testing set. The validation set would be used to optimize the hyper-parameters of the algorithm. In the Bayesian setting we basically just have the priors’ hyper-parameters that we might want to tune. But what you are essentially saying is that we don’t really need a validation set here, because the model won’t really overfit the data?

Can the stats we calculate from our traces, like pymc3.stats.dic, pymc3.stats.bpic, pymc3.stats.waic, and pymc3.stats.loo, be used to evaluate the performance of a model on unseen data? I know that in the frequentist setting we can’t really rely on any performance statistics calculated on the training set. Is it the same in the Bayesian setting?


I am a big believer in cross-validation, so things like k-fold and leave-one-out cross-validation are my go-to approach. In that regard, pymc3.stats.loo might be the best metric, since it approximates the leave-one-out cross-validation score of your model. pymc3.stats.waic is also a good metric.
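To make that concrete, here is a toy illustration (my own, not from the thread) of what the leave-one-out score means: refit the model n times, each time holding out one point, and score the prediction on the held-out point. I use squared error for simplicity; pymc3.stats.loo works with the log predictive density instead, and approximates the same quantity via importance sampling rather than refitting n times.

```python
# Brute-force leave-one-out cross-validation for a tiny linear model.
# Squared error is used as the score purely for illustration.
import numpy as np

def loo_squared_errors(x, y):
    """Refit the model n times, each time holding out one point,
    and return the squared error on each held-out point."""
    n = len(x)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i          # every point except i
        # fit slope and intercept on the remaining n-1 points
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        pred = slope * x[i] + intercept   # predict the held-out point
        errors[i] = (y[i] - pred) ** 2
    return errors

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=20)
print(loo_squared_errors(x, y).mean())    # average out-of-sample error
```

The point of the importance-sampling approximation is exactly to avoid the n refits in this loop, which would be prohibitive for an MCMC-fitted model.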

But ultimately, the model performance should be evaluated by whether it can predict future data.


@EtienneT, this is part of my learning journey here, so thank you for posting your question.

I remember distinctly getting a pin-drop silence right after I blurted out those words, as if I had spoken some heresy. Perhaps because of the “live” nature of the situation, I didn’t communicate it as clearly as I should have.

There’s model specification uncertainty, and then there’s parameter uncertainty. When we’re evaluating a model, we’re asking two questions, sequentially:

  1. Are we certain about the model?
  2. Given that we are certain about the model, are we certain about the parameter values?

The first question is both an art and a science. A good model spec will have high concordance with what we know about how the world works, which I find difficult to quantify. For example, how would one quantify the correctness of a model w.r.t. its concordance with reality? We’re really asking a philosophical question about whether we believe in causality or not! To answer the first question, where possible and defensible, I would much prefer to inject as much knowledge as possible about how X causes Y into my model.

With deep nets, we sort of throw out the first question - since deep nets can approximate any function, we don’t worry about the model specification. That said, deep nets have the capacity to overfit to the data, which is where train/test splits come in handy.

What then about Bayesian neural networks (BNNs)? From intuition and empirical observation, I can offer this (perhaps not-so-satisfying, and maybe somewhat-incorrect-as-well) explanation: when fitting a BNN, some parameters will be so obviously a certain value that they’ll quickly converge there during the fitting phase. About other parameters we’ll not be so certain (or perhaps we may never need to be certain about them). If we hold out data, then we just give less data to the model, and so we’ll just be less certain about the parameter values.

Counter-factually, if we didn’t go Bayesian, there’s a potential for those “uncertain” parameter values to take on other point estimates that let us overfit.

Thus, where I was going with that train/test split comment was that conditioned on the model being correct (or being a black box), we don’t need train/test splits in a Bayesian setting. That said, God help us all if we have our model specified incorrectly!
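A toy numpy sketch of that counter-factual (my own illustration, not from the talk): an unregularized point estimate is free to interpolate the noise, while the MAP under a zero-mean Gaussian prior on the weights, which is exactly ridge regression, shrinks the directions the data can’t pin down.

```python
# Toy comparison: unregularized least squares vs. the MAP of a
# Bayesian linear model with a Gaussian prior (ridge regression),
# both fitting a degree-11 polynomial to 12 noisy samples of sin.
import numpy as np

rng = np.random.default_rng(42)

def design(x, degree=11):
    # polynomial feature matrix [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)

def fit(X, y, alpha):
    if alpha == 0.0:
        # plain least squares: the pure point estimate
        return np.linalg.lstsq(X, y, rcond=None)[0]
    # MAP under a N(0, 1/alpha) prior on each weight (= ridge)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

x_train = np.linspace(0, 1, 12)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=12)
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)     # the true underlying function

X_tr, X_te = design(x_train), design(x_test)
for alpha in (0.0, 1e-3):
    w = fit(X_tr, y_train, alpha)
    print(f"alpha={alpha}: test MSE = {np.mean((X_te @ w - y_test) ** 2):.3f}")
```

With 12 points and 12 coefficients, the alpha=0 fit passes through every noisy observation; the prior trades that perfect training fit for much better behavior away from the data.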

In conclusion, I concur with @junpenglao’s final conclusion: “model performance should be evaluated by whether it can predict future data”. Use as many methods as you can to justify the model structure: train/test splits, argumentation with/learning from subject matter experts, information criteria, leave-one-out CV. That’s the more important step in the modeling process, and IMHO, given the complexity of the world and the approximate nature of our models, there shouldn’t be “one and preferably only one way” to evaluate how good our model is.


Thank you both for your replies, I think this clarifies a lot! Haha, yeah, the moment you said that in the presentation you could feel the room go quiet!

So let me try to summarize what I understand. Given that we are certain our model is the right one to represent our data (your point 1), since the frequentist model doesn’t carry uncertainty through the model and its predictions, it is free to overfit the data by minimizing some cost function as much as possible without taking the uncertainty into account.

If we compare that to a Bayesian model, where uncertainty is kept throughout the whole process (our parameters are actually probability distributions that best represent our data), the model will take that uncertainty into account in its predictions as well, and so it can’t really overfit the data.

But, this doesn’t remove the need to test the performance of the model on unseen data if we want to use it for prediction.

Thanks a lot for your time, and I will definitely watch your next presentations if they are on YouTube!


I think you’ve summarized what I was thinking much better than I could have, @EtienneT!

I hardly have a definite opinion formed on this either, but here are my two cents:

The question of having training and test data is not about frequentist vs. Bayesian methods, but about what kind of datasets you’re looking at, and what you care about. Let’s look at a classical statistical problem.

Because I noticed that I think more carefully about problems like this when it sounds like something important is at stake, assume that we have a dataset containing the survival of 100 cancer patients, with one treatment and one control group. The treatment group got a fancy new drug that we have reason to believe could actually work, and the control group got an older drug that we are pretty sure works as well, but maybe not as well as the new one. We want to figure out how the effects of the two drugs differ.

In a setting like this, I think most people would agree that splitting the data into a test and a training set, then using only the training set to draw our conclusions while we use the testing set for model checking, would be a really bad idea. We’d lose a lot of our already sparse data (and we can’t easily get more without potentially killing a lot of people), and on the flip side, do we really gain anything from the test data? Comparing the predictions of two models with the help of a test dataset is just another estimation problem; why would that be any easier than the original estimation problem? We’d be throwing away good data for no good reason.

So instead, we should use all of our dataset for the original model fitting, and use frequentist and/or Bayesian methods to say something about sampling error or uncertainty, assuming the model is right. We can and should still do model checking, however. We can look at the loo, for instance. But so far, one of the most powerful tools I’ve found for this is comparing the posterior predictive with the actual data.
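To make that posterior-predictive comparison concrete, here is a minimal sketch (my own toy example, using a conjugate Beta-Binomial model so no sampler is needed; in PyMC3 the replicated datasets would come from sampling the posterior predictive instead):

```python
# A minimal posterior-predictive check: simulate replicated datasets
# from the fitted model and compare a summary statistic against the
# one actually observed.
import numpy as np

rng = np.random.default_rng(1)
data = rng.binomial(1, 0.7, size=50)      # 50 coin flips, true p = 0.7

# Beta(1, 1) prior + Bernoulli likelihood -> Beta posterior
a_post = 1 + data.sum()
b_post = 1 + len(data) - data.sum()

# draw p from the posterior, then replicate a full dataset per draw
p_draws = rng.beta(a_post, b_post, size=4000)
replicated_heads = rng.binomial(len(data), p_draws)

# posterior-predictive p-value for the "number of heads" statistic
ppc_pvalue = np.mean(replicated_heads >= data.sum())
print(ppc_pvalue)
```

Values of the posterior-predictive p-value near 0 or 1 mean the model rarely reproduces what was actually observed, which is a warning sign of misfit; all of this uses the full dataset, with no hold-out split.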

Just compare that with, say, the face recognition that Facebook is doing. They don’t really care about parameters; they want to write a good black box that you feed images to and that spits out a name. They don’t have much of a limit on how much data they can use, and so can afford to evaluate different methods on test data. With the training/test split you can take any method that can be trained and can predict, and try to solve the estimation problem “which of those predicts better”. It allows you to use a very wide range of different tools, and you don’t need to worry so much about the foundations of those tools. As such, you can of course also test Bayesian methods this way, and doing so might be a very good idea under some circumstances.

Update: One more fundamental difference between many machine learning algos and Bayesian methods is that Bayesian methods know on their own how precise their predictions would be, if the model were true.


One difference to consider is the setting of your problem: generative vs. discriminative. If you are learning a distribution of the data, then there is technically no need for a validation data set, because partitioning the data you are given into train and validation sets would only reduce what you can learn from. You could use all of the samples to learn the distribution, and then simply generate predictions. So the setting is to learn the distribution of your X.
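As a toy sketch of that generative setting (my own example, assuming a conjugate normal model with known variance): every sample goes into learning p(X), and generation follows directly from the posterior, with no split.

```python
# Learn p(X) from all samples, then generate new data from the
# posterior predictive. Gaussian likelihood with unknown mean,
# known variance, and a conjugate normal prior on the mean.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(loc=3.0, scale=1.0, size=500)   # all observed samples

# conjugate update: N(mu0, tau0^2) prior on the mean, sigma known
mu0, tau0, sigma = 0.0, 10.0, 1.0
n = len(X)
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mean = post_var * (mu0 / tau0**2 + X.sum() / sigma**2)

# posterior predictive: draw a mean, then draw new observations
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=1000)
new_samples = rng.normal(mu_draws, sigma)
print(new_samples.mean())   # should sit near the data mean
```

Holding out a validation set here would only widen the posterior on the mean without buying any extra check that generated samples can’t already provide.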


However, if it is a discriminative (classification) or regression setting, it would behoove you to have a validation set.


More references and resources in @twiecki’s post.