Do we need a testing set?

I hardly have a fully formed opinion on this either, but here are my two cents:

The question of having training and test data is not about frequentist vs Bayesian methods, but about what kind of dataset you're looking at and what you care about. Let's look at a classical statistical problem.

Because I've noticed that I think more carefully about problems like this when something important sounds like it is at stake, assume we have a dataset containing the survival of 100 cancer patients, split into one treatment and one control group. The treatment group got a fancy new drug that we have reason to believe could actually work, and the control group got an older drug that we are pretty sure works too, but maybe not as well as the new one. We want to figure out how the effects of the two drugs differ.

In a setting like this, I think most people would agree that splitting the data into a training and a test set, drawing our conclusions only from the training set, and using the test set for model checking would be a really bad idea. We'd lose a lot of our already sparse data (and we can't easily get more without potentially killing a lot of people), and on the flip side, do we really gain anything from the test data? Comparing the predictions of two models on a test dataset is just another estimation problem; why would that be any easier than the original one? We'd be throwing away good data for no good reason.
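To make that concrete, here is a minimal sketch of what "use all the data" could look like: a conjugate Beta-Binomial comparison of the two arms. The counts, arm sizes, and flat Beta(1, 1) priors are made-up assumptions for illustration, not from any real study.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for 100 patients, 50 per arm -- not from a real study.
n_treat, surv_treat = 50, 38   # new drug
n_ctrl,  surv_ctrl  = 50, 31   # old drug

# Flat Beta(1, 1) priors are conjugate here, so each arm's survival
# rate has a Beta posterior in closed form.
post_treat = stats.beta(1 + surv_treat, 1 + n_treat - surv_treat)
post_ctrl  = stats.beta(1 + surv_ctrl,  1 + n_ctrl  - surv_ctrl)

# Posterior of the difference in survival rates via Monte Carlo,
# using *all* of the data.
rng = np.random.default_rng(0)
diff = (post_treat.rvs(100_000, random_state=rng)
        - post_ctrl.rvs(100_000, random_state=rng))
print(f"P(new drug has higher survival rate) ~ {(diff > 0).mean():.2f}")
print(f"95% credible interval for the difference: "
      f"({np.quantile(diff, 0.025):.2f}, {np.quantile(diff, 0.975):.2f})")
```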

So instead, we should use all of our data for the original model fitting, and use frequentist and/or Bayesian methods to say something about sampling error or uncertainty, assuming the model is right. We can and should still do model checking, however. We can look at LOO (approximate leave-one-out cross-validation), for instance. But so far I have found that one of the most powerful tools for this is comparing the posterior predictive distribution with the actual data.
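As a minimal sketch of such a posterior predictive check, continuing the hypothetical treatment arm from above: simulate replicated datasets from the fitted model and ask whether the observed data would look unusual among them.

```python
import numpy as np
from scipy import stats

# Same hypothetical treatment arm as above.
rng = np.random.default_rng(0)
n_treat, surv_treat = 50, 38
post_treat = stats.beta(1 + surv_treat, 1 + n_treat - surv_treat)

# Posterior predictive check: draw a survival rate from the posterior,
# then simulate a replicated dataset of the same size from it.
theta = post_treat.rvs(10_000, random_state=rng)
y_rep = rng.binomial(n_treat, theta)          # replicated survival counts
lo, hi = np.quantile(y_rep, [0.025, 0.975])
print(f"observed {surv_treat} survivors; "
      f"95% of replicated datasets fall in [{lo:.0f}, {hi:.0f}]")
```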

Just compare that with say the face recognition that facebook is doing. They don’t really care about parameters, they want to write a good black-box that you feed images to and that spits out a name. They don’t have much of a limit on how much data they can use and so can afford to evaluate different methods on test data. With the training/test split you can take any method that can be trained and can predict, and try to solve the estimation problem “which of those predicts better”. It allows you to use a very wide range of different tools, and you don’t need to worry so much about the foundations of those tools. As such you can of course also test bayesian methods, and doing so might be a very good idea under some circumstances.
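Here is a rough sketch of that workflow, with scikit-learn and simulated data standing in for a real large labeled dataset; the two models and the accuracy metric are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Simulated stand-in for a large labeled dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=10_000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Any method that can fit and predict can enter the comparison,
# regardless of its statistical foundations.
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: held-out accuracy {acc:.3f}")
```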

Update: One more fundamental difference between many machine learning algorithms and Bayesian methods is that Bayesian methods know on their own how precise their predictions would be if the model were true.
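For instance, continuing the hypothetical Beta-Binomial example, the posterior predictive comes with its own model-conditional statement of precision at no extra cost:

```python
import numpy as np
from scipy import stats

# Same hypothetical treatment arm as above.
n_treat, surv_treat = 50, 38
post = stats.beta(1 + surv_treat, 1 + n_treat - surv_treat)

# A point prediction for a new patient's survival probability...
print(f"posterior mean survival probability: {post.mean():.2f}")
# ...and, for free, a statement of how precise that prediction is
# *if the model is true*:
lo, hi = post.interval(0.95)
print(f"95% credible interval: ({lo:.2f}, {hi:.2f})")
```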
