@EtienneT, this is part of my learning journey here, so thank you for posting your question.
I distinctly remember getting pin-drop silence right after I blurted out those words, as if I had spoken some heresy. Perhaps because of the “live” nature of the situation, I didn’t communicate it as clearly as I should have.
There’s model specification uncertainty, and then there’s parameter uncertainty. When we’re evaluating a model, we’re asking two questions, sequentially:
- Are we certain about the model?
- Given that we are certain about the model, are we certain about the parameter values?
The first question is both an art and a science. A good model specification will have high concordance with what we know about how the world works, and I find that difficult to quantify. For example, how would one quantify the correctness of this model with respect to its concordance with reality? We’re really asking a philosophical question about whether we believe in causality at all! To answer the first question, where possible and defensible, I would much rather inject as much knowledge as I can about how X causes Y into my model - something like the sketch below.
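To make that concrete, here’s a toy sketch in PyMC (the data, priors, and effect sizes are all invented for illustration) of what “injecting knowledge that X causes Y” can look like: we commit to the structural direction Y ~ f(X), and encode domain beliefs - e.g. that the effect is positive and of modest size - as priors.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
x_obs = rng.normal(size=50)                               # stand-in "X"
y_obs = 0.8 * x_obs + rng.normal(scale=0.5, size=50)      # stand-in "Y"

with pm.Model():
    # Structure: Y is modelled as a function of X, not the other way around.
    # Priors: domain belief that the effect is positive and not huge.
    effect = pm.TruncatedNormal("effect", mu=1.0, sigma=0.5, lower=0.0)
    noise = pm.HalfNormal("noise", sigma=1.0)
    pm.Normal("y", mu=effect * x_obs, sigma=noise, observed=y_obs)
    idata = pm.sample(1000, tune=1000, progressbar=False)
```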
With deep nets, we sort of throw out the first question - since deep nets can approximate (nearly) any function, we don’t worry much about the model specification. That said, deep nets have the capacity to overfit to the data, which is where train/test splits come in handy.
What about Bayesian neural networks (BNNs), then? From intuition and empirical observation, I can offer this (perhaps not-so-satisfying, and maybe somewhat-incorrect) explanation: when fitting a BNN, some parameters will so obviously take a certain value that their posteriors quickly concentrate there during fitting. For other parameters, we’ll be less certain (or perhaps we never need to be certain about them). If we hold out data, we’re simply giving the model less data, and so we’ll just be less certain about the parameter values - the quick sketch below illustrates this.
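Here’s a small, made-up demonstration of that last claim (assuming a recent PyMC and ArviZ; none of this is from the original thread): fit the same model on the full data and on half of it, and watch the posterior standard deviation of the slope grow.

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)

def fit(x_obs, y_obs):
    with pm.Model():
        slope = pm.Normal("slope", 0.0, 10.0)
        sigma = pm.HalfNormal("sigma", 5.0)
        pm.Normal("y", mu=slope * x_obs, sigma=sigma, observed=y_obs)
        return pm.sample(1000, tune=1000, progressbar=False, random_seed=42)

idata_full = fit(x, y)              # all 200 observations
idata_half = fit(x[:100], y[:100])  # pretend we held half of it out

# Less data -> a wider posterior for `slope` (larger sd), i.e. less certainty.
print(az.summary(idata_full, var_names=["slope"])[["mean", "sd"]])
print(az.summary(idata_half, var_names=["slope"])[["mean", "sd"]])
```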
Counterfactually, if we didn’t go Bayesian, there’s the potential for those “uncertain” parameters to take on whatever point estimates let us overfit.
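For contrast, a small scikit-learn sketch (entirely hypothetical data; the degree-9 polynomial features are chosen just to give the model room to overfit): unregularised least-squares point estimates are free to chase noise, while `BayesianRidge`’s Gaussian priors shrink the coefficients we have no real evidence for.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x_train = np.sort(rng.uniform(-1, 1, size=15))
y_train = np.sin(3 * x_train) + rng.normal(scale=0.2, size=15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

poly = PolynomialFeatures(degree=9)
X_train = poly.fit_transform(x_train[:, None])
X_test = poly.transform(x_test[:, None])

ols = LinearRegression().fit(X_train, y_train)   # unregularised point estimates
bayes = BayesianRidge().fit(X_train, y_train)    # Gaussian priors on coefficients

# The point estimates are free to chase the noise; typically the OLS test
# error comes out noticeably larger than the Bayesian one here.
print("OLS test MSE:     ", mean_squared_error(y_test, ols.predict(X_test)))
print("Bayesian test MSE:", mean_squared_error(y_test, bayes.predict(X_test)))
```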
Thus, where I was going with that train/test split comment was this: conditioned on the model being correct (or being a black box), we don’t need train/test splits in a Bayesian setting. That said, God help us all if we have our model specified incorrectly!
In conclusion, I concur with @junpenglao’s closing point - “model performance should be evaluated by whether it can predict future data”. Use as many methods as you can to justify the model structure - train/test splits, argumentation with (and learning from) subject matter experts, information criteria, leave-one-out CV. That’s the more important step in the modeling process, and IMHO, given the complexity of the world and the approximate nature of our models, there shouldn’t be “one and preferably only one way” to evaluate how good a model is.
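For the “use as many methods as you can” part, here is a sketch of how the information-criterion / LOO-CV side looks with ArviZ. The two candidate models and the data are invented, and it assumes PyMC >= 5, where `idata_kwargs={"log_likelihood": True}` stores the pointwise log-likelihood that `az.loo` and `az.waic` need.

```python
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(scale=1.0, size=100)

def fit(include_slope):
    # Two candidate structures for the same data: linear in x, or intercept-only.
    with pm.Model():
        if include_slope:
            mu = pm.Normal("slope", 0.0, 5.0) * x
        else:
            mu = pm.Normal("mu", 0.0, 5.0)
        sigma = pm.HalfNormal("sigma", 2.0)
        pm.Normal("y", mu=mu, sigma=sigma, observed=y)
        return pm.sample(1000, tune=1000, progressbar=False,
                         idata_kwargs={"log_likelihood": True})

idata_linear = fit(include_slope=True)
idata_flat = fit(include_slope=False)

print(az.loo(idata_linear))    # PSIS leave-one-out CV estimate
print(az.waic(idata_linear))   # WAIC as a cross-check
print(az.compare({"linear": idata_linear, "intercept_only": idata_flat}))
```

None of these numbers is the final word on its own; they’re just more evidence to weigh alongside subject-matter arguments and, where it makes sense, an explicit held-out test set.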