As we evolve and enhance a model, we want to make sure we don't accidentally break things that used to work. This is a common problem in any software development. The standard software-engineering approach is to build a suite of regression tests; with every change, the tests are run, alerting us when something is newly broken.
Do you build up regression tests for your model? If not, how do you know when you accidentally break something?
What about keeping track of model estimates like loo? If I wanted to go minimalist, maybe that would work.
What ideas did you have?
loo would be useful, to see if predictive fit declines. Also useful: checking that sampling does not diverge. And of course checking whether the posterior distributions are (approximately) the same as before.
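Not anyone's actual setup, just a minimal sketch of the "posteriors approximately the same as before" check in plain NumPy. The dict layout, the parameter name, and the 0.2-standard-deviation tolerance are all assumptions:

```python
import numpy as np

def posteriors_match(old_draws, new_draws, tol_sd=0.2):
    """Flag parameters whose posterior mean moved by more than
    tol_sd old-posterior standard deviations between model versions.

    old_draws / new_draws: dicts mapping parameter name -> 1-D array
    of posterior draws (e.g. extracted from a fit and saved to disk).
    """
    mismatches = []
    for name, old in old_draws.items():
        new = new_draws[name]
        shift = abs(new.mean() - old.mean()) / old.std()
        if shift > tol_sd:
            mismatches.append((name, round(float(shift), 3)))
    return mismatches

# Synthetic demo: a small shift in the mean stays under the tolerance.
rng = np.random.default_rng(0)
old = {"beta": rng.normal(0.0, 1.0, 4000)}
new = {"beta": rng.normal(0.05, 1.0, 4000)}
print(posteriors_match(old, new))  # a small shift: no mismatches
```

A shift of a whole standard deviation would come back flagged, which is the eyeball comparison in automatable form.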
One practical problem: sampling is compute-intensive, so it is not practical to sample many different times, with different observed values. Who wants to wait for hours before getting feedback on whether something is newly broken? Tossing the work onto AWS (or another cloud provider) could solve that problem, at some cost.
Currently, I run a test or two by hand and eyeball the results. That’s better than nothing, but sometimes I miss an issue. When I later catch it, I have to dig through old versions to see when the issue first appeared.
Does anyone do better?
Are you thinking about the data changing over time and concepts drifting away from the original training data? Or your changing of the model structure and wanting to gauge the changes?
If the former, I quite like doing coverage (aka calibration) checks of the posterior predictive distribution against observations. Changes in coverage for holdout sets and/or new datasets over time can indicate drift.
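For what it’s worth, a bare-bones version of that coverage check, hand-rolled in NumPy rather than via any particular PPC library; the 90% level is just an example:

```python
import numpy as np

def interval_coverage(ppc_draws, observed, level=0.9):
    """Fraction of observations inside the central `level` posterior
    predictive interval. For a well-calibrated model this should sit
    close to `level`; drift shows up as coverage wandering away.

    ppc_draws: (n_draws, n_obs) posterior predictive samples.
    observed:  (n_obs,) held-out or newly arrived observations.
    """
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(ppc_draws, alpha, axis=0)
    hi = np.quantile(ppc_draws, 1.0 - alpha, axis=0)
    return float(np.mean((observed >= lo) & (observed <= hi)))

# Synthetic demo: predictive draws and observations from the same
# distribution, so coverage should land near the nominal 0.9.
rng = np.random.default_rng(1)
ppc = rng.normal(0.0, 1.0, size=(4000, 200))
obs = rng.normal(0.0, 1.0, size=200)
print(interval_coverage(ppc, obs))
```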
If the latter, I’m not sure how that would look different to a normal model development workflow, since altering the model structure and joint posterior might reasonably have any number of subtle associated effects. I’d be interested to hear more ideas!
The latter: changing the model structure and gauging the impact of the changes.
Is this different from a normal model development workflow? It would be, at least for my model development workflow. Here’s what I do today:
1. make a change
2. examine some parameters, to make sure nothing obvious is now broken
3. examine the parameters I expect to be affected by the change
But step 2 is far from thorough. I often miss stuff that is broken.
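One way I could imagine automating that parameter-examination step (a sketch, not something I actually run; the filename and the 0.2-sd tolerance are made up): summarize every parameter after each refit and diff the summaries against a saved reference.

```python
import json
import os

import numpy as np

REF_FILE = "posterior_reference.json"  # hypothetical reference file

def summarize(draws):
    """Reduce each parameter's draws to a mean and sd."""
    return {k: {"mean": float(np.mean(v)), "sd": float(np.std(v))}
            for k, v in draws.items()}

def check_against_reference(draws, tol_sd=0.2):
    """Return parameters whose mean moved more than tol_sd reference
    sds since the saved reference. Creates the reference on first run."""
    summary = summarize(draws)
    if not os.path.exists(REF_FILE):
        with open(REF_FILE, "w") as f:
            json.dump(summary, f)
        return []
    with open(REF_FILE) as f:
        ref = json.load(f)
    broken = []
    for name, s in summary.items():
        r = ref.get(name)
        if r is None:
            continue  # parameter new in this version; nothing to diff
        if abs(s["mean"] - r["mean"]) > tol_sd * r["sd"]:
            broken.append(name)
    return broken
```

Run it after every change: an empty list means nothing drifted, and a non-empty list names exactly which parameters to go stare at.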
Maybe my model development workflow is atypical?
Aha well, that workflow doesn’t sound unreasonable. Have you read this recent treatise from Gelman and co? http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf Contains reams of guidance!
To state the obvious, I’ve found prior predictive checks can really help quickly ‘debug’ the model architecture as you add more cornices / finials / gargoyles. Tiny changes in parameters, or introducing new features with accidentally large ranges (e.g. ones you didn’t standardize beforehand), can play havoc with marginals.
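By way of illustration, here is a toy prior predictive simulation (not any real model; the Normal(0, 1) and Exponential(1) priors are invented for the example) that makes that havoc visible before any fitting happens:

```python
import numpy as np

def prior_predictive(x, n_sims=1000, rng=None):
    """Simulate datasets y = alpha + beta * x + noise with
    alpha, beta ~ Normal(0, 1) and sigma ~ Exponential(1)."""
    rng = rng or np.random.default_rng()
    sims = np.empty((n_sims, len(x)))
    for i in range(n_sims):
        alpha = rng.normal(0.0, 1.0)
        beta = rng.normal(0.0, 1.0)
        sigma = rng.exponential(1.0)
        sims[i] = alpha + beta * x + rng.normal(0.0, sigma, len(x))
    return sims

rng = np.random.default_rng(3)
x_std = np.linspace(-2.0, 2.0, 100)        # standardized predictor
sims = prior_predictive(x_std, rng=rng)
print(np.quantile(sims, [0.05, 0.95]))     # modest, plausible range

# Swap in an unstandardized predictor with a huge range and the same
# priors imply absurd outcomes: the havoc with marginals, made visible.
sims_bad = prior_predictive(x_std * 500.0, rng=rng)
print(np.quantile(sims_bad, [0.05, 0.95]))
```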
Wow! That’s a whole (short) book on Bayesian workflow. Thanks for the suggestion. I will read.
Cornices, finials, and gargoyles!
My models are mostly gargoyles …