I’ve run into similar issues in my own research. Here are some of the visualizations I’ve used:
- What you’re describing sounds like a learning curve built up through cross-validation. Here’s how I’ve shown that before to compare how quickly different model variations improve with the amount of data they’re given. These are metrics on the test set of my data, with the size of the training set on the x-axis (I showed similar curves for the training set). I drew ~100 random subsets for each training-set size to get the confidence intervals; a sketch of that procedure follows this list. Some people overlay confidence intervals on plots like this, but I find those basically impossible to read. Showing the NLPD (negative log probability density) is good because it is a nice indicator of whether the uncertainty in your predictions is properly calibrated: a high RMSE implies your model is inaccurate, whereas a high NLPD implies your model is overconfident (because the real observations are occurring in regions of low probability density, meaning the predictive uncertainty is concentrated too far away from the predictions).
- Another way to examine calibration is a scatter plot of predictive uncertainty (x-axis) vs observed absolute error (y-axis) at each point, along with a straight line with slope=2. Roughly 95% of your points should fall below that line, since ~95% of observations should land within two standard deviations of the prediction; lots of points above it indicate overconfidence, while everything hugging the x-axis indicates underconfidence. There’s a sketch of this after the list.
- You could also show actual vs predicted with error bars, hoping that the data cluster around a line with slope 1 (sketch below the list).
- My suggestion for storing the data at each iteration of your cross-validation loop is to instead store a reproducible way to re-generate the data. For example, if you use np.random.choice to split your data into test/train, then all you need to store is the random seed and the subset size, and you’ll always be able to reconstruct your test/train split (see the last sketch below).
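Here’s a minimal sketch of the learning-curve procedure from the first bullet. It assumes NumPy arrays X/y and uses scikit-learn’s GaussianProcessRegressor purely as a stand-in for whatever models you’re comparing; the function name, metric choices, and the 100-repeat count are placeholders to adapt to your setup.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def learning_curve(X, y, train_sizes, n_repeats=100, seed=0):
    """RMSE and NLPD on held-out data vs. training-set size, over repeated random splits."""
    rng = np.random.default_rng(seed)
    results = {size: {"rmse": [], "nlpd": []} for size in train_sizes}
    for size in train_sizes:
        for _ in range(n_repeats):
            # Random train/test split; the test set is everything not drawn for training
            train_idx = rng.choice(len(X), size=size, replace=False)
            test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
            model = GaussianProcessRegressor(normalize_y=True).fit(X[train_idx], y[train_idx])
            mu, sd = model.predict(X[test_idx], return_std=True)
            resid = y[test_idx] - mu
            results[size]["rmse"].append(np.sqrt(np.mean(resid ** 2)))
            # NLPD: mean negative log predictive density of the held-out observations
            results[size]["nlpd"].append(-np.mean(norm.logpdf(y[test_idx], loc=mu, scale=sd)))
    return results

# Summarize each metric with its median and a 2.5%-97.5% interval across the repeats,
# then plot the median vs. training size with the interval as a band or error bars.
```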
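And a sketch of the calibration scatter from the second bullet. The names mu, sd, and y_test are assumed to be the predictive means, predictive standard deviations, and held-out observations from a loop like the one above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed inputs: mu, sd = predictive means / standard deviations; y_test = held-out observations
abs_err = np.abs(y_test - mu)

fig, ax = plt.subplots()
ax.scatter(sd, abs_err, alpha=0.5)
lims = [0, max(sd.max(), abs_err.max())]
ax.plot(lims, [2 * v for v in lims], "k--", label="|error| = 2·σ")
ax.set_xlabel("predictive uncertainty (σ)")
ax.set_ylabel("observed |error|")
ax.legend()
plt.show()

# If the uncertainties are calibrated, ~95% of points fall below the dashed line;
# many points above it suggest overconfidence, everything far below it underconfidence.
```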
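The actual-vs-predicted plot from the third bullet, with the same assumed mu/sd/y_test:

```python
import matplotlib.pyplot as plt

# Assumed inputs: mu, sd = predictive means / standard deviations; y_test = held-out observations
fig, ax = plt.subplots()
ax.errorbar(y_test, mu, yerr=2 * sd, fmt="o", alpha=0.5, label="predicted ± 2σ")
lims = [min(y_test.min(), mu.min()), max(y_test.max(), mu.max())]
ax.plot(lims, lims, "k--", label="slope = 1")
ax.set_xlabel("observed")
ax.set_ylabel("predicted")
ax.legend()
plt.show()
```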
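Finally, a sketch of the seed-storing idea from the last bullet, using NumPy’s default_rng (the same works with np.random.seed plus np.random.choice); make_split and the record layout are hypothetical names for illustration.

```python
import numpy as np

def make_split(n_total, n_train, seed):
    """Reconstruct an identical train/test split from just (seed, n_train)."""
    rng = np.random.default_rng(seed)
    train_idx = rng.choice(n_total, size=n_train, replace=False)
    test_idx = np.setdiff1d(np.arange(n_total), train_idx)
    return train_idx, test_idx

# Store only the seed and subset size for each CV iteration...
records = [{"seed": seed, "n_train": 50} for seed in range(100)]

# ...and regenerate any split on demand; the same arguments always give the same split.
train_idx, test_idx = make_split(n_total=200, **records[0])
```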
I’ve actually written a package to make a lot of this easier: it’s called Gumbi (GitHub, docs). It’s designed for quickly prototyping GPs on tabular (pandas) data, and it even includes a tool to help with cross-validation. You still have to run a loop, but it makes each iteration easier to write (be sure to see the note about reproducible randomness).

