I do like your idea of checking how often the held-out values fall within the predictive credible intervals. Could you expand on this slightly? I'm not sure I fully understand it, but I think it would be really useful.
Sure - this is basically just a check that the uncertainty you get from your posterior (predictive uncertainty for GP values at new points, or inferential uncertainty for estimated parameters) matches the true underlying variation. If you repeat an experiment with held-out data 1000 times and only 500 of the held-out values fall within the 90% credible interval, then your intervals are too narrow. If 995 of them do, you still have a problem: too many fall inside, so the reported uncertainty is too large. For a variety of reasons related to optimal decision making and estimation, it is desirable to have the actual coverage probability, i.e. the proportion of times the true value falls within the predicted interval, as close as possible to its nominal value.
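As a rough illustration of what I mean (not part of any particular library's API, and the array names are placeholders for whatever your model produces): given posterior-predictive samples at a set of held-out inputs, the empirical coverage of a 90% interval can be checked along these lines.

```python
import numpy as np

def empirical_coverage(pred_samples, y_heldout, level=0.90):
    """Fraction of held-out values falling inside the central `level` credible interval.

    pred_samples: array of shape (n_draws, n_heldout), posterior-predictive draws
                  at the held-out inputs (placeholder name).
    y_heldout:    array of shape (n_heldout,), the observed held-out values.
    """
    alpha = 1.0 - level
    lower = np.percentile(pred_samples, 100 * alpha / 2, axis=0)
    upper = np.percentile(pred_samples, 100 * (1 - alpha / 2), axis=0)
    inside = (y_heldout >= lower) & (y_heldout <= upper)
    return inside.mean()

# If the reported uncertainty is well calibrated, this should come out
# reasonably close to the nominal level, e.g. ~0.90 for level=0.90.
```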
Now, it may not exactly suit your purpose as a stopping criterion: even with very few data points, the GP may already give predictive credible intervals (e.g. the range between the 5% and 95% percentiles of the predictive distribution) whose coverage is close to its nominal, i.e. desired, level. With a large amount of data, though, the predictive credible intervals may shrink below some width that you could use as a threshold. Of course, this may depend on the predictive distribution over multiple points. In my experience, the implementation of GPs in PyMC, together with the priors recommended in the docs, gives coverage probabilities very close to their nominal levels for most non-pathological tasks.
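To make the width-based stopping idea concrete, here is a small sketch in the same spirit (again with placeholder array names, and an arbitrary 0.2 threshold you would choose for your own problem): stop collecting data once the 90% predictive interval is narrower than the chosen width at every candidate point.

```python
import numpy as np

def interval_widths(pred_samples, level=0.90):
    """Width of the central `level` credible interval at each prediction point."""
    alpha = 1.0 - level
    lower = np.percentile(pred_samples, 100 * alpha / 2, axis=0)
    upper = np.percentile(pred_samples, 100 * (1 - alpha / 2), axis=0)
    return upper - lower

def should_stop(pred_samples, width_threshold=0.2, level=0.90):
    """Stop once every predictive interval is narrower than the threshold.

    Checking all points jointly is one way to account for the predictive
    distribution over multiple locations, as mentioned above.
    """
    return bool(np.all(interval_widths(pred_samples, level) < width_threshold))
```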