Evaluating Regression Model Uncertainty with Withheld Data

I am working on a simple method for evaluating the uncertainty estimates of PyMC regression models. So far, my method is to withhold some data from model training, compute 90% confidence bounds for the coefficient of each covariate X from its posterior distribution, and then predict a 90% confidence range for the Y value of each withheld row given its X values. My presumption is that a perfect model would produce error bounds that 90% of the withheld Y values fall inside.

In Python code, my approach looks something like this (assuming I am fitting a quadratic model with two covariates X1 and X2).

import numpy as np

# get posterior estimates for coefficients from the PyMC model
coef1 = trace.posterior.X1_coef.data.flatten()
coef2 = trace.posterior.X1_sq_coef.data.flatten()
coef3 = trace.posterior.X2_coef.data.flatten()
coef4 = trace.posterior.X2_sq_coef.data.flatten()

# get 90% confidence interval for each estimate using z-score 1.645
# (this treats each posterior as approximately normal)
coef1_low = np.mean(coef1) - (np.std(coef1) * 1.645)
coef1_high = np.mean(coef1) + (np.std(coef1) * 1.645)
coef2_low = np.mean(coef2) - (np.std(coef2) * 1.645)
coef2_high = np.mean(coef2) + (np.std(coef2) * 1.645)
coef3_low = np.mean(coef3) - (np.std(coef3) * 1.645)
coef3_high = np.mean(coef3) + (np.std(coef3) * 1.645)
coef4_low = np.mean(coef4) - (np.std(coef4) * 1.645)
coef4_high = np.mean(coef4) + (np.std(coef4) * 1.645)

# calculate percent of withheld data points inside/outside range
in_range = 0
out_range = 0

for row in withheld_rows:

    X1_range = sorted([(row.X1 * coef1_low), (row.X1 * coef1_high)])
    X1_2_range = sorted([(np.square(row.X1) * coef2_low), (np.square(row.X1) * coef2_high)])
    X2_range = sorted([(row.X2 * coef3_low), (row.X2 * coef3_high)])
    X2_2_range = sorted([(np.square(row.X2) * coef4_low), (np.square(row.X2) * coef4_high)])

    predict_Y_low = X1_range[0] + X1_2_range[0] + X2_range[0] + X2_2_range[0]
    predict_Y_high = X1_range[1] + X1_2_range[1] + X2_range[1] + X2_2_range[1]

    if predict_Y_low <= row.Y <= predict_Y_high:
        in_range += 1
    else:
        out_range += 1

print("% in range:", in_range / len(withheld_rows))
print("& out of range:", out_range / len(withheld_rows))

Now, for some of the datasets/models I am working with, I get pretty darn close to 90% in range (80-90%). For others, I get something somewhat close (60-70% in range), and for others, absolutely terrible results (5-10% in range). This huge range of results makes me think something isn’t quite right with the method I’m using, but I’m not sure what.

A couple of high-level questions:

  1. Is this method reasonable for evaluating PyMC model uncertainty? Is there another accepted way of doing this that I should know about?
  2. Could the fact that some of my models place a very small % of the withheld data “in range” (for example, 10% when I am expecting ~90% for a good model) stem from model error, or is the metric I am using the more likely culprit?

Thanks everyone!

I’ve been thinking about forecast evaluation for my research project lately, too. I don’t know what other people do, but one idea I’ve had is to use proper scoring rules for univariate or multivariate probabilistic forecasts. The approaches developed for ensemble forecasts also apply to posterior predictive samples. What you are describing is similar to pinball loss or quantile scoring. You could also look at the CRPS / energy score and variogram-based scoring rules. There are also graphical (qualitative) methods, such as probability integral transform (PIT) histograms, for checking model calibration.
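To make that concrete, here is a minimal sketch of sample-based versions of a few of these checks. It assumes you have a matrix of posterior predictive draws for the withheld rows, y_pred_samples, with shape (n_draws, n_withheld) (e.g. obtained via pm.sample_posterior_predictive), and the withheld observations in y_obs; both names are placeholders, not anything your code already defines.

import numpy as np

# Hypothetical inputs (not from the code above):
#   y_pred_samples: posterior predictive draws, shape (n_draws, n_withheld)
#   y_obs:          withheld observations, shape (n_withheld,)

def interval_coverage(y_pred_samples, y_obs, level=0.90):
    # fraction of observations falling inside the central `level` predictive interval
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(y_pred_samples, alpha, axis=0)
    hi = np.quantile(y_pred_samples, 1.0 - alpha, axis=0)
    return np.mean((y_obs >= lo) & (y_obs <= hi))

def pinball_loss(y_pred_samples, y_obs, q=0.95):
    # average pinball (quantile) loss at quantile level q; lower is better
    pred_q = np.quantile(y_pred_samples, q, axis=0)
    diff = y_obs - pred_q
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

def crps_from_samples(y_pred_samples, y_obs):
    # sample-based CRPS estimate per observation: E|X - y| - 0.5 * E|X - X'|; lower is better
    abs_err = np.mean(np.abs(y_pred_samples - y_obs), axis=0)
    spread = np.mean(
        np.abs(y_pred_samples[:, None, :] - y_pred_samples[None, :, :]), axis=(0, 1)
    )
    return np.mean(abs_err - 0.5 * spread)

def pit_values(y_pred_samples, y_obs):
    # PIT: fraction of predictive draws below each observation
    return np.mean(y_pred_samples <= y_obs, axis=0)

These all work directly on the posterior predictive distribution of Y rather than on per-coefficient intervals. Checking interval_coverage at several levels (say 50%, 80%, 90%) and plotting a histogram of the PIT values gives a quick read on calibration: a well-calibrated model should hit roughly the nominal coverage at each level and produce approximately uniform PIT values.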

Here are some references to look at:

Gneiting, T., & Katzfuss, M. (2014). Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1(1), 125–151. https://doi.org/10.1146/annurev-statistics-062713-085831

Bjerregård, M. B., Møller, J. K., & Madsen, H. (2021). An introduction to multivariate probabilistic forecast evaluation. Energy and AI, 4, 100058. https://doi.org/10.1016/j.egyai.2021.100058

Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L., & Johnson, N. A. (2008). Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2), 211–235. https://doi.org/10.1007/s11749-008-0114-x

Scheuerer, M., & Hamill, T. M. (2015). Variogram-Based Proper Scoring Rules for Probabilistic Forecasts of Multivariate Quantities. Monthly Weather Review, 143(4), 1321–1334. https://doi.org/10.1175/MWR-D-14-00269.1