Evaluating Regression Model Uncertainty with Withheld Data

I am working on a simple method for evaluating the uncertainty estimates of PyMC regression models. So far, my method is to withhold some data from model training, compute 90% confidence bounds for the coefficient of each covariate X from its posterior distribution, and then predict a 90% confidence range for the Y value of each withheld row given its X values. My presumption is that a perfect model would produce error bounds that 90% of the withheld Y values fall inside.

In Python code, my approach looks something like this (assuming I am fitting a quadratic model with two covariates X1 and X2).

import numpy as np

# get posterior estimates for coefficients from the PyMC model
coef1 = trace.posterior.X1_coef.data.flatten()
coef2 = trace.posterior.X1_sq_coef.data.flatten()
coef3 = trace.posterior.X2_coef.data.flatten()
coef4 = trace.posterior.X2_sq_coef.data.flatten()

# get 90% confidence interval for each estimate using z-score 1.645
# (this treats each posterior as approximately normal)
coef1_low = np.mean(coef1) - (np.std(coef1) * 1.645)
coef1_high = np.mean(coef1) + (np.std(coef1) * 1.645)
coef2_low = np.mean(coef2) - (np.std(coef2) * 1.645)
coef2_high = np.mean(coef2) + (np.std(coef2) * 1.645)
coef3_low = np.mean(coef3) - (np.std(coef3) * 1.645)
coef3_high = np.mean(coef3) + (np.std(coef3) * 1.645)
coef4_low = np.mean(coef4) - (np.std(coef4) * 1.645)
coef4_high = np.mean(coef4) + (np.std(coef4) * 1.645)

# calculate percent of withheld data points inside/outside range
in_range = 0
out_range = 0

for row in withheld_rows:

    X1_range = sorted([(row.X1 * coef1_low), (row.X1 * coef1_high)])
    X1_2_range = sorted([(np.square(row.X1) * coef2_low), (np.square(row.X1) * coef2_high)])
    X2_range = sorted([(row.X2 * coef3_low), (row.X2 * coef3_high)])
    X2_2_range = sorted([(np.square(row.X2) * coef4_low), (np.square(row.X2) * coef4_high)])

    predict_Y_low = X1_range[0] + X1_2_range[0] + X2_range[0] + X2_2_range[0]
    predict_Y_high = X1_range[1] + X1_2_range[1] + X2_range[1] + X2_2_range[1]

    if predict_Y_low <= row.Y <= predict_Y_high:
        in_range += 1
    else:
        out_range += 1

print("% in range:", in_range / len(withheld_rows))
print("& out of range:", out_range / len(withheld_rows))

Now, for some of the datasets/models I am working with, I get pretty darn close to 90% in range (80-90%). For others, I get something somewhat close (60-70% in range), and for others, absolutely terrible results (5-10% in range). This huge range of results makes me think something isn’t quite right with the method I’m using, but I’m not sure what.

A couple of high-level questions:

  1. Is this method reasonable for evaluating PyMC model uncertainty? Is there another accepted way of doing this that I should know about?
  2. Could the fact that some of my models place a very small % of the withheld data “in range” (for example, 10% when I am expecting ~90% for a good model) stem from model error, or is the metric I am using the more likely culprit?

Thanks everyone!

I’ve been thinking about forecast evaluation for my research project lately, too. I don’t know what other people do, but one idea I’ve had is to use proper scoring rules for univariate or multivariate probabilistic forecasts. The approaches developed for ensemble forecasts also apply to posterior predictive samples. What you are describing is similar to pinball loss or quantile scoring. You could also look at the CRPS / energy score and variogram-based scoring rules. There are also graphical (qualitative) methods, such as probability integral transform (PIT) histograms, for checking model calibration.
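To make that concrete, here is a minimal sketch of sample-based versions of a few of these checks. It assumes you have a matrix of posterior predictive draws for the withheld rows, y_pred_samples, with shape (n_draws, n_withheld) (e.g. obtained via pm.sample_posterior_predictive), and the withheld observations in y_obs; both names are placeholders, not anything your code already defines.

import numpy as np

# Hypothetical inputs (not from the code above):
#   y_pred_samples: posterior predictive draws, shape (n_draws, n_withheld)
#   y_obs:          withheld observations, shape (n_withheld,)

def interval_coverage(y_pred_samples, y_obs, level=0.90):
    # fraction of observations falling inside the central `level` predictive interval
    alpha = (1.0 - level) / 2.0
    lo = np.quantile(y_pred_samples, alpha, axis=0)
    hi = np.quantile(y_pred_samples, 1.0 - alpha, axis=0)
    return np.mean((y_obs >= lo) & (y_obs <= hi))

def pinball_loss(y_pred_samples, y_obs, q=0.95):
    # average pinball (quantile) loss at quantile level q; lower is better
    pred_q = np.quantile(y_pred_samples, q, axis=0)
    diff = y_obs - pred_q
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

def crps_from_samples(y_pred_samples, y_obs):
    # sample-based CRPS estimate per observation: E|X - y| - 0.5 * E|X - X'|; lower is better
    abs_err = np.mean(np.abs(y_pred_samples - y_obs), axis=0)
    spread = np.mean(
        np.abs(y_pred_samples[:, None, :] - y_pred_samples[None, :, :]), axis=(0, 1)
    )
    return np.mean(abs_err - 0.5 * spread)

def pit_values(y_pred_samples, y_obs):
    # PIT: fraction of predictive draws below each observation
    return np.mean(y_pred_samples <= y_obs, axis=0)

These all work directly on the posterior predictive distribution of Y rather than on per-coefficient intervals. Checking interval_coverage at several levels (say 50%, 80%, 90%) and plotting a histogram of the PIT values gives a quick read on calibration: a well-calibrated model should hit roughly the nominal coverage at each level and produce approximately uniform PIT values.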

Here are some references to look at:

Gneiting, T., & Katzfuss, M. (2014). Probabilistic Forecasting. Annual Review of Statistics and Its Application, 1(1), 125–151. https://doi.org/10.1146/annurev-statistics-062713-085831

Bjerregård, M. B., Møller, J. K., & Madsen, H. (2021). An introduction to multivariate probabilistic forecast evaluation. Energy and AI, 4, 100058. https://doi.org/10.1016/j.egyai.2021.100058

Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L., & Johnson, N. A. (2008). Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2), 211–235. https://doi.org/10.1007/s11749-008-0114-x

Scheuerer, M., & Hamill, T. M. (2015). Variogram-Based Proper Scoring Rules for Probabilistic Forecasts of Multivariate Quantities. Monthly Weather Review, 143(4), 1321–1334. https://doi.org/10.1175/MWR-D-14-00269.1