I am working on a simple method for evaluating the uncertainty estimates of PyMC regression models. So far, my method is to withhold some data from model training, compute 90% credible bounds for each covariate's coefficient from its posterior distribution, and then predict a 90% confidence range for the Y value of each withheld row given its X values. My presumption is that a perfectly calibrated model would produce bounds that contain 90% of the withheld Y values.

In Python code, my approach looks something like this (assuming I am fitting a quadratic model with two covariates X1 and X2).

```python
import numpy as np

# get posterior samples for each coefficient from the PyMC trace
coef1 = trace.posterior.X1_coef.data.flatten()
coef2 = trace.posterior.X1_sq_coef.data.flatten()
coef3 = trace.posterior.X2_coef.data.flatten()
coef4 = trace.posterior.X2_sq_coef.data.flatten()

# 90% interval for each coefficient, using the normal z-score 1.645
def interval_90(samples):
    return np.mean(samples) - np.std(samples) * 1.645, np.mean(samples) + np.std(samples) * 1.645

coef1_low, coef1_high = interval_90(coef1)
coef2_low, coef2_high = interval_90(coef2)
coef3_low, coef3_high = interval_90(coef3)
coef4_low, coef4_high = interval_90(coef4)

# count withheld data points inside/outside the predicted range
in_range = 0
out_range = 0
for row in withheld_rows:
    X1_range = sorted([row.X1 * coef1_low, row.X1 * coef1_high])
    X1_2_range = sorted([np.square(row.X1) * coef2_low, np.square(row.X1) * coef2_high])
    X2_range = sorted([row.X2 * coef3_low, row.X2 * coef3_high])
    X2_2_range = sorted([np.square(row.X2) * coef4_low, np.square(row.X2) * coef4_high])
    predict_Y_low = X1_range[0] + X1_2_range[0] + X2_range[0] + X2_2_range[0]
    predict_Y_high = X1_range[1] + X1_2_range[1] + X2_range[1] + X2_2_range[1]
    if predict_Y_low <= row.Y <= predict_Y_high:
        in_range += 1
    else:
        out_range += 1

print("% in range:", in_range / len(withheld_rows))
print("% out of range:", out_range / len(withheld_rows))
```

Now, for some of the datasets/models I am working with, I get pretty darn close to 90% in range (80-90%). For others, I get something somewhat close (60-70% in range), and for others, absolutely terrible results (5-10% in range). This huge range of results makes me think something isn't quite right with the method I'm using, but I'm not sure what.

A couple of high level questions:

- Is this method reasonable for evaluating PyMC model uncertainty? Is there another accepted way of doing this that I should know about?
- Is the fact that some of my models are predicting a very small % of the withheld data to be “in range” (for example, 10% when I am expecting ~90% for a good model) possibly stemming from model error, or is the most likely culprit the metric I am using?
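To make the first question concrete: the main alternative I can imagine (but have not validated) is to form one prediction per posterior draw and take quantiles of those per-draw predictions for each row, rather than combining per-coefficient intervals. Below is a minimal, self-contained sketch of that idea. The coefficient samples and withheld data are synthetic stand-ins so it runs on its own; with a real model I would use `trace.posterior.X1_coef.data.flatten()` etc., as in my code above.

```python
import numpy as np

# Synthetic stand-ins for posterior samples and withheld data.
rng = np.random.default_rng(42)
n_draws, n_rows = 2000, 100

# fake posterior samples centered on "true" coefficients (1.0, -0.5, 2.0, 0.3)
coefs = rng.normal(loc=[1.0, -0.5, 2.0, 0.3], scale=0.1, size=(n_draws, 4))

# fake withheld rows generated from the same "true" coefficients
X1 = rng.normal(size=n_rows)
X2 = rng.normal(size=n_rows)
design = np.column_stack([X1, X1**2, X2, X2**2])  # shape (n_rows, 4)
Y = design @ np.array([1.0, -0.5, 2.0, 0.3])

# one predicted Y per (draw, row): shape (n_draws, n_rows)
pred = coefs @ design.T

# per-row 90% interval from the 5th/95th percentiles across draws
low, high = np.percentile(pred, [5, 95], axis=0)

in_range = np.mean((Y >= low) & (Y <= high))
print("% in range:", in_range)
```

Because the interval here comes from the joint distribution of the posterior draws, it would account for correlations between coefficients, which my per-coefficient bounds do not.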

Thanks everyone!