How to verify that uncertainty (estimated from pymc3) is accurate?

Hi everyone,

I understood that curve fitting using PYMC3 can provide model uncertainty, however, how do I verify that the uncertainty given by pymc3 is correct or not? Are there any examples or studies, which demonstrate that uncertainty estimated by pymc3 is accurate?

I have compared pymc3 linear regression vs. a non-linear least squared fit from scipy and the results are shown below

Parameters (slope and intercept) are almost identical. scipy fit can also provide one standard deviation errors on the parameters, which are also similar to uncertainties estimated from pymc3. See the image below:

Note that, pymc3 took almost 4 minutes while scipy only took 2 ms. I thought the only reason I would use pymc3 if it could provide more accurate uncertainty but I haven’t found any rigorous study showing how accurate pymc3’s uncertainty measure is. Any suggestions would be really appreciated.


Monte Carlo methods would simulate the probability distribution of parameters. We can calculate the variance of this distribution afterwards. If we have an equation that can directly and easily calculate the uncertainty of each parameters, there is no point to use Monte Carlo methods.

For example, the closed form equation is better than pymc for the task of fitting a line to points.

There are many things to unpack here, so here are a list of somewhat random comments

  • Quantifying uncertainty:
    From scipy etc you get parametric model that with assumption of the distribution of the error, and sample distribution of the parameters - that’s how you can get the uncertainty of the parameters (usually following a Gaussian assumption). Using pymc3 you dont need to make those assumption and thus much more flexible. You can find models that are difficult to return uncertainty of your point estimate, but never a problem if you do sampling
  • the point of MCMC
    many times, when you compute the mean of the posterior to get an estimate you get something very close to parametric/frequencies models. But expectation of posterior parameter is only one small aspect of how you can use MCMC samples. With MCMC samples you can plug in to any function to compute it’s expectation over the posterior space, meaning that if you have additional loss function, additional model that defined on the same space, you can just pass the MCMC samples around and just compute the mean afterward.

As for your main question re how to verify uncertainty from pymc3 is correct, the approach is usually to 1, compare with parametric models, 2, compare with carefully hand-tuned HMC running for a very long time. We mostly rely on the experiment from the Stan folks to make sure the implementation is correct.


Thanks for your comment!

Thank you very much for your answers, @junpenglao! I will also try to use Monte Carlo simulation and Bootstrap to estimate the uncertainty and compare with that estimated from pymc3 and scipy.

I wanted to ask follow-up questions regarding the choice of prior.

  1. My parameters are typically positive and within a range [A, B] (govern by physics). Do you think that “B = Uniform(‘B’, lower=A, upper=B)” is an appropriate prior?

  2. Since it is so cheap to perform curve_fit in scipy to estimate B-hat. Do you think it is appropriate to perform curve_fit to estimate B-hat and then choose the prior as “B = Normal(‘B’, mu=B-hat, sd = some #)” for Bayesian fitting?


Well, it’s prior, so if you have information more than the range you should try to code that in!

This is the idea of empirical Bayes - I think it works for some use case but you might want to read a bit more to decide whether you should use it.

1 Like

Thanks a lot, @junpenglao!

If you’re working with fully synthetic data for which you can simulate and run many trials then Cook 2006 introduces a method to evaluate the calibration of a posterior, i.e. whether the posterior properly quantifies uncertainty.

See the Evaluation paragraph in Section 4 here for a succinct summary.

1 Like

Thanks a lot for your suggestions, @gbernstein! My plan is to validate pymc3 on simulated data before using it on the real data.

Thanks for the concise summary below. It is very helpful for me to understand the evaluation procedure.

I have a few follow up questions:

  1. How can I draw the (theta_i, x_i)? I understand the language but how do I put it in python to generate (theta_i, x_i) given a pre-defined prior p(theta) and likelihood p(x|theta)?
  2. What is U_i? Again given F_i(theta) how can I calculate Ui?
  3. What is a good choice for M?

Thanks a lot and really appreciate your feedback!

The paper I linked was for unconditional models, so I’ll switch here to the notation for linear regression with y conditional on x.

  1. This calibration assessment only works on data generated so that you know the true parameters. In the case of linear regression you can use whatever x you want, then for M trials independently generate your own theta_i ~ p(theta) and y_i ~ p(y | x, theta).

  2. Then given the posterior for a trial, p(theta | x_i, y_i), you need to calculate the value of the true parameter theta_i in that posterior’s CDF, i.e. U_i = F_i(theta_i). If you have the closed form CDF you can use numpy’s quantile method. Or if you have enough samples from the posterior you can calculate the empirical quantile as the rank order of theta_i within the sorted samples. One way is U_i = np.sum(samples < true_params_values) / num_samples.

  3. Once you have the U_{1:M} you can get a summary statistic using the Komogorov-Smirnov goodness-of-fit test, which essentially tests the coverage of every credible interval. You can qualitatively examine empirical CDF or quantile-quantile (QQ) plots as in Figure 1b in the linked paper. The larger M the more smooth those curves will be; 20 may start to get you a feel for what the curve will look like, and maybe it will start to clean up around 100 and get extra clean around 500 or 1000 trials? Probably depends on the data and problem.

1 Like

Thank you so much for your answers, @gbernstein! Your answers are very clear and I think I can start implementing the calibration assessment in python now.

Especially thank you for switching the notation to the regression problem. It allows me to understand better and conveniently translate to my problem, which is actually exponential fitting (Y = A + B*exp(-t/tau)) with limited data i.e. from 3 to 10 data points.

A more recent method is to use, which check the uncertainty estimation (in fact the whole posterior distribution) against the latent truth.


Thanks for the reference, @junpenglao. I will read it to see if I could apply to my problem.

@junpenglao thanks for sharing, it discusses a lot of the issues I’ve had using Cook’s test.

The mathematical setup makes sense to me but I’m still trying to suss out though how the quantile calculation differs from Cook’s test. Is the rank statistic calculation (unnumbered equation right after Algorithm 1) any different from Cook’s U_i = np.sum(samples < true_params_values) / num_samples that I noted above? From what I can tell it’s just counting how many posterior samples are below the true value, which is just the quantile of the true value within the posterior CDF. Maybe I’m misunderstanding their f, it wasn’t well defined in their previous sentence.

And are the resulting histograms just a different way to view the quantile-quantile plots in Figure 1b here?

From what I gather (i.e., around figure 1 and 2 in the SBC paper), they are arguing that a rank statistic is more robust at the tail than the quantile test as in Cook et al 2006.

Agreed, but I can’t figure out where there calculation of r differs from the calculation U_i above except for a factor of L.

I have found the author’s implementation of SBC ( and the other implementation on pymc3 ( FYI: I haven’t looked at the code in detail but sometimes it is easier to grasp an idea from a line of code rather than from a lengthy text.


Thanks, good find.

Took a look through the code and there’s a bunch of obfuscation because of class objects but as far as I can tell it boils down to the exact equation I wrote above, just without the scaling to (0,1):

def order_stat(prior_val, posterior_samples):
        return np.sum(prior_val < posterior_samples, axis=0)

From correspondence with the main author:

I think that’s a decent summary of the net result - don’t scale the result to 0,1; be mindful it’s a discrete uniform and visualize with 1 histogram bin per rank.

I guess the differences with Cook are minor but help visualize the calibration analysis.

1 Like

Looks like your equation is in the reverse order as the one you cited above. Does it matter? Thanks.