I understand that curve fitting with pymc3 can provide model uncertainty, but how do I verify that the uncertainty given by pymc3 is correct? Are there any examples or studies that demonstrate that the uncertainty estimated by pymc3 is accurate?
I have compared pymc3 linear regression with a non-linear least squares fit from scipy, and the results are shown below.
The parameters (slope and intercept) are almost identical. The scipy fit can also provide one-standard-deviation errors on the parameters, and these are similar to the uncertainties estimated by pymc3. See the image below:
Note that pymc3 took almost 4 minutes while scipy took only 2 ms. I thought the only reason to use pymc3 would be if it could provide a more accurate uncertainty estimate, but I haven't found any rigorous study showing how accurate pymc3's uncertainty measure is. Any suggestions would be really appreciated.
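For reference, here is a minimal sketch of the kind of comparison I ran, with synthetic data and placeholder priors standing in for my actual data and model:

```python
import numpy as np
import pymc3 as pm
from scipy.optimize import curve_fit

# Synthetic straight-line data standing in for the real dataset
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=x.size)

# scipy: point estimates plus one-standard-deviation errors from the covariance
popt, pcov = curve_fit(lambda x, m, b: m * x + b, x, y)
perr = np.sqrt(np.diag(pcov))
print("curve_fit:", popt, "+/-", perr)

# pymc3: posterior samples for the same model (these priors are assumptions)
with pm.Model():
    m = pm.Normal("slope", mu=0, sd=10)
    b = pm.Normal("intercept", mu=0, sd=10)
    sigma = pm.HalfNormal("sigma", sd=5)
    pm.Normal("y", mu=m * x + b, sd=sigma, observed=y)
    trace = pm.sample(2000, tune=1000)

print(pm.summary(trace))  # posterior means and standard deviations
```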
Monte Carlo methods simulate the probability distribution of the parameters, and we can calculate the variance of that distribution afterwards. If there is an equation that can directly and easily calculate the uncertainty of each parameter, there is no point in using Monte Carlo methods.
For example, for the task of fitting a line to points, the closed-form equations are better than pymc.
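For the straight-line case, the standard closed-form least-squares formulas look like this (a sketch under a Gaussian-error assumption):

```python
import numpy as np

def ols_line_fit(x, y):
    """Slope/intercept of y = m*x + b with their one-standard-deviation errors."""
    n = len(x)
    x_bar, y_bar = x.mean(), y.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    m = np.sum((x - x_bar) * (y - y_bar)) / Sxx
    b = y_bar - m * x_bar
    s2 = np.sum((y - (m * x + b)) ** 2) / (n - 2)       # residual variance
    se_m = np.sqrt(s2 / Sxx)                            # SE of the slope
    se_b = np.sqrt(s2 * (1.0 / n + x_bar ** 2 / Sxx))   # SE of the intercept
    return (m, se_m), (b, se_b)
```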
There are many things to unpack here, so here is a list of somewhat random comments.
Quantifying uncertainty:
From scipy etc. you get a parametric model that, under an assumption about the distribution of the errors, gives a sampling distribution for the parameters; that is how you get the uncertainty of the parameters (usually under a Gaussian assumption). With pymc3 you don't need to make those assumptions, which makes it much more flexible. There are models for which it is difficult to attach an uncertainty to a point estimate, but that is never a problem if you do sampling.
The point of MCMC:
Many times, when you compute the mean of the posterior to get an estimate, you get something very close to parametric/frequentist models. But the expectation of the posterior parameters is only one small aspect of how you can use MCMC samples. You can plug MCMC samples into any function to compute its expectation over the posterior space, meaning that if you have an additional loss function, or an additional model defined on the same space, you can just pass the MCMC samples around and compute the mean afterwards.
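For example, if slope_samples and intercept_samples are the MCMC draws and f is some additional loss function (a hypothetical one here), its posterior expectation is just the average of f over the draws:

```python
import numpy as np

def posterior_expectation(f, slope_samples, intercept_samples):
    """Monte Carlo estimate of E[f(slope, intercept)] under the posterior."""
    return np.mean([f(m, b) for m, b in zip(slope_samples, intercept_samples)])

# Example: posterior expectation of an absolute-error loss at a new point x_new
# (x_new and y_target are placeholders).
# expected_loss = posterior_expectation(
#     lambda m, b: abs(m * x_new + b - y_target), slope_samples, intercept_samples)
```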
As for your main question about how to verify that the uncertainty from pymc3 is correct, the usual approach is to (1) compare with parametric models, and (2) compare with carefully hand-tuned HMC run for a very long time. We mostly rely on the experiments from the Stan folks to make sure the implementation is correct.
Thank you very much for your answers, @junpenglao! I will also try to use Monte Carlo simulation and Bootstrap to estimate the uncertainty and compare with that estimated from pymc3 and scipy.
I wanted to ask follow-up questions regarding the choice of prior.
My parameters are typically positive and within a range [A, B] (governed by physics). Do you think that B = Uniform('B', lower=A, upper=B) is an appropriate prior?
Since it is so cheap to perform curve_fit in scipy to estimate B-hat, do you think it is appropriate to run curve_fit to estimate B-hat and then choose the prior B = Normal('B', mu=B-hat, sd=some value) for the Bayesian fit?
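To make the two options concrete, here is a rough sketch of what I mean (A_low, B_high, B_hat, and prior_sd are placeholders for the physical bounds, the curve_fit estimate, and some chosen width):

```python
import pymc3 as pm

# Option 1: a Uniform prior over the physically allowed range [A_low, B_high]
with pm.Model():
    B = pm.Uniform("B", lower=A_low, upper=B_high)

# Option 2: a Normal prior centered on the scipy curve_fit estimate B_hat
with pm.Model():
    B = pm.Normal("B", mu=B_hat, sd=prior_sd)
```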
If you're working with fully synthetic data, for which you can simulate and run many trials, then Cook 2006 introduces a method to evaluate the calibration of a posterior, i.e. whether the posterior properly quantifies uncertainty.
See the Evaluation paragraph in Section 4 here for a succinct summary.
Thanks a lot for your suggestions, @gbernstein! My plan is to validate pymc3 on simulated data before using it on the real data.
Thanks for the concise summary below. It is very helpful for me to understand the evaluation procedure.
I have a few follow up questions:
How can I draw the (theta_i, x_i)? I understand the language but how do I put it in python to generate (theta_i, x_i) given a pre-defined prior p(theta) and likelihood p(x|theta)?
What is U_i? Again, given F_i(theta), how can I calculate U_i?
The paper I linked was for unconditional models, so I'll switch here to the notation for linear regression with y conditional on x.
This calibration assessment only works on data generated so that you know the true parameters. In the case of linear regression you can use whatever x you want, then for M trials independently generate your own theta_i ~ p(theta) and y_i ~ p(y | x, theta).
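In Python, that generation step might look something like this for the linear-regression example (the priors and noise level here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

M = 500                      # number of calibration trials
x = np.linspace(0, 1, 10)    # fixed design points, chosen arbitrarily

trials = []
for i in range(M):
    # Draw the "true" parameters theta_i from the prior p(theta)
    slope = rng.normal(0, 1)
    intercept = rng.normal(0, 1)

    # Generate data y_i from the likelihood p(y | x, theta), with known noise sd
    y = slope * x + intercept + rng.normal(0, 0.5, size=x.shape)

    trials.append(((slope, intercept), y))
```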
Then, given the posterior for a trial, p(theta | x_i, y_i), you need to calculate the value of the true parameter theta_i in that posterior's CDF, i.e. U_i = F_i(theta_i). If you have the closed-form CDF you can use numpy's quantile method. Or, if you have enough samples from the posterior, you can calculate the empirical quantile as the rank order of theta_i within the sorted samples. One way is U_i = np.sum(samples < true_params_values) / num_samples.
Once you have U_{1:M} you can get a summary statistic using the Kolmogorov-Smirnov goodness-of-fit test, which essentially tests the coverage of every credible interval. You can also qualitatively examine the empirical CDF or quantile-quantile (QQ) plots, as in Figure 1b of the linked paper. The larger M is, the smoother those curves will be; 20 trials may start to give you a feel for what the curve will look like, it will maybe start to clean up around 100, and get extra clean around 500 or 1000 trials. It probably depends on the data and the problem.
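Putting the last two steps together, a minimal sketch (U_values is the collection of U_i computed as above):

```python
import numpy as np
from scipy import stats

U = np.asarray(U_values)     # the M quantiles U_i = F_i(theta_i), one per trial

# Kolmogorov-Smirnov test against Uniform(0, 1): a well-calibrated posterior
# should make the U_i indistinguishable from uniform draws.
ks_stat, p_value = stats.kstest(U, "uniform")

# Data for a QQ plot: sorted empirical quantiles vs. the uniform reference line
uniform_grid = (np.arange(1, len(U) + 1) - 0.5) / len(U)
empirical = np.sort(U)
# e.g. plt.plot(uniform_grid, empirical); plt.plot([0, 1], [0, 1], "--")
```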
Thank you so much for your answers, @gbernstein! Your answers are very clear and I think I can start implementing the calibration assessment in python now.
Especially thank you for switching the notation to the regression problem. It helps me understand better and translate conveniently to my own problem, which is actually exponential fitting (Y = A + B*exp(-t/tau)) with limited data, i.e. 3 to 10 data points.
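For concreteness, a rough pymc3 sketch of that exponential model might look like this (the priors and bounds are placeholders; t and y_obs stand for the measured data):

```python
import pymc3 as pm

with pm.Model() as exp_model:
    # Uniform priors over physically motivated ranges (bounds are placeholders)
    A = pm.Uniform("A", lower=0.0, upper=10.0)
    B = pm.Uniform("B", lower=0.0, upper=10.0)
    tau = pm.Uniform("tau", lower=0.01, upper=100.0)
    sigma = pm.HalfNormal("sigma", sd=1.0)

    mu = A + B * pm.math.exp(-t / tau)
    pm.Normal("Y", mu=mu, sd=sigma, observed=y_obs)

    # With only 3-10 data points, a higher target_accept can help sampling
    trace = pm.sample(2000, tune=2000, target_accept=0.95)
```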
A more recent method is Simulation-Based Calibration (SBC, https://arxiv.org/abs/1804.06788), which checks the uncertainty estimation (in fact, the whole posterior distribution) against the latent truth.
@junpenglao thanks for sharing, it discusses a lot of the issues I've had using Cook's test.
The mathematical setup makes sense to me, but I'm still trying to suss out how the quantile calculation differs from Cook's test. Is the rank statistic calculation (the unnumbered equation right after Algorithm 1) any different from Cook's U_i = np.sum(samples < true_params_values) / num_samples that I noted above? From what I can tell it's just counting how many posterior samples are below the true value, which is just the quantile of the true value within the posterior CDF. Maybe I'm misunderstanding their f; it wasn't well defined in the preceding sentence.
And are the resulting histograms just a different way to view the quantile-quantile plots in Figure 1b here?
From what I gather (i.e., from around Figures 1 and 2 in the SBC paper), they are arguing that the rank statistic is more robust in the tails than the quantile test of Cook et al. 2006.
Took a look through the code and there's a bunch of obfuscation because of class objects, but as far as I can tell it boils down to the exact equation I wrote above, just without the scaling to (0,1):
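Something like this, in numpy terms (posterior_samples and true_param_value are placeholders):

```python
import numpy as np

# Rank of the true value among the L posterior draws, left as an integer
# in {0, ..., L} rather than scaled to (0, 1)
rank = np.sum(posterior_samples < true_param_value)
```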
I think that's a decent summary of the net result: don't scale the result to (0,1), be mindful that it's a discrete uniform, and visualize with one histogram bin per rank.
I guess the differences with Cook are minor but help visualize the calibration analysis.
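For reference, a minimal sketch of that rank-histogram diagnostic (ranks is the array of per-trial ranks and L the number of posterior draws per trial):

```python
import numpy as np
import matplotlib.pyplot as plt

# One bin per possible rank value 0, ..., L; a calibrated posterior gives a
# (noisy) flat histogram, while systematic shapes indicate miscalibration.
plt.hist(ranks, bins=np.arange(L + 2) - 0.5, edgecolor="k")
plt.xlabel("rank of true value among posterior draws")
plt.ylabel("count")
plt.show()
```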