Hi all!
TL;DR
When I’m working with a model that I know is too simple (either because I’m at the first stages of an iterative process or because of external constraints), I would like to incorporate my belief about the model’s inaccuracy into the estimates somehow, something like widening credible intervals to account for that distrust.
How would you go about it? I think this is not part of standard practice, perhaps because it can easily slip into arbitrary decision making, but there may be more to it. What’s your take?
Toy example
Let’s say I’m modelling delays, and I’m working with the simplest model possible: a fixed-scale exponential. I’m quite certain that the scale is actually time-varying, but I don’t have the time or computational resources (imagine it’s not just an exponential but a much more complex model to begin with) to handle that.
For simplicity, in the example below delays come from just two different scales, but I’m using a single scale to model them. The idea: the model will be wrong, but not more wrong than having no model at all, so it makes sense to deploy it first and take advantage of its value while better approximations are developed.
Model code:
import numpy as np
import pymc as pm
import matplotlib.pyplot as plt

plt.style.use('ggplot')
np.random.seed(42)

# True data generating process: a 50/50 mixture of two exponential scales.
scale1 = 1
scale2 = 5
samples1 = np.random.exponential(scale1, 500)
samples2 = np.random.exponential(scale2, 500)
all_samples = np.concatenate([samples1, samples2])

# Deliberately oversimplified model: a single fixed scale for all delays.
with pm.Model() as model:
    scale = pm.HalfNormal("scale", 10)
    obs = pm.Exponential("obs", scale=scale, observed=all_samples)
    trace = pm.sample()
Scale posterior:
pm.plot_posterior(trace, var_names=["scale"])
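As a quick sanity check of where this posterior should land (my own aside, not part of the fit): the maximum-likelihood scale of an exponential is the sample mean, and the mean of this 50/50 mixture is (1 + 5) / 2 = 3, so a posterior concentrating near 3 is exactly what the misspecified model should report.

# Sanity check: an exponential's ML scale estimate is the sample mean,
# and the 50/50 mixture has mean (scale1 + scale2) / 2 = 3.
print(all_samples.mean())  # ~3, right where the posterior concentrates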
The true scales are 1 and 5, while the model is quite sure that the scale is between 2.5 and 3. To me this is perfectly fine: the model’s claim is not absolute but relative, that is, conditional on the parametrisation being right. “If this is the data generating process, then the scale parameter must be 2.5-3.2.” The core of my question has to do with reframing the condition underlying that statement (spoiler: I want probabilities conditional on my overall beliefs, not just on the model assumptions). This is easier to see in the posterior predictive.
with model:
    posterior = pm.sample_posterior_predictive(trace)

ax = pm.plot_ppc(posterior, kind="cumulative", num_pp_samples=100)

# Empirical fraction of observed delays below 10.
cdf_at_10 = (posterior.observed_data["obs"] < 10).mean()

# Mark the threshold and the empirical CDF value on the plot.
ax.axvline(10, linestyle="--")
ax.axhline(cdf_at_10.item(), linestyle="--")
Let’s say I’m concerned with the probability of delays of less than 10 (let’s say minutes). Once again, the model is quite confident that the probability is between 0.94 and 0.96, when it’s actually about 0.93. I see no problem with this: the probabilities are conditional on the model assumptions being right, which they are not.
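For reference, the true probability can be computed directly from the mixture (my own check; scipy is an extra import not used above):

from scipy.stats import expon

# True P(delay < 10) under the 50/50 mixture of the two exponential scales.
p_true = 0.5 * expon.cdf(10, scale=scale1) + 0.5 * expon.cdf(10, scale=scale2)
print(p_true)  # ~0.932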
However, I want to use the model in production nonetheless, because it’s still useful. I’m not worried about the predictions being biased, but I don’t want them to express overconfidence because they fail to take into account the inaccuracy introduced by the simplification. I’m not interested in predictions conditional on the model’s assumptions alone; I want them conditional on my beliefs about the validity of those assumptions as well.
In summary, I don’t trust the model I made, and I want to reflect that distrust somehow in the predictions I get. I think this should translate into wider credible intervals for the predictions.
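To make the kind of thing I’m after concrete, here is one crude device I can imagine (a sketch, not an established recipe; the trust weight w_trust, the inflation factor k, and the choice of the vague component are all arbitrary assumptions of mine): treat my trust in the model as a mixture weight between its posterior predictive and a deliberately vaguer alternative, which is essentially a poor man’s model averaging with a fixed weight.

rng = np.random.default_rng(0)

# Hypothetical "trust" weight: the probability I assign to the model being adequate.
w_trust = 0.8
# Hypothetical inflation factor for the vague alternative's scale.
k = 2.0

# Posterior predictive draws from the fitted model, flattened across chains/draws.
pp = posterior.posterior_predictive["obs"].values.ravel()

# A deliberately vaguer alternative: same family, scale inflated by k.
inflated = rng.exponential(np.mean(pp) * k, size=pp.size)

# Predictive mixture: draw from the model with probability w_trust,
# otherwise from the vague alternative.
use_model = rng.random(pp.size) < w_trust
robust_pred = np.where(use_model, pp, inflated)

print(np.quantile(pp, [0.05, 0.95]))           # original 90% predictive interval
print(np.quantile(robust_pred, [0.05, 0.95]))  # wider, distrust-adjusted interval

With w_trust = 1 this reproduces the original posterior predictive, and lowering it widens the intervals, but the weight and the alternative are picked out of thin air, which is exactly the arbitrariness I’d like to avoid. Is there a more principled version of this?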
I think that, historically, prudence, as one of the cardinal virtues, was the art of incorporating disbelief in one’s own assumptions into one’s decisions. I’m looking for a way to make this more explicit within a Bayesian framework.
Regards,
Juan.