Is it getting better or do you feel the same?: How to evaluate a model by estimating log probability of ground truth in posterior trace?

I have a model M that estimates S, an array of 100 elements in [0.0, 1.0]. I also have a synthetically-generated dataset D that is used to test the model. D includes the true (ground truth) value for S. I would like to measure the accuracy of M by comparing S_M to S_D, i.e. by evaluating a function accuracy_model() that measures the distance from S_M to S_D. And when I consider a change to M, I would like to use accuracy_model() to see if the potential change makes M more accurate or less accurate.

S_D is an array of floats, but S_M is an array of distributions; each element of S_M is a distribution of floats, with the posterior sampled as an approximation of that distribution. I think the best way to measure the distance from an individual element of S_D (e.g. S_D[1]) to the corresponding element of S_M (e.g. S_M[1]) is to find the log probability of S_D[1] within the distribution S_M[1]. Then the best measure of the total distance from S_D to S_M is the sum of those log probabilities of each individual element. (If my accuracy measure is flawed, please tell me.)
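Concretely, the measure I have in mind could be sketched like this. (This is my own rough sketch, not PyMC3 API; `accuracy_model`, the array shapes, and the use of a KDE to smooth the posterior samples into a density are all assumptions for illustration.)

```python
import numpy as np
from scipy.stats import gaussian_kde

def accuracy_model(posterior_samples, ground_truth):
    """Sum of estimated log densities of each ground-truth value
    under the corresponding element's posterior samples.

    posterior_samples: array of shape (n_samples, n_elements)
    ground_truth: array of shape (n_elements,)
    """
    total = 0.0
    for i, truth in enumerate(ground_truth):
        # smooth the posterior samples for element i into a density
        kde = gaussian_kde(posterior_samples[:, i])
        total += np.log(kde(truth)[0])
    return total

# toy check: a posterior concentrated on the truth should score higher
rng = np.random.default_rng(0)
truth = np.full(100, 0.5)
good = rng.normal(0.5, 0.05, size=(2000, 100))   # centered on the truth
bad = rng.normal(0.6, 0.05, size=(2000, 100))    # centered 2 sigma away
print(accuracy_model(good, truth), accuracy_model(bad, truth))
```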

Does pymc3 have support for measuring the log probability of a quantity (i.e. S_D) against the posterior trace (i.e. S_M)?


If I am understanding correctly, you are looking to treat your model as an estimator and evaluate it accordingly. This seems fine, and there are techniques for doing so, but the approach you describe isn’t particularly Bayesian. And this isn’t just a bit of tribalism. If you try to do this bit:

…to find the log probability of S_D[1] within the distribution S_M[1]

You’ll likely be disappointed because the probability will be zero (or “vanishingly small” if you want to be technical).

There are various ways to extract a single \hat{x} (or \hat{S} in your case) that you can take as your estimate (disposing of the full posterior distribution) which can then be compared to your ground truth x (i.e. S_D). This isn’t my area of expertise, but you can check out the Bayesian estimator Wikipedia page for some basic info.
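As a rough illustration of what I mean (not a definitive recipe; `point_estimate_error` and the array shapes are made up here): collapse each element’s posterior to a single number, e.g. the posterior mean, which is the Bayes estimator under squared-error loss, and then compare those point estimates to the ground truth.

```python
import numpy as np

def point_estimate_error(posterior_samples, ground_truth):
    """Collapse each element's posterior to its mean (the Bayes
    estimator under squared-error loss) and report the RMSE
    against the ground truth.

    posterior_samples: shape (n_samples, n_elements)
    ground_truth: shape (n_elements,)
    """
    s_hat = posterior_samples.mean(axis=0)   # one number per element
    return np.sqrt(np.mean((s_hat - ground_truth) ** 2))

rng = np.random.default_rng(0)
truth = rng.uniform(0, 1, size=100)
samples = truth + rng.normal(0, 0.02, size=(1000, 100))
print(point_estimate_error(samples, truth))  # small when the posterior is centered on the truth
```

Under absolute-error loss you would use the posterior median instead of the mean; the choice of point estimate follows from the choice of loss.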

An even more Bayesian approach would be to define a loss/cost/objective function that quantifies the cost of each possible kind of error you might make (e.g., when making some sort of decision based on your model’s posterior), determine the probability of making each type of error (provided by your posterior), and then find the expected loss associated with your model (and compare it to the loss expected under some alternative model).
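A minimal sketch of that idea, using Monte Carlo draws from the posterior (the function names and the asymmetric loss are hypothetical, just to make the point concrete):

```python
import numpy as np

def expected_loss(posterior_samples, loss, action):
    """Monte Carlo estimate of E[loss(theta, action)] under the
    posterior: average the loss over posterior draws.

    posterior_samples: shape (n_samples,), draws for one element
    loss: callable loss(true_value, action) -> cost
    action: the decision/estimate being evaluated
    """
    return np.mean([loss(theta, action) for theta in posterior_samples])

# hypothetical asymmetric loss: overestimating costs 3x underestimating
def asymmetric_loss(theta, a):
    return 3.0 * (a - theta) if a > theta else (theta - a)

rng = np.random.default_rng(0)
draws = rng.beta(8, 4, size=2000)  # a posterior concentrated near 0.67
# compare two candidate actions under the same posterior
print(expected_loss(draws, asymmetric_loss, 0.5))
print(expected_loss(draws, asymmetric_loss, 0.9))
```

The action (or model) with the lower expected loss is the one you prefer, and the comparison automatically accounts for near misses as well as large errors.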


Thanks Christian. (And also thank you for all you did putting together PyMCon. It was interesting and useful. I look forward to attending PyMCon 2 in person, after the virus abates.)

But I want to do exactly what you advise, if I understand you correctly. I want an objective function for an individual element of S to be the probability density of that element of S_M at that element of S_D, as evaluated by the posterior. So if the posterior of S_M for element 1 happened to look like this:

and the ground truth value for element 1 happened to be 0.6, then my objective function for element 1 would be roughly 0.46.

If the posterior of S_M for element 1 of an alternative model happened to look like this:

the objective function of the same ground truth value for element 1 would be roughly 0.011.

Of course the alternative model may score better on the other 99 elements. So I was planning to sum the logs of the probability densities, for an overall objective function.
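In code, what I’m imagining is something like this. (My own sketch, not PyMC3 API; it assumes each element’s posterior is well summarized by a Beta distribution fitted by method of moments, like the ones in my plots.)

```python
import numpy as np
from scipy.stats import beta

def log_density_score(posterior_samples, ground_truth):
    """Fit a Beta to each element's posterior samples (method of
    moments) and sum the log densities at the ground-truth values.
    Assumes each element lives in (0, 1), as in the question.

    posterior_samples: shape (n_samples, n_elements)
    ground_truth: shape (n_elements,)
    """
    total = 0.0
    for samples, truth in zip(posterior_samples.T, ground_truth):
        m, v = samples.mean(), samples.var()
        common = m * (1 - m) / v - 1          # method-of-moments factor
        a, b = m * common, (1 - m) * common
        total += beta.logpdf(truth, a, b)
    return total

rng = np.random.default_rng(0)
samples = rng.beta(8, 4, size=(2000, 100))   # stand-in for a posterior trace
truth = np.full(100, 0.6)
print(log_density_score(samples, truth))
```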

But perhaps I misunderstand your suggestion?

The probability that your parameter (i.e., S) takes on a value of exactly 0.6 (i.e., 0.60000…) is zero. It makes sense to talk about the value of a random variable falling within some interval (e.g., 0.6 \pm .01 or 0.6 \pm .00001), but the probability of it taking on any one of its infinite possible values is zero.

But this leads us back toward the problem that loss functions are designed to address. If the true value of your parameter is 0.6, how bad would it be to erroneously conclude that the value was instead 0.60001? What if you erroneously conclude it’s 0.7? Or 0.01? Exclusively focusing on the inferences that yield the correct parameter value (e.g. 0.6) means that you end up ignoring all the potentially incorrect parameter values that you could instead infer, both the near misses and the ones that are way off.

If the true value of some parameter is 0.6, which of these two posteriors would you prefer your model to produce?

The former has non-zero density near 0.6. As a result, your scheme would score it “better” than the latter. With the latter, all credible values are incorrect, but pretty close.


Nice graphical examples!

And you raise an important point. I had imagined the posterior to be something well-behaved, like the beta distributions in my examples, in which the probability density at a particular point is a good objective function. But that’s a misconception. In general the posterior will be something noisy, like what you showed.

Thank you for your help.
