[Beginner level question on modeling] Bayesian analysis of F1 scores from two ML models

Hi, folks :wave:
I’m new to PyMC and Bayesian modeling/analysis.

I have been trying to use PyMC to run an experiment in which I compare the results from two machine learning models. Specifically, I’m interested in comparing them in terms of their F1 scores (whose values range from 0 to 1, and which I get from running 10-fold cross-validation on both models).

So, to get things going and start making modeling choices, I simulate experiment data as follows:

import numpy as np

def estimate_beta_params(mu, var):
  # method-of-moments conversion from (mean, variance) to Beta(alpha, beta)
  alpha = ((1 - mu) / var - 1 / mu) * mu**2
  beta = alpha * (1 / mu - 1)
  return (alpha, beta)

estimate_beta_params(0.75, 0.1)  # ≈ (0.66, 0.22)

results_from_model_1 = np.random.beta(a=0.65, b=0.22, size=120)
results_from_model_2 = np.random.beta(a=0.65, b=0.22, size=120)
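
As a quick sanity check on that conversion, draws from the fitted Beta should recover roughly the mean and variance that went in (this just reuses estimate_beta_params from above):

alpha, beta = estimate_beta_params(0.75, 0.1)
samples = np.random.beta(a=alpha, b=beta, size=100_000)
print(samples.mean(), samples.var())  # should be close to 0.75 and 0.1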

However, as I mentioned, I’m new to Bayesian modeling in general, so I’m not sure how to fully specify a probabilistic model for the aforementioned comparison.
So far, I have decided to apply Beta priors on the means because their values range from 0 to 1. Additionally, as a simplifying assumption, I assume that the mean of both groups is 0.75 (and the variance is 0.1); see the method above.

import pymc as pm

with pm.Model() as model:
  # given that the values for the means (i.e., F1) range from 0 to 1,
  # apply Beta priors on them
  # also, _arbitrarily_ set the hyperparameters to mu = 0.75 and variance = 0.1,
  # which estimate_beta_params converts to alpha ≈ 0.65 and beta ≈ 0.22
  model1_mean = pm.Beta("model1_mean", alpha=0.65, beta=0.22)
  model2_mean = pm.Beta("model2_mean", alpha=0.65, beta=0.22)

After going over some of the named distributions, I am still not sure how to specify likelihoods. Any suggestions?

Can you provide a sketch of what the data is? Is the data the F1 scores themselves? And what’s the question you are trying to answer?

Yes, the data are F1 scores. You can get a feel for the data by running estimate_beta_params and the simulation below.
For instance (simulating data for a model):

def estimate_beta_params(mu, var):
  alpha = ((1 - mu) / var - 1 / mu) * mu**2
  beta = alpha * (1 / mu - 1)
  return (alpha, beta)

estimate_beta_params(0.75, 0.1)  # ≈ (0.66, 0.22)

results_from_model_1 = np.random.beta(a=0.65, b=0.22, size=120)
results_from_model_1

would result in an array like the following (the exact values vary from run to run):

array([0.64012898, 0.99623235, 0.42080934, 0.99999896, 0.22052374,
       0.16008201, 0.26732771, 0.96165433, 0.2073106 , 0.73644397,
       0.32282837, 0.74649128, 0.93408431, 0.99145974, 0.9902438 ,
       0.24905937, 0.98678771, 0.91015317, 0.36951003, 0.0383655 ,
       0.99196919, 0.99976532, 0.99786201, 0.99241236, 0.97478833,
       0.38688449, 0.98292308, 0.83390483, 0.99999994, 0.85605049,
       0.85979515, 0.26567911, 0.99999959, 0.9961947 , 0.03926723,
       0.99995637, 0.93542531, 1.        , 0.33429166, 0.44179379,
       0.55468918, 0.56027303, 0.96545908, 0.99802973, 0.9995173 ,
       0.94268163, 0.91577903, 0.34161971, 0.89171426, 0.19107768,
       0.46173997, 0.06780729, 0.9986046 , 0.91034444, 0.88941731,
       0.27467245, 1.        , 0.14479665, 0.26347849, 0.95849152,
       0.30109627, 0.99392332, 0.31133103, 0.99843967, 0.94574275,
       0.88638523, 0.35566542, 0.44832093, 0.95193456, 1.        ,
       0.95068676, 0.98895839, 0.00877955, 0.58098781, 0.99863503,
       0.54451651, 0.26174459, 0.99966453, 0.72931634, 0.76719711,
       0.1639244 , 0.82652736, 0.06960771, 0.82958025, 0.8691868 ,
       0.69978128, 0.33643546, 0.05567993, 0.99961485, 0.87182994,
       0.79650009, 0.94770892, 0.78782211, 0.12369101, 0.99956335,
       0.06348011, 0.99433423, 0.99960905, 0.03384506, 0.36287513,
       0.47959614, 0.9921098 , 0.99027691, 0.95916418, 0.83623727,
       0.85359268, 0.99999792, 0.99575783, 0.98885907, 0.99937895,
       0.26840522, 0.95405393, 0.67123609, 0.99999049, 0.73267442,
       0.98704977, 0.329102  , 0.99886656, 0.99993992, 0.9182726 ])

(@cluhmann although the above is synthetic or “fake” data, that’s pretty much what I’m planning to get from running a 10-fold cross validation for each ML model. )

Essentially, I’ll have 120 F1 scores for each model. I want to compare these two models in terms of the average F1 score they yield. The assumption here is that the ML model that tends to yield higher F1 scores is the more appropriate one.

So basically the question I’m trying to answer is “how does model 1 stack up against model 2 (in terms of F1)?”

I guess you can do something like this:

import pymc as pm

with pm.Model() as model:
    a1 = pm.Exponential("a1", 1)
    a2 = pm.Exponential("a2", 1)
    b1 = pm.Exponential("b1", 1)
    b2 = pm.Exponential("b2", 1)
    # model1_f1s and model2_f1s are the arrays of observed F1 scores
    model1_like = pm.Beta("model1_like", alpha=a1, beta=b1, observed=model1_f1s)
    model2_like = pm.Beta("model2_like", alpha=a2, beta=b2, observed=model2_f1s)

That being said, this is still pretty much a black-box model that captures nothing of the process by which the data (the F1 scores) were generated. Thus, I’m not sure this model gets you much above and beyond a variety of much simpler approaches (e.g., a t-test, or an effect size such as the probability of superiority, etc.).
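
If you do go this route, one straightforward way to turn the posterior into a comparison is to sample and look at the implied mean of each Beta, which is alpha / (alpha + beta). A minimal sketch, assuming PyMC v4+ (where pm.sample returns InferenceData) and reusing model, model1_f1s, and model2_f1s from above:

with model:
    idata = pm.sample()

post = idata.posterior
# the mean of a Beta(alpha, beta) is alpha / (alpha + beta)
mean1 = post["a1"] / (post["a1"] + post["b1"])
mean2 = post["a2"] / (post["a2"] + post["b2"])

# posterior probability that model 1's mean F1 exceeds model 2's
print(float((mean1 > mean2).mean()))

You could also wrap those means in pm.Deterministic inside the model so they show up in the trace directly.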

Wow! That was really helpful. :clap:
Honestly, I got so carried away by the prospect of getting a better explanation for the difference in performance that I didn’t really think about the fact that I was trying to model the wrong part of the problem. :smiling_face_with_tear:

I assumed that as long as I managed to come up with decent priors (not based on the values in the data, but simply on what I know about the data beforehand, that is, on what F1 scores tend to be like), the posterior would give me the probability of the parameters conditional on the data (the F1 scores passed as observed), and so I’d end up with a better representation of what the F1 scores from these models look like. So, basically, I thought I could come up with a better answer by betting on reasonable priors and letting the model condition on the results.

This isn’t necessarily wrong. But maybe you can ask yourself some questions to see what sorts of insights you might expect from any particular modeling endeavor. If you calculated the mean of the F1 scores from one of your models, do you believe it’s a decent approximation of that model’s “true” F1 score? If so, why are you searching for “a better representation”? If not, why not?

The standard scenario in which modeling (of any sort) becomes useful/necessary is when the answers are beyond the data you have on hand. Maybe I have measurements of rainfall at various locations, but I know the measurement sensors will fail to produce readings once temperatures are above 40 degrees (biasing the means downward). Or maybe one of your models has very good behavior on some data sets (folds) and very poor on others for reasons you think you understand. In each of these cases, your model would seek to describe some unobservable thing (censoring in the former case, data “problems” in the latter) in order to better make sense of the data you do observe.
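
Just to make the censoring example a bit more concrete: the temperature-driven dropout above is messier, but for the simpler (hypothetical) case where a sensor just tops out at a known reading, something like pm.Censored expresses the idea directly. The cutoff and recorded_rainfall values below are made-up placeholders:

import numpy as np
import pymc as pm

cutoff = 50.0  # hypothetical upper limit the sensor can record
# made-up readings; values equal to the cutoff really mean "at least 50"
recorded_rainfall = np.array([12.3, 7.1, 50.0, 22.5, 50.0, 3.8])

with pm.Model() as rainfall_model:
    mu = pm.Normal("mu", 20, 20)        # placeholder priors, not tuned
    sigma = pm.HalfNormal("sigma", 20)
    # latent (uncensored) rainfall distribution
    rainfall = pm.Normal.dist(mu=mu, sigma=sigma)
    # pm.Censored accounts for the clipping of readings at the cutoff
    obs = pm.Censored("obs", rainfall, lower=None, upper=cutoff,
                      observed=recorded_rainfall)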

Hope that helps contextualize my previous comments.
