[Beginner-level question on modeling] Bayesian analysis of F1 scores from two ML models

Yes, the data are F1 scores. You can get a feel for them by running the estimate_beta_params function below.
For instance (simulating data for one model):

import numpy as np

def estimate_beta_params(mu, var):
    # Method-of-moments estimates of Beta(alpha, beta) from a mean and variance.
    # Requires var < mu * (1 - mu) for the estimates to be positive.
    alpha = ((1 - mu) / var - 1 / mu) * mu**2
    beta = alpha * (1 / mu - 1)
    return (alpha, beta)

estimate_beta_params(0.75, 0.1)  # -> (0.65625, 0.21875)

results_from_model_1 = np.random.beta(a=0.65, b=0.22, size=120)
results_from_model_1

This would produce an array like the following (the exact values depend on the random seed):

array([0.64012898, 0.99623235, 0.42080934, 0.99999896, 0.22052374,
       0.16008201, 0.26732771, 0.96165433, 0.2073106 , 0.73644397,
       0.32282837, 0.74649128, 0.93408431, 0.99145974, 0.9902438 ,
       0.24905937, 0.98678771, 0.91015317, 0.36951003, 0.0383655 ,
       0.99196919, 0.99976532, 0.99786201, 0.99241236, 0.97478833,
       0.38688449, 0.98292308, 0.83390483, 0.99999994, 0.85605049,
       0.85979515, 0.26567911, 0.99999959, 0.9961947 , 0.03926723,
       0.99995637, 0.93542531, 1.        , 0.33429166, 0.44179379,
       0.55468918, 0.56027303, 0.96545908, 0.99802973, 0.9995173 ,
       0.94268163, 0.91577903, 0.34161971, 0.89171426, 0.19107768,
       0.46173997, 0.06780729, 0.9986046 , 0.91034444, 0.88941731,
       0.27467245, 1.        , 0.14479665, 0.26347849, 0.95849152,
       0.30109627, 0.99392332, 0.31133103, 0.99843967, 0.94574275,
       0.88638523, 0.35566542, 0.44832093, 0.95193456, 1.        ,
       0.95068676, 0.98895839, 0.00877955, 0.58098781, 0.99863503,
       0.54451651, 0.26174459, 0.99966453, 0.72931634, 0.76719711,
       0.1639244 , 0.82652736, 0.06960771, 0.82958025, 0.8691868 ,
       0.69978128, 0.33643546, 0.05567993, 0.99961485, 0.87182994,
       0.79650009, 0.94770892, 0.78782211, 0.12369101, 0.99956335,
       0.06348011, 0.99433423, 0.99960905, 0.03384506, 0.36287513,
       0.47959614, 0.9921098 , 0.99027691, 0.95916418, 0.83623727,
       0.85359268, 0.99999792, 0.99575783, 0.98885907, 0.99937895,
       0.26840522, 0.95405393, 0.67123609, 0.99999049, 0.73267442,
       0.98704977, 0.329102  , 0.99886656, 0.99993992, 0.9182726 ])
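
As a quick sanity check, feeding the sample mean and variance of the simulated scores back into estimate_beta_params should roughly recover the a and b used above (only roughly, since these are noisy estimates from 120 draws):

mu_hat = results_from_model_1.mean()
var_hat = results_from_model_1.var()
estimate_beta_params(mu_hat, var_hat)  # roughly (0.65, 0.22), up to sampling noise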

(@cluhmann: although the above is synthetic or "fake" data, it's pretty much what I'm planning to get from running 10-fold cross-validation for each ML model.)

Essentially, I'll have 120 F1 scores for each model. I want to compare the two models in terms of the average F1 score they yield; the assumption is that the model that tends to yield higher F1 scores is the more appropriate one.

So, basically, the question I'm trying to answer is: "how does model 1 stack up against model 2 (in terms of F1)?"
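
To make the question concrete, here is a minimal sketch of the kind of model I have in mind, assuming PyMC and a Beta likelihood for each model's scores (the variable names, priors, and the hypothetical results_from_model_2 array are just placeholders, not a settled choice). The clipping is there because a Beta likelihood assigns zero density to scores of exactly 0 or 1, and the simulated array above contains exact 1.0s:

import numpy as np
import pymc as pm

# Clip away exact 0s/1s, which have zero density under a Beta likelihood
eps = 1e-6
f1_m1 = np.clip(results_from_model_1, eps, 1 - eps)
f1_m2 = np.clip(results_from_model_2, eps, 1 - eps)  # hypothetical second array

with pm.Model():
    # Mean F1 and concentration for each model (placeholder priors)
    mu = pm.Beta("mu", alpha=1.0, beta=1.0, shape=2)
    kappa = pm.HalfNormal("kappa", sigma=10.0, shape=2)

    # Beta likelihoods reparameterized by mean and concentration:
    # alpha = mu * kappa, beta = (1 - mu) * kappa, so E[F1] = mu
    pm.Beta("obs_m1", alpha=mu[0] * kappa[0], beta=(1 - mu[0]) * kappa[0], observed=f1_m1)
    pm.Beta("obs_m2", alpha=mu[1] * kappa[1], beta=(1 - mu[1]) * kappa[1], observed=f1_m2)

    # The quantity of interest: difference in mean F1
    diff = pm.Deterministic("diff", mu[0] - mu[1])
    idata = pm.sample()

# Posterior probability that model 1 yields a higher mean F1 than model 2
prob_m1_better = (idata.posterior["diff"] > 0).mean().item()

Does that look like a sensible way to frame "model 1 vs. model 2 in terms of average F1," or is there a more standard approach?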