[Beginner level question on modeling] Bayesian analysis of F1 scores from two ML models

This isn’t necessarily wrong. But it may help to ask yourself a few questions about what sort of insight you can expect from any particular modeling endeavor. If you calculated the mean F1 score for one of your models, do you believe it’s a decent approximation of that model’s “true” F1 score? If so, why are you searching for “a better representation”? If not, why not?
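One rough way to probe the first question is to look at how much the per-fold F1 scores vary around their mean. The sketch below is only an illustration: the `f1_scores` values are made up, and `bootstrap_interval` is a hypothetical helper, not something from your setup or from any library.

```python
import random
import statistics

# Hypothetical per-fold F1 scores for one model (made-up numbers for illustration).
f1_scores = [0.81, 0.79, 0.84, 0.80, 0.62, 0.83, 0.82, 0.78, 0.85, 0.80]

mean_f1 = statistics.mean(f1_scores)
# Standard error of the mean: sample standard deviation / sqrt(number of folds).
se_f1 = statistics.stdev(f1_scores) / len(f1_scores) ** 0.5


def bootstrap_interval(values, n_resamples=5000, level=0.90, seed=0):
    """Percentile bootstrap interval for the mean (hypothetical helper)."""
    rng = random.Random(seed)
    # Resample the folds with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


low, high = bootstrap_interval(f1_scores)
print(f"mean F1 = {mean_f1:.3f}, standard error = {se_f1:.3f}")
print(f"90% bootstrap interval for the mean: ({low:.3f}, {high:.3f})")
```

If the interval is narrow enough for your purposes, the plain mean may already be all the “representation” you need. If one fold sits far from the rest (like the 0.62 in these made-up numbers), that is a hint there may be something worth modeling.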

The standard scenario in which modeling (of any sort) becomes useful or necessary is when the answers you want lie beyond the data you have on hand. Maybe I have measurements of rainfall at various locations, but I know the sensors will fail to produce readings once temperatures rise above 40 degrees (biasing the means downward). Or maybe one of your models behaves very well on some data sets (folds) and very poorly on others, for reasons you think you understand. In each of these cases, your model would seek to describe some unobservable thing (censoring in the former case, data “problems” in the latter) in order to make better sense of the data you do observe.
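To make the rainfall example concrete, here is a small simulation sketch. It rests on an assumption I’m adding purely for illustration: that hotter days tend to have more rainfall, which is what would make the missing readings pull the mean downward. The slope, noise level, and temperature range are all invented.

```python
import random
import statistics

rng = random.Random(1)

# Assumption for illustration only: rainfall tends to increase with temperature.
# The slope (0.5), noise (sd 5), and temperature range (10-50) are made up.
readings = []
for _ in range(10_000):
    temperature = rng.uniform(10, 50)  # degrees
    rainfall = max(0.0, 0.5 * temperature + rng.gauss(0, 5))
    readings.append((temperature, rainfall))

# Mean over every day, including the days the sensors would have missed.
true_mean = statistics.mean(r for _, r in readings)

# The sensors fail above 40 degrees, so only these readings are ever recorded.
observed = [r for t, r in readings if t <= 40]
naive_mean = statistics.mean(observed)

print(f"true mean rainfall:      {true_mean:.2f}")
print(f"mean of observed values: {naive_mean:.2f}")
print(f"readings lost to sensor failure: {len(readings) - len(observed)}")
```

Under that assumption the mean of the recorded values comes out below the mean over all days. No amount of averaging the recorded values recovers the difference, which is the gap a model of the censoring process would try to account for.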

Hope that helps contextualize my previous comments.
