How to evaluate the performance of a Bayesian classifier?


I’m working on a multiclass classification problem using Bayesian logistic regression.
How can I evaluate the performance of a given model?

A frequentist-style approach would be to draw from the posterior predictive distribution, take the most probable class as a point prediction, and then apply the usual metrics (accuracy, precision, ...).
But in this way, I would lose the uncertainty regarding the predicted class.
Take for example a problem with four labels (1 to 4) and two classifiers.
The first has posterior predictive probabilities P(y=1)=0.1, P(y=2)=0.4, P(y=3)=0.38, P(y=4)=0.12, and the second has P(y=1)=0.5, P(y=2)=0.1, P(y=3)=0.1, P(y=4)=0.3. The true label is 3.
Both classifiers are wrong when using only the highest posterior probability.
But in my opinion, the first classifier is superior because it assigns a high probability to label 3, the true label.

Is there a way to quantify this superiority?
Or, more generally, what would be the right way to deal with classification in a Bayesian way?
I’ve thought about using divergence measures (Jensen-Shannon or Kullback-Leibler).
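
To make the divergence idea concrete, here is a rough sketch in Python (NumPy) of what I have in mind, not a definitive method. It treats the true label as a one-hot distribution and compares each classifier's predictive probabilities to it. The helper names (`kl`, `js`, `one_hot`) are my own, and I am assuming the second classifier puts 0.1 on label 3 so that its probabilities sum to 1.

```python
import numpy as np

# Predictive probabilities for labels 1..4 from the example above.
# Assumption: the second classifier assigns 0.1 to label 3 (so it sums to 1).
p_first = np.array([0.10, 0.40, 0.38, 0.12])
p_second = np.array([0.50, 0.10, 0.10, 0.30])
true_label = 3


def one_hot(label, n_classes):
    """Degenerate distribution putting all mass on the true label (1-based)."""
    q = np.zeros(n_classes)
    q[label - 1] = 1.0
    return q


def kl(q, p):
    """Kullback-Leibler divergence KL(q || p), using the convention 0*log(0) = 0."""
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))


def js(q, p):
    """Jensen-Shannon divergence between q and p (natural log)."""
    m = 0.5 * (q + p)
    return 0.5 * kl(q, m) + 0.5 * kl(p, m)


truth = one_hot(true_label, len(p_first))

# Point predictions: both classifiers pick a label other than 3.
pred_first = int(np.argmax(p_first)) + 1
pred_second = int(np.argmax(p_second)) + 1

# Divergence of each predictive distribution from the one-hot truth.
kl_first, kl_second = kl(truth, p_first), kl(truth, p_second)
js_first, js_second = js(truth, p_first), js(truth, p_second)

print("point predictions:", pred_first, pred_second)
print("KL(truth || p):", kl_first, kl_second)
print("JS(truth, p):  ", js_first, js_second)
```

With a one-hot truth, KL(truth || p) reduces to -log P(y = true label), so in this setup the comparison depends only on the probability each classifier assigns to label 3. Under that measure the first classifier scores better, even though both point predictions are wrong.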