I had some time to join the conversation; I hope I can be of some help. I'll start with the main takeaway, which can be quite a bomb until you wrap your head around cross-validation, and then explain why.
I want to begin by clarifying something that I think is of key importance to manage expectations about what ArviZ does: when working with several observed variables, there is no single way of computing waic/loo nor of performing cross-validation.
Therefore, ArviZ will probably never work when simply doing az.loo(idata) if there are multiple observed variables. And even if we end up setting a default way of combining pointwise log likelihood values, it may not be what you want it to do; there will still be several correct ways of performing cross-validation on the same model.
The key to choosing the approach that corresponds to the task at hand is to first identify the prediction task whose predictive accuracy we want to estimate, and then make sure the exchangeability criterion is met (see this blog post for more details).
In this example, the two prediction tasks that come to mind right away are predicting the outcome of a new match and predicting the hyperparameters of a new team. Moreover, we could even be interested in the predictive accuracy of predicting the number of goals scored by either team, or the number of goals scored by the away team only.
Disclaimer: I have not checked the exchangeability criteria of the examples below; I use them only to illustrate ArviZ+xarray usage, not statistical correctness. If I have time I may sit down and look carefully at the model; still, I think the ArviZ usage alone is a good enough reason to write this answer.
I have used the data from the rugby example (it looks like the model is based on it), so the names will differ from your model, but the idea should get through. The good news is that I have checked that the code runs.
The first step is to convert the PyMC3 trace to an ArviZ InferenceData object. I did this with:
import arviz as az
import pymc3 as pm

dims = {
    "home_points": ["match"],
    "away_points": ["match"],
    "home_team": ["match"],
    "away_team": ["match"],
    "atts": ["team"],
    "atts_star": ["team"],
    "defs": ["team"],
    "defs_star": ["team"],
}
with model:
    # this could be done at the beginning of the model definition;
    # using pm.Data tells ArviZ to store this data in the constant_data group
    home_team = pm.Data("home_team", home_team)
    away_team = pm.Data("away_team", away_team)
    idata = az.from_pymc3(trace, dims=dims)

# define helpers to make the answer less verbose
log_lik = idata.log_likelihood
const = idata.constant_data
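Before combining anything, it can help to check the dimensions ArviZ attached to the pointwise log likelihood arrays (this check is my addition, not part of the original workflow); with the dims mapping above, each array should have chain, draw and match dimensions:
# sanity check of the dimensions (assumes the dims mapping above was applied)
print(log_lik.home_points.dims)  # expected: ("chain", "draw", "match")
print(log_lik.away_points.dims)  # expected: ("chain", "draw", "match")
print(const.home_team.dims)      # expected: ("match",)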
Leave one match out
If we wanted to predict the outcome of a future match, we would have to add the pointwise log likelihood values of the home and away teams on a per-match basis. The code to achieve this is the following:
idata.sample_stats["log_likelihood"] = log_lik.home_points + log_lik.away_points
az.waic(idata)
# Output
# Computed from 3000 by 60 log-likelihood matrix
#
# Estimate SE
# elpd_waic -551.28 37.96
# p_waic 46.16 -
#
# There has been a warning during the calculation. Please check the results.
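WAIC is used throughout this answer, but PSIS-LOO works exactly the same way on the combined array; a minimal sketch, assuming the same ArviZ version that reads the log likelihood from sample_stats as set above:
# PSIS-LOO on the same per-match pointwise log likelihood
az.loo(idata)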
Leave one team out
We now have to group the pointwise log likelihood values on a per-team basis:
home_points_team = log_lik.home_points.groupby(const.home_team).sum().rename({"home_team": "team"})
away_points_team = log_lik.away_points.groupby(const.away_team).sum().rename({"away_team": "team"})
idata.sample_stats["log_likelihood"] = home_points_team + away_points_team
az.waic(idata)
# Output
# Computed from 3000 by 6 log-likelihood matrix
#
# Estimate SE
# elpd_waic -540.84 19.72
# p_waic 24.78 -
#
# There has been a warning during the calculation. Please check the results.
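As a quick check (again my addition), the grouped arrays should keep the chain and draw dimensions and gain a team dimension of length 6 (the number of teams in the rugby data), which is why the output above reports a 3000 by 6 matrix:
# per-team arrays: (chain, draw, team), with 6 teams in the rugby example
print(home_points_team.dims)           # expected: ("chain", "draw", "team")
print(home_points_team.sizes["team"])  # expected: 6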
Leave one observation out
To get the predictive accuracy of predicting the number of goals scored (if this made sense and we were interested in this quantity), we have to concatenate the arrays instead of adding their values like we did in the first example:
import xarray as xr

idata.sample_stats["log_likelihood"] = xr.concat(
    (log_lik.home_points, log_lik.away_points),
    "match",
).rename({"match": "observation"})
az.waic(idata)
# Output
# Computed from 3000 by 120 log-likelihood matrix
#
# Estimate SE
# elpd_waic -550.95 35.23
# p_waic 45.77 -
#
# There has been a warning during the calculation. Please check the results.
Leave one observation out (away team only)
If we only cared about the goals scored by the away team, we can use its pointwise log likelihood directly:
idata.sample_stats["log_likelihood"] = log_lik.away_points
az.waic(idata)
# Output
# Computed from 3000 by 60 log-likelihood matrix
#
# Estimate SE
# elpd_waic -268.79 23.07
# p_waic 20.56 -
#
# There has been a warning during the calculation. Please check the results.
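One practical note to close the examples (my addition): each variant overwrites sample_stats["log_likelihood"], so if you want to keep several of them around at once, you can work on copies of the InferenceData object, for example with a plain deepcopy (the variable names below are just illustrative):
from copy import deepcopy

# keep a separate copy per prediction task
idata_match = deepcopy(idata)
idata_match.sample_stats["log_likelihood"] = log_lik.home_points + log_lik.away_points
az.waic(idata_match)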
Useful references
Regarding this specific example, the most useful reference is probably the Cross-validation for hierarchical models case study; the example there is quite similar to this one.
If you are facing a specific but somewhat undefined problem, the cross-validation FAQ is a good place to start searching and get pointed in the right direction.
And as always, to get a more general and conceptual grasp of cross-validation and information criteria, I'd recommend a book (and/or a course following a book). https://github.com/pymc-devs/resources is both a list of recommended books on the subject and a repository containing translations of the code and exercises of said books to PyMC3.