Calculating WAIC for models with multiple likelihood functions

I try to predict the outcome of soccer games and I use the following model:

    with pm.Model() as model:
      # global model parameters
       h = pm.Normal('h', mu = 0, tau = 0.0001)
       sd_a = pm.Gamma('sd_a', .1, .1) 
       sd_d = pm.Gamma('sd_d', .1, .1) 
       alpha = pm.Normal('alpha', mu=0, tau = 0.0001)

      # team-specific model parameters
       a_s = pm.Normal("a_s", mu=0, sd=sd_a, shape=num_teams)
       d_s = pm.Normal("d_s", mu=0, sd=sd_d, shape=num_teams)

       atts = pm.Deterministic('atts', a_s - tt.mean(a_s))
       defs = pm.Deterministic('defs', d_s - tt.mean(d_s))
       h_theta = tt.exp(alpha + h + atts[home_team] + defs[away_team])
       a_theta = tt.exp(alpha + atts[away_team] + defs[home_team])

    # likelihood of observed data
       h_goals = pm.Poisson('h_goals', mu=h_theta, observed=observed_h_goals)
       a_goals = pm.Poisson('a_goals', mu=a_theta, observed=observed_a_goals)

When I sample the model, the trace plots look fine.

Afterward when I want to calculate the WAIC:

waic = pm.waic(trace, model)

I get the following error:

----> 1 waic = pm.waic(trace, model)

~\Anaconda3\envs\env\lib\site-packages\pymc3\stats\ in wrapped(*args, **kwargs)
     22                 )
     23                 kwargs[new] = kwargs.pop(old)
---> 24             return func(*args, **kwargs)
     26     return wrapped

~\Anaconda3\envs\env\lib\site-packages\arviz\stats\ in waic(data, pointwise, scale)
   1176     """
   1177     inference_data = convert_to_inference_data(data)
-> 1178     log_likelihood = _get_log_likelihood(inference_data)
   1179     scale = rcParams["stats.ic_scale"] if scale is None else scale.lower()

~\Anaconda3\envs\env\lib\site-packages\arviz\stats\ in get_log_likelihood(idata, var_name)
    403             var_names.remove("lp")
    404         if len(var_names) > 1:
--> 405             raise TypeError(
    406                 "Found several log likelihood arrays {}, var_name cannot be None".format(var_names)
    407             )

TypeError: Found several log likelihood arrays ['h_goals', 'a_goals'], var_name cannot be None

Is there any way to calculate WAIC and compare models when I have two likelihood functions in pymc3? (1: the goals score by the home team 2: the goals scored by the away team)


Hi Delbar,
This is precisely an ArviZ GSoc project.

Until this is done, a workaround is to manually combine log likelihood data into a single array and store it in sample_stats as log_likelihood variable. Here is a WIP example of how to do that, by @OriolAbril.

Hope this helps :vulcan_salute:


Awesome, thanks!

Any idea on when they expect to be done?

Unfortunately, I don’t think we have a candidate for this project yet, so ETA == TBD :confused:

I had some time to join the conversation, I hope I can be of some help. I’ll start with the main take away which can be quite a bomb until you wrap your head around cross-validation and then explain why.

I want to begin by clarifying something that I think is of key importance to handle expectations of what ArviZ does. When working with several observed variables, there is not a single way of computing waic/loo nor or performing cross-validation.

Therefore, ArviZ will probably never work when doing az.loo(idata) if there are multiple observed values. And, if we end up setting a default way of combining pointwise log likelihood values, it may not be what you want it to do, there will still be several correct ways of performing cross validation on the same model.

The key to choose the approach corresponding to the task at hand is first identifying the prediction task whose predictive accuracy we want to know and then make sure the exchangeability criteria is met (see for this blog post for more details).

In this example, the two prediction tasks that come right to mind are predicting the outcome of a new match or predicting the hyperparameters of a new team. Moreover, we could even be interested in the predictive accuracy of predicting the number of goals scored by either team, or the number of goals scored by the away team only.

Disclaimers: I have not checked the exchangeability criteria of the examples below. I use them to illustrate only ArviZ+xarray usage, not statistical correctness. If I have time I may sit down and look carefully at the model, still, I think ArviZ usage alone is still a good enough reason to write this answer.

I have used the data from the rugby example (it looks like the model is based on it) so the names will differ from your model, but the idea should get through. The good news is that I have checked the code runs :smile:

The first step is to convert the PyMC3 trace to ArviZ inference data object. I did this with:

dims = {
    "home_points": ["match"],
    "away_points": ["match"],
    "home_team": ["match"],
    "away_team": ["match"],
    "atts": ["team"],
    "atts_star": ["team"],
    "defs": ["team"],
    "defs_star": ["team"],
with model:
    # this could be done in the beginning of the model definition
    # using pm.Data tells ArviZ to store this data in constant_data group
    home_team = pm.Data("home_team", home_team)
    away_team = pm.Data("away_team", away_team)
    idata = az.from_pymc3(trace, dims=dims)

# define helpers to make answer less verbose
log_lik = idata.log_likelihood
const = idata.constant_data

Leave one match out

If we wanted to predict the outcome of a future match, we would have to add the pointwise log likelihood corresponding to the home and away team on a per match basis. The code to achieve this is the following:

idata.sample_stats["log_likelihood"] = log_lik.home_points + log_lik.away_points
# Output
# Computed from 3000 by 60 log-likelihood matrix
#           Estimate       SE
# elpd_waic  -551.28    37.96
# p_waic       46.16        -
# There has been a warning during the calculation. Please check the results.

Leave one team out

We now have to group pointwise log likelihood values on a per team basis:

home_points_team = log_lik.home_points.groupby(const.home_team).sum().rename({"home_team": "team"})
away_points_team = log_lik.away_points.groupby(const.away_team).sum().rename({"away_team": "team"})
idata.sample_stats["log_likelihood"] = home_points_team + away_points_team
# Output
# Computed from 3000 by 6 log-likelihood matrix
#           Estimate       SE
# elpd_waic  -540.84    19.72
# p_waic       24.78        -
# There has been a warning during the calculation. Please check the results.

Leave one observation out

To get the predictive accuracy of predicting the number of goals (if it made sense and we were interested in this quantity) we have to concatenate the arrays instead of adding their values like we did in the first example:

import xarray as xr
idata.sample_stats["log_likelihood"] = xr.concat(
    (log_lik.home_points, log_lik.away_points), 
).rename({"match": "observation"})
# Output
# Computed from 3000 by 120 log-likelihood matrix
#           Estimate       SE
# elpd_waic  -550.95    35.23
# p_waic       45.77        -
# There has been a warning during the calculation. Please check the results.

Leave one observation out (away team only)

idata.sample_stats["log_likelihood"] = log_lik.away_points
# Output
# Computed from 3000 by 60 log-likelihood matrix
#           Estimate       SE
# elpd_waic  -268.79    23.07
# p_waic       20.56        -
 # There has been a warning during the calculation. Please check the results.

Useful references

Regarding this specific example, the most useful reference is probably the Cross-validation for hierarchical models case study, the example there is quite similar to this one.

If facing a specific but kind of unknown/undefined problem, the cross-validation FAQ is a good place to start searching and get pointed in the right direction.

And as always, to get a more general and conceptual idea on the subject of cross-validation and information criteria, I’d recommend a book (and/or a course following a book). is both a recommendation list of good books on the subject and a repository containing translations of code and exercises of said books to PyMC3.


Thanks a lot Oriol, that’s super useful!
Just a small question, to be sure: the difference between “leave one match out” (i.e predicting the outcome of a future match) and “leave one observation out” (i.e predicting the number of goals) is that in the first case you’re interested in the detailed outcome (e.g 2 goals for home team and 1 for away team), while in the second case you’re interested in the aggregate outcome (e.g 3 goals were scored in the whole match)?

1 Like

In predicting the outcome of a future match, we are considering our random variable to be intrinsically multivariate, the pointwise likelihood is then p(\mathbf{y_i}|x_i, \theta) where \mathbf{y_i} = (y_i^{(h)}, y_i^{(a)}), \theta are the model parameters and x_i would encode which team plays home and which team corresponds to each superindex.

It is therefore different from the aggregate outcome as the aggregate outcome would not distinguish between home and away goals. That is, if we only look at the aggregate, \mathbf{y_i}=(1,2) is equivalent to \mathbf{y_i}=(2,1), \mathbf{y_i}=(3,0) and \mathbf{y_i}=(0,3). I tried to be more specific with the pointwise log likelihoods that correspond to these two cases which gets more mathematical, I hope it helps and the notation is not too obscure.

In predicting one observation (which now I realize is probably bad naming), the pointwise likelihood is p(y_i|x_i, \theta) now with y_i being a scalar and x_i and \theta encoding the same info.

The specifics of writing the exact pointwise likelihood function would still depend on the parametrization, but to define the predictive task, the definition of y_i is what matters. One possible parametrization for the match case would be:

p(\mathbf{y_i}| \mathbf{t_i}, ...) = \text{Pois}(y_i^{(h)}; \alpha + h + atts_{t_i^{(h)}} + defs_{t_i^{(a)}}) \text{Pois}(y_i^{(a)}; \alpha + atts_{t_i^{(a)}} + defs_{t_i^{(h)}})

and for the observation case:

p(y_i| h_i, \mathbf{t_i}, ...) = \text{Pois}(y_i; \alpha + h h_i + atts_{t_i^{h_i}} + defs_{t_i^{h_i}})

in both cases, atts and defs are vectors of length team containing the team performaces and \mathbf{t_i} contains the team indexes (needed to get the correct values from atts and defs). In this second case, h_i would be a binary variable being 1 if we aim to predict the goals scored by the home team and 0 if we aim to predict the goals scored by the away team. Therefore h_i determines whether or not to include the home effect and it also determines which component of \mathbf{t_i} is to be used to index atts and which should index defs.

1 Like

@OriolAbril - I’m wondering if there’s a way to add the log-likelihoods in a for loop. I have multiple log-likelihoods that I’d like to add programmatically instead of hard-coding them.

For example -

loglik = idata.log_likelihood

And instead of doing -

idata.sample_stats["log_likelihood"] = log_lik.home_points + log_lik.away_points

I’d like to do something like -

for i in ['home_points', 'away_points']:
    idata.sample_stats["log_likelihood"] += log_lik[i]

I’m wondering if there is a way to initialize the log_likelihood key in sample_stats so that I don’t get a key error while trying to add multiple log likelihoods in a loop

You should be able to do something like:

var_names = ['home_points', 'away_points']
idata.log_likelihood = xr.concat(
    [idata.log_likelihood[var_name] for var_name in var_names], 

This aligns all the dataarrays and concatenates them along a new dimension aux_dim which is then reduced with sum.

Also take a look at this notebook and store the results on the log_likelihood group as we deprecated storing them in the sample stats one.

This is very helpful! Thanks! What version of PyMC3 is this? I’ll update PyMC3 and Theano then…

The return_inferencedata=True didn’t exist earlier I believe.

Also, I try the following -

idata.log_likelihood = xr.concat(
    [idata.log_likelihood[var_name] for var_name in var_names], 

Then I try to use the az.waic as follows =

az.waic(idata), which gives me the error -
AttributeError: 'DataArray' object has no attribute 'data_vars'

I then try -

idata.log_likelihood['samplenamehere'] = xr.concat(
    [idata.log_likelihood[var_name] for var_name in var_names], 

Followed by -
az.waic(idata, var_name='samplenamehere')

which gives the error -
KeyError: 'chain'

Is there some version mismatch issue?

I have the following package versions -
pymc3 = 3.9.3
theano = 1.0.5
arviz = 0.10.0
xarray = 0.16.1

There should be no version mismatch after updating, can you share the output of idata.log_likelihood['samplenamehere']?

I am able to reproduce this for the example given in the ipynb, however this does not work for my test case and gives the following error -

KeyError                                  Traceback (most recent call last)
<ipython-input-42-1c1562dac054> in <module>
----> 1 az.waic(idata, var_name='samplenamehere')

~/.conda/envs/xo/lib/python3.6/site-packages/arviz/stats/ in waic(data, pointwise, var_name, scale)
   1405         raise TypeError('Valid scale values are "deviance", "log", "negative_log"')
-> 1407     log_likelihood = log_likelihood.stack(sample=("chain", "draw"))
   1408     shape = log_likelihood.shape
   1409     n_samples = shape[-1]

~/.conda/envs/xo/lib/python3.6/site-packages/xarray/core/ in stack(self, dimensions, **dimensions_kwargs)
   1869         DataArray.unstack
   1870         """
-> 1871         ds = self._to_temp_dataset().stack(dimensions, **dimensions_kwargs)
   1872         return self._from_temp_dataset(ds)

~/.conda/envs/xo/lib/python3.6/site-packages/xarray/core/ in stack(self, dimensions, **dimensions_kwargs)
   3386         result = self
   3387         for new_dim, dims in dimensions.items():
-> 3388             result = result._stack_once(dims, new_dim)
   3389         return result

~/.conda/envs/xo/lib/python3.6/site-packages/xarray/core/ in _stack_once(self, dims, new_dim)
   3339         # consider dropping levels that are unused?
-> 3340         levels = [self.get_index(dim) for dim in dims]
   3341         idx = utils.multiindex_from_product_levels(levels, names=dims)
   3342         variables[new_dim] = IndexVariable(new_dim, idx)

~/.conda/envs/xo/lib/python3.6/site-packages/xarray/core/ in <listcomp>(.0)
   3339         # consider dropping levels that are unused?
-> 3340         levels = [self.get_index(dim) for dim in dims]
   3341         idx = utils.multiindex_from_product_levels(levels, names=dims)
   3342         variables[new_dim] = IndexVariable(new_dim, idx)

~/.conda/envs/xo/lib/python3.6/site-packages/xarray/core/ in get_index(self, key)
    371         """Get an index for a dimension, with fall-back to a default RangeIndex"""
    372         if key not in self.dims:
--> 373             raise KeyError(key)
    375         try:

KeyError: 'chain'

I believe, this might have to do with how I’m (not) specifying dimensions and coordinates, especially because the two models that I’m trying to combine the log-likelihoods for, have different sizes, i.e. the sample size is different for each one. I’ll dig deeper into how dimensions and coordinates are used, since I have not been using them with exoplanet so far.

This is probably the reason, because it makes xr.concat unable to properly concatenate the arrays.

I believe this should not be an issue because ArviZ automatically creates the chain and the draw dimensions, which is why it assumes they will be present. Here, your idata.log_likelihood['samplenamehere'] somehow does not have the chain dim (probably due to bad concatenating) and waic fails when trying to stack chain and draw dimensions.

Is there a standard way to combine log likelihoods (i.e. concatenate the arrays) for samples with different sizes? I would imagine that piecewise addition fails here.

I’d refer back to this example notebook I mentioned a while back. I am quite sure that this concatenation is not the answer to what your are trying to know, different shapes most probably mean that different predictive tasks are being defined.

For the sake of completeness, xarray can definitely do that even if automatic alignment and broadcasting fails by using some manual calls to xr.align and/or xr.broadcast.

Do we need to combine them piecewise? Is it not equivalent to add the sums of the piecewise log likelihood, i.e. log_like['part_a'].sum() + log_like['part_b'].sum()

I guess that might work for other calculations which require the log-likelihood, perhaps something like the bayesian evidence (ln Z)? But not for the waic and loo calculations within pymc3.

It is not equivalent because what waic and loo are trying to estimate is the predictive accuracy on unseen data. To do so, each uses a different approximation to estimate this quantity of interest whithout having any unseen data available. In loo’s case for example, the approximation is base on the fact that p(y^{i}|y_{-i}), the probability of observing the i-th datapoint conditional on all the other datapoints but itself can be approximated by p(y^i|\theta)^{-1}.

p(y^{i}|y_{-i}) is what one would calculate when doing exact leave one out cross validation. y^i is the holdout data and y_{-i} the data available to the model. Thanks to this approximation and to having the pointwise log likelihood values p(y^i|\theta) we can approximate this predictive accuracy on unseen data. We need to have the pointwise data and at the same time the pointwise data defines how do we do this cross validation on the observed data.

A more detailed explanation on both waic and loo can be found in


That makes sense. Thank you for the explanation @OriolAbril… That also explains why combining the log-likelihoods with different sizes is much trickier. I’ll go through the reference too.

At least in astronomy applications, this is a problem we often encounter since the sampling can be very non-uniform, and at different intervals across instruments. Later if and when time permits, I think this would be a very useful feature to include if possible…

In the meantime I can perhaps calculate AIC, BIC, etc using the summed log likelihood and number of parameters.


It would be great if you could create a new topic explaining your situation in some more detail. I would not recommend switching to AIC or DIC, they are trying to estimate the same quantity and do a worse job, especially in a Bayesian setting and non gaussian posteriors. They also lack warnings for when the estimation fails, so even if you manage to get a value for them, I would not trust it.

I’ll try to use an astronomy example to see if I can better explain why I think the underlying issue that needs adressing is the definition of the predictive task.

Imagine you have observations of AU Mic’s lightcurve that you want to use to study the exoplanet transiting it, but some observations are from the spitzer telescope and others are from tess. I guess you would have a physical model of the transit that would then be unfolded and added an instrumental component that would differ between the data coming from tess and the one from spitzer. Due to this difference in their instrumental component, you would probably have 2 likelihood functions, one for each instrument. Let’s assume the tess lightcurve has 30 points (I’ll call these datapoints observations) and spitzer one has 13 observations.

So far, regarding the observations and likelihood, this model is extremely similar to the example I shared above that uses football data. It has one likelihood for away goals and another for home goals. Therefore, here like in the football example we can define multiple predictive tasks. They are all technically correct, but answer different questions!

One task could be assessing the accuracy of predicting spitzer observations only, thus we’d do the same we do in the example when assessing the predictive accuracy of away goals only. This cpuld be useful for example to compare this model with an old model that used only spitzer observations, and see if adding the tess data improves the accuracy.

Likewise, the same can be done with tess observations only.

Another option is to asses the accuracy of predicting any observation, independently of the instrument that recorded it. Again, this is like the “predicting the scored goals per match and per team” example. We are concatenating the two arrays but not adding them (note that concatenating the two arrays of length 240 into one 240x2 or one 480 array makes no difference precisely because we are not adding the values between them)

In this lightcurve case however, the “predicting the outcome of a match” thus summing the away and home likelihoods pointwise does not make sense. You would be defining the predictive task as predicting one tess and one spitzer observation, but many questions arise: Does this question answer anything of interest? How do you pair tess with spitzer observations? Why are you pairing them?