DIC, WAIC, WBIC on regression tasks

Hello everyone,
I have a few questions. The first one is related to the evaluation of models that are not predictive. I am interested in fitting a Poisson GLM to describe the coefficients of the covariates that are correlated with my dependent variable; thus, I am not interested in out-of-sample predictions.

I’ve seen many papers using DIC as an evaluation metric (e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4282098/). On the PyMC3 GitHub I have also seen many people suggest WAIC as an evaluation metric. After reading the paper “Understanding predictive information criteria for Bayesian models” (https://link.springer.com/article/10.1007/s11222-013-9416-2), my questions are:

  1. Isn’t WAIC “just” related to out-of-sample/generalization prediction? So does it make sense to use it in my project?

  2. What should I do when I get these warnings? Can I solve them somehow? (It is not clear to me how.) If not, can I trust WAIC?

     DIC 217.868591309
     /home/nadai/.local/lib/python3.4/site-packages/pymc3/stats.py:213: UserWarning: For one or more samples the posterior variance of the
             log predictive densities exceeds 0.4. This could be indication of
             WAIC starting to fail see http://arxiv.org/abs/1507.04544 for details
     WAIC WAIC_r(WAIC=158.42521828487168, WAIC_se=2.3092993247226805, p_WAIC=7.2801294)
     /home/nadai/.local/lib/python3.4/site-packages/pymc3/stats.py:278: UserWarning: Estimated shape parameter of Pareto distribution is
             greater than 0.7 for one or more samples.
             You should consider using a more robust model, this is
             because importance sampling is less likely to work well if the marginal
             posterior and LOO posterior are very different. This is more likely to
             happen with a non-robust model and highly influential observations.
     LOO LOO_r(LOO=164.91826, LOO_se=2.4069428499178245, p_LOO=10.526650309191908)
    
  3. Are papers using DIC in hierarchical models all wrong?

Thanks!

Hi Marco,

  1. Both DIC and WAIC are related to out-of-sample/generalization prediction. I think this is a generally good metric for evaluating models, even when you care more about the parameters than about the predictions. The general idea is that if your model and parameters are a good description of the underlying phenomenon or process that you are studying, they should be able to predict unobserved (but observable) future data.

  2. If you get a warning you have a couple of options (besides ignoring it): use LOO instead of WAIC (or vice versa), use K-fold cross-validation, or change your model to one that is more robust. Of course, to compare your models you can also add posterior predictive checks (although this is an in-sample analysis) and background information to the mix; see the sketch after this list.
    A little bit more about the warnings: they are based on empirical observations. It is my opinion that we need more work on this, but at this point this is the best thing we have. I have been thinking about adding tools to help diagnose, or at least visualize, the problematic points, thanks for reminding me about this! Notice that when using DIC/BPIC you always get a nice result without any warnings (even if the assumptions are not met), and that could lead to overconfidence!

  3. DIC assumes the posterior is Gaussian; the more you move away from this assumption, the more misleading the values of DIC will be. Someone correct me if I am wrong, but it is my understanding that hierarchical models tend to have non-Gaussian posteriors. Also, WAIC is more Bayesian because you are averaging over the posterior distribution.
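
Here is a minimal sketch of the point in 2, assuming `model` and `trace` come from your own `pm.Model()` / `pm.sample()` run; it computes WAIC and LOO side by side and adds an in-sample posterior predictive check:

    import arviz as az
    import pymc3 as pm

    # A sketch; `model` and `trace` are assumed to come from your own
    # pm.Model() / pm.sample() run on the Poisson GLM.
    with model:
        # draw posterior predictive samples for an in-sample check
        ppc = pm.sample_posterior_predictive(trace)

    idata = az.from_pymc3(trace=trace, posterior_predictive=ppc, model=model)

    print(az.waic(idata))  # WAIC (warns when the variance of the log densities is large)
    print(az.loo(idata))   # PSIS-LOO (warns when Pareto k estimates are large)
    az.plot_ppc(idata)     # in-sample posterior predictive check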


Thanks, your answer was very complete! The second point is especially important to me.

What about WAIC/DIC in the presence of random effects, such as in the Conditional Autoregressive (CAR) model? If I compute DIC/WAIC on (i) a model with just the random-effect variables and (ii) a model with random effects AND covariates, I always get a lower or equal DIC/WAIC for the former.

How do you interpret these results? I think the random effects confuse the metrics, don’t they? Do you have any experience with this?

I ask because this is one of my sources of confusion about the evaluations.

You are welcome.

I am not familiar with the CAR model (I will read about it!), but if you can consider your data points as independent then you can use WAIC. I think Gelman has an example where they have a time series, but for their purpose they model it as a simple linear regression, and hence they can apply WAIC.
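
That said, once you have fitted both of the models you mention, you can put them side by side. A sketch, where `idata_re` and `idata_re_cov` are hypothetical InferenceData objects for your random-effects-only and random-effects-plus-covariates models:

    import arviz as az

    # Hypothetical fitted models from the question above:
    #   idata_re      -> random effects only
    #   idata_re_cov  -> random effects + covariates
    comparison = az.compare(
        {"random effects": idata_re, "random effects + covariates": idata_re_cov},
        ic="waic",
    )
    print(comparison)  # ranked table with WAIC, standard errors and model weights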

I am trying to understand the limitations of WAIC, so if I find something I will let you know. Thanks for your question.

I used WAIC and LOO, and both of them raised warnings:
WAIC: UserWarning: For one or more samples the posterior variance of the log predictive densities exceeds 0.4. This could be indication of WAIC starting to fail.
And LOO: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.

Does that mean my model is not good? Can I still use WAIC and LOO as a reference for model selection?

It means that you have at least one observation that may be problematic. This could be a problem in your data, like an input mistake (you wrote 200 instead of 2), or more generally the model is not able to actually model that observation (or observations). For example, you are modeling count data using a Poisson distribution, but your data is overdispersed, so a NegativeBinomial will probably be a better idea. To help diagnose the problem you can use LOO and functions like arviz.plot_khat and arviz.plot_elpd.
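
A sketch of the overdispersion point, assuming `X` and `y` are your design matrix and observed counts (the priors here are arbitrary, not a recommendation):

    import arviz as az
    import pymc3 as pm

    # A sketch, not your exact model: swapping Poisson for NegativeBinomial
    # when the counts are overdispersed. `X` and `y` are assumed given.
    with pm.Model() as nb_model:
        beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X.shape[1])
        alpha = pm.Exponential("alpha", 1.0)  # dispersion parameter
        mu = pm.math.exp(pm.math.dot(X, beta))
        pm.NegativeBinomial("y_obs", mu=mu, alpha=alpha, observed=y)
        nb_trace = pm.sample(2000, tune=1000)

    nb_loo = az.loo(az.from_pymc3(nb_trace, model=nb_model), pointwise=True)
    az.plot_khat(nb_loo)  # flags observations whose Pareto k is still too high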

If you are seeing those warnings, it means that the approximations used to compute WAIC and LOO may not be reliable, so it is better to solve those problems. Alternatively, you can use az.loo to get the ELPD of the non-problematic observations (k hat < 0.7) and then explicitly compute the ELPD for the problematic observations (k hat > 0.7) by refitting the model and actually leaving one observation out. Of course, this is only a good idea if you have just a few problematic points; otherwise the cost of refitting the model many times will be too high. You can read more about this in the loo package vignettes (Articles • loo).
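
A sketch of the first step, identifying which observations would need an exact refit (assuming `idata` is the InferenceData from your fitted model, with log likelihood values stored):

    import numpy as np
    import arviz as az

    # Identify the observations PSIS-LOO cannot handle reliably.
    loo_res = az.loo(idata, pointwise=True)
    bad = np.where(loo_res.pareto_k > 0.7)[0]
    print("observations needing an exact refit:", bad)

    # For each index in `bad`, refit the model on the data without that
    # observation, average the likelihood of the held-out point over the new
    # posterior draws, and replace the corresponding entry of loo_res.loo_i.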


I tried to plot the khat values and the result is as shown; only one data point is over 0.7.