Summarize inference data (HDI)

Hi!

When I summarize the statistics of my inference data using az.summary(), one of the columns of the dataframe is hdi_3%. A highest density interval is supposed to be a range of two values, but in the summarized data each entry of this column is one specific value. What does this value mean?

I am sorry if this is a very basic question.

Thank you in advance!

1 Like

When requesting the 94% HDI (arviz’s current default), there should be 2 columns, one representing the left end of the interval (the value below which 3% of the posterior falls) and one representing the right end of the interval (the value above which 3% of the posterior falls). So hdi_3% is the former and there should be a hdi_97% column with the latter.

1 Like

To just add to @cluhmann , I found the following command very handy:

az.hdi(trace,var_names=["XXX"], hdi_prob = 0.80).values()

It will print out the hdi of whatever variables you want from your trace with the specified hdi_prob value.

3 Likes

Hi!

I am sorry (again) if this is a very basic question, but I am really confused. I understand that the HDI gives us the lowest credible interval of a distribution that gives us a specific probability. How do we calculate the HDI for samples from a distribution?

In addition, I cannot understand the difference between a HDI and a confidence interval.

The arviz code used to compute the HDI is here.

The difference between an HDI and a confidence interval is a much longer discussion. I recommend the wikipedia articles here and here.

3 Likes

Sorry, doesn’t this naming assume that the HDI and the equal-tail intervals coincide?

Which naming?

hdi_3% / hdi_97%.

A simple example that shows the issue:

import arviz as az
import pymc as pm

with pm.Model() as model:
    pm.Exponential('exp', lam=.01)
    trace = pm.sample(10000)

print(az.summary(trace))

Output:

       mean      sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
exp  100.48  99.953   0.007  280.653      0.805    0.583   10419.0    9841.0    1.0

But there isn’t 3% of the posterior below 0.007, there is almost 0% of it. 3% of the posterior is under about 3.046. And at the other end of the interval, it’s about 94% of the posterior that’s under 280.653, not 97%. 97% is under ~350.66.

It’s correct as an HDI, but there aren’t 3% of the posterior on each side, contrary to what the names of the columns imply; there is 0% on the left and 6% on the right.

Correct, my initial description of the intervals was for equal-tailed intervals, which HDIs are typically not (but are sometimes!). But worse than that, there are many HDIs (intervals that include N% of the mass) that are not equal-tailed. The arviz hdi function returns exactly one which is the HDI with the minimum width. There are also issues once you are dealing with distributions that are mutlimodal. The bottom line is that saying you are presenting “an HDI” doesn’t really indicate what you have constructed (or how).

As far as I know, this is the defining feature of an HDI (HD = “highest-density”), isn’t it? But thanks for confirming my suspicion about the naming. Would you happen to know of another possible reason why the bounds are named 3% and 97%? Should I perhaps open an issue on ArviZ’s GitHub?

This is what I get for answering quickly. There are not unique credible intervals, the credible intervals of minimum width are highest density intervals and are unique for unimodal distributions (but not otherwise).

You can definitely open an issue to inquire about the interval end naming. I would be curious to see what they say.

CC @OriolAbril

Yeah, the naming in arviz.summary is not ideal. Not 100% sure why (might have used ETIs at some point or aim to support both or just be a confusion). The hdi function doesn’t use this naming and sets as coordinates 'lower', 'higher'. Not sure if we already have an issue but a PR would be welcome. Thinking out loud, to keep the column names short for the summary dataframe, we might want to use hdi_low and hdi_up?

Note now there is a stats_focus argument (last example in the docstring) and naming for ETI should continue to have the specific probabilities.