Summarize inference data (HDI)

j_catulo · October 29, 2022, 7:46pm

Hi!

When I summarize the statistics of my inference data using az.summary(), one of the columns of the dataframe is hdi_3%. A highest density interval is supposed to be a range of two values, but in the summarized data each entry of this column is one specific value. What does this value mean?

I am sorry if this is a very basic question.

Thank you in advance!

cluhmann · October 29, 2022, 11:27pm

When requesting the 94% HDI (arviz’s current default), there should be 2 columns, one representing the left end of the interval (~~the value below which 3% of the posterior falls~~) and one representing the right end of the interval (~~the value above which 3% of the posterior falls~~). So hdi_3% is the former and there should be a hdi_97% column with the latter.

Ali_Mehrabifard · October 30, 2022, 1:04am

To just add to @cluhmann , I found the following command very handy:

az.hdi(trace, var_names=["XXX"], hdi_prob=0.80).values()

It will print out the hdi of whatever variables you want from your trace with the specified hdi_prob value.

j_catulo · November 30, 2022, 3:39pm

Hi!

I am sorry (again) if this is a very basic question, but I am really confused. I understand that the HDI is the narrowest credible interval of a distribution that contains a specified probability. How do we calculate the HDI for samples from a distribution?

In addition, I cannot understand the difference between a HDI and a confidence interval.

cluhmann · November 30, 2022, 4:57pm

The arviz code used to compute the HDI is here.
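For samples, one common approach is to sort the draws and slide a fixed-size window over them, keeping the narrowest window that covers the requested fraction. The sketch below illustrates that idea in plain Python. It is not ArviZ's actual code: `hdi_from_samples` is a hypothetical helper, and it assumes a unimodal distribution.

```python
import math
import random


def hdi_from_samples(samples, hdi_prob=0.94):
    """Narrowest interval spanning roughly hdi_prob of the samples.

    Hypothetical helper for illustration; assumes a unimodal distribution.
    """
    xs = sorted(samples)
    n = len(xs)
    # Index offset between the two ends of every candidate interval.
    span = int(math.floor(hdi_prob * n))
    if not 0 < span < n:
        raise ValueError("need more samples for this hdi_prob")
    # Try every window of that size and keep the narrowest one.
    start = min(range(n - span), key=lambda i: xs[i + span] - xs[i])
    return xs[start], xs[start + span]


random.seed(0)
# Draws from an Exponential with rate 0.01 (mean 100), a skewed distribution.
draws = [random.expovariate(0.01) for _ in range(20000)]
low, high = hdi_from_samples(draws, hdi_prob=0.94)

frac_below = sum(x < low for x in draws) / len(draws)
frac_above = sum(x > high for x in draws) / len(draws)
print(f"HDI: ({low:.3f}, {high:.3f})")
print(f"fraction below: {frac_below:.3f}, fraction above: {frac_above:.3f}")
```

For a skewed distribution like this one, the excluded mass does not have to be split evenly between the two sides of the interval.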

The difference between an HDI and a confidence interval is a much longer discussion. I recommend the wikipedia articles here and here.

sboukortt · January 11, 2024, 4:50pm

Sorry, doesn’t this naming assume that the HDI and the equal-tail intervals coincide?

cluhmann · January 11, 2024, 6:35pm

Which naming?

sboukortt · January 11, 2024, 7:17pm

hdi_3% / hdi_97%.

A simple example that shows the issue:

import arviz as az
import pymc as pm

with pm.Model() as model:
    pm.Exponential('exp', lam=.01)
    trace = pm.sample(10000)

print(az.summary(trace))

Output:

       mean      sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
exp  100.48  99.953   0.007  280.653      0.805    0.583   10419.0    9841.0    1.0

But there isn’t 3% of the posterior below 0.007, there is almost 0% of it. 3% of the posterior is under about 3.046. And at the other end of the interval, it’s about 94% of the posterior that’s under 280.653, not 97%. 97% is under ~350.66.

It’s correct as an HDI, but there isn’t 3% of the posterior on each side, contrary to what the column names imply; there is about 0% on the left and 6% on the right.
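The numbers above can be checked without sampling, since the Exponential has a closed-form quantile function. This is a small sketch using only the standard library. It assumes the same rate (`lam=0.01`) and the default `hdi_prob` of 0.94; `quantile` and `cdf` are helper names introduced here for illustration.

```python
import math

lam = 0.01        # rate of the Exponential in the example above
hdi_prob = 0.94   # ArviZ default mentioned earlier in the thread
tail = (1 - hdi_prob) / 2


def quantile(p):
    """Inverse CDF of Exponential(lam): the x with P(X <= x) = p."""
    return -math.log(1 - p) / lam


def cdf(x):
    """P(X <= x) for Exponential(lam)."""
    return 1 - math.exp(-lam * x)


# Equal-tailed interval: 3% of the mass is cut off on each side.
eti = (quantile(tail), quantile(1 - tail))

# The density is strictly decreasing, so the narrowest 94% interval
# starts at 0 and ends at the 94% quantile.
hdi = (0.0, quantile(hdi_prob))

print(f"ETI: ({eti[0]:.3f}, {eti[1]:.3f})")
print(f"HDI: ({hdi[0]:.3f}, {hdi[1]:.3f})")
print(f"mass below HDI: {cdf(hdi[0]):.2f}, mass above HDI: {1 - cdf(hdi[1]):.2f}")
```

The equal-tailed bounds match the ~3.046 and ~350.66 quoted above, while the HDI leaves nothing on the left and 6% on the right.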

cluhmann · January 11, 2024, 7:23pm

Correct, my initial description of the intervals was for equal-tailed intervals, which HDIs are typically not (but are sometimes!). But worse than that, there are many HDIs (intervals that include N% of the mass) that are not equal-tailed. The arviz hdi function returns exactly one, which is the HDI with the minimum width. There are also issues once you are dealing with distributions that are multimodal. The bottom line is that saying you are presenting “an HDI” doesn’t really indicate what you have constructed (or how).

sboukortt · January 11, 2024, 8:02pm

As far as I know, this is the defining feature of an HDI (HD = “highest-density”), isn’t it? But thanks for confirming my suspicion about the naming. Would you happen to know of another possible reason why the bounds are named 3% and 97%? Should I perhaps open an issue on ArviZ’s GitHub?

cluhmann · January 11, 2024, 9:19pm

This is what I get for answering quickly. Credible intervals are not unique; the credible intervals of minimum width are highest density intervals, and these are unique for unimodal distributions (but not otherwise).

You can definitely open an issue to inquire about the interval end naming. I would be curious to see what they say.

ricardoV94 · January 12, 2024, 6:54am

CC @OriolAbril

OriolAbril · January 12, 2024, 9:35am

Yeah, the naming in arviz.summary is not ideal. Not 100% sure why (might have used ETIs at some point or aim to support both or just be a confusion). The hdi function doesn’t use this naming and sets as coordinates 'lower', 'higher'. Not sure if we already have an issue but a PR would be welcome. Thinking out loud, to keep the column names short for the summary dataframe, we might want to use hdi_low and hdi_up?

Note that there is now a stat_focus argument (last example in the docstring), and the naming for ETIs should continue to include the specific probabilities.

bcoueraud87 · July 19, 2024, 12:17pm

I also have the same questions. I looked at the source code and yes ArviZ is computing the HDI correctly, but then the potential issue is in the names as noted by other users. Here is the relevant code:

hdi_post = hdi(dataset, hdi_prob=hdi_prob, multimodal=False, skipna=skipna)
hdi_lower = hdi_post.sel(hdi="lower", drop=True)
hdi_higher = hdi_post.sel(hdi="higher", drop=True)
metrics.extend((mean, sd, hdi_lower, hdi_higher))
metric_names.extend(("mean", "sd", f"hdi_{100 * alpha / 2:g}%", f"hdi_{100 * (1 - alpha / 2):g}%"))

So if you plug in the default value hdi_prob=0.94, then alpha=0.06 and you get the names hdi_3% and hdi_97%. But why are they doing so?
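To make the naming concrete, the f-strings quoted above can be run on their own. This is only a sketch: `summary_interval_names` is a hypothetical helper, and `alpha = 1 - hdi_prob` is assumed from the alpha=0.06 value given above.

```python
def summary_interval_names(hdi_prob=0.94):
    """Hypothetical helper reproducing the column-name f-strings quoted above."""
    # Assumption: alpha is the total probability left outside the interval.
    alpha = 1 - hdi_prob
    return (
        f"hdi_{100 * alpha / 2:g}%",
        f"hdi_{100 * (1 - alpha / 2):g}%",
    )


print(summary_interval_names(0.94))
print(summary_interval_names(0.80))
```

The labels depend only on hdi_prob, never on the data, so they are the percentiles an equal-tailed interval would have, whatever bounds the HDI actually returns.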
