`HalfNormal` scale of normal observations can cause 100% divergences and strange mass matrices

In a big model that I’m working with, I started to run into 100% divergences when I added some observations. It took me a long while to identify the problem, and in the process I came across something that I found very confusing and, at least in my opinion, worth sharing here.

Consider this simple model that I’ll sample using nutpie directly:

import numpy as np
import nutpie
import pymc as pm
from matplotlib import pyplot as plt

n1 = 150
n2 = 7
coords = {
    "dim1": range(n1),
    "dim2": range(n2),
}
# Synthetic observations; the first column is set exactly to 0,
# which is what triggers the divergences.
observed = np.random.normal(size=(n1, n2))
observed[:, 0] = 0
with pm.Model(coords=coords) as model:
    sigma = pm.HalfNormal("sigma", dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=0, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )
compiled = nutpie.compile_pymc_model(model)

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=True,
    store_mass_matrix=True,
)

This model has a single random variable, sigma, with shape dim2. The observations for the first entry of dim2 are all set to 0. Doing this causes 100% divergent samples.
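For anyone reproducing this, a quick way to confirm the divergence rate is to count the diverging flags in the sample stats. This is just a sketch: it assumes the default InferenceData return (return_raw_trace=False, as in the edit further down) and the usual ArviZ naming of the stat:

# Sketch: count divergences per chain from an InferenceData return
# (nutpie.sample called without return_raw_trace=True).
idata = nutpie.sample(compiled, tune=500, draws=100, chains=4, seed=1234)
print(idata.sample_stats.diverging.sum(dim="draw").values)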

With nutpie’s raw trace, I can grab a bunch of sample stats like the inverse mass matrix or the gradients.

# Collect the per-draw sampler stats from the raw trace of the first chain
t1 = trace[0][1].to_pandas()
keys = t1[0].keys()
stats = {key: [] for key in keys}
for row in t1:
    for key, val in row.items():
        stats[key].append(val)
# Stack each stat's list of per-draw values into a single array
for key, val in stats.items():
    stats[key] = np.stack(val)

And if you plot the inverse mass matrix like this:

plt.semilogy(stats["mass_matrix_inv"])
plt.ylabel("Inverse mass matrix diag")
plt.xlabel("draw")

you get

The step sizes all tend to 0:

Just to confirm: the inverse mass matrix entry that is orders of magnitude bigger than the rest corresponds to the sigma whose observations were all set exactly equal to 0:

> compiled._coords["unconstrained_parameter"]
Index(['sigma_log___0', 'sigma_log___1', 'sigma_log___2', 'sigma_log___3',
       'sigma_log___4', 'sigma_log___5', 'sigma_log___6'],
      dtype='object')

I can understand why having huge inverse mass matrix diagonal entries causes 100% divergences (any small momentum change produces a huge displacement and diverges, just like pushing a very light particle around), but I don’t understand why the mass matrix entry behaves this way for the HalfNormal in this case. It would be awesome if anyone here had some insight into the cause.
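To spell out the first part: the leapfrog position update is

$q_{t+\epsilon} = q_t + \epsilon \, M^{-1} p_{t+\epsilon/2},$

so a diagonal entry of $M^{-1}$ that is orders of magnitude too large converts even a tiny momentum into an enormous jump along that coordinate, and the Hamiltonian error blows up almost immediately.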

— Edit —

I just realized that I could get the plots working with inference data objects instead by doing this:

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=False,
    store_mass_matrix=True,
    save_warmup=True,
)

trace.warmup_sample_stats.mass_matrix_inv.isel(chain=0).plot(x="draw", hue="unconstrained_parameter")

which produces this

To add to the mass matrix strangeness, if you add a mu as a random variable like this:

import numpy as np
import nutpie
import pymc as pm
from matplotlib import pyplot as plt

n1 = 150
n2 = 7
coords = {
    "dim1": range(n1),
    "dim2": range(n2),
}
observed = np.random.normal(size=(n1, n2))
observed[:, 0] = 0
with pm.Model(coords=coords) as model:
    mu = pm.Normal("mu", dims=["dim2"])
    sigma = pm.HalfNormal("sigma", dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=mu, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )
compiled = nutpie.compile_pymc_model(model)

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=False,
    store_mass_matrix=True,
    save_warmup=True,
)

You still get many divergences (not 100%), but the problem here seems to be the mass matrix of mu for the set of observations that are equal to 0:

So in the first model, pinning the mean value to 0 makes the sampler learn a very low mass for the corresponding sigma, but when the mean is free to be learnt, the mean ends up getting a huge mass. That makes the step size go towards zero and also produces divergent samples, but it’s less prone to divergences than low masses are.
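My tentative reading of what the adaptation is doing (nutpie’s scheme differs from Stan’s in the details, so take this as a rough sketch): diagonal mass matrix adaptation essentially sets

$M^{-1}_{ii} \approx \widehat{\mathrm{Var}}(\theta_i)$

from the warmup draws of the unconstrained parameters $\theta$. In the first model $\log \sigma_0$ keeps drifting towards $-\infty$, so its estimated variance, and with it the inverse mass, blows up (a nearly massless coordinate). In the second model $\mu_0$ collapses onto an extremely narrow posterior as $\sigma_0$ shrinks, so its estimated variance goes to almost zero and that coordinate ends up with a huge mass instead.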

I don’t understand the surprise? The posterior is basically consistent with a sigma of zero, which is near-impossible to sample?

I can get PyMC NUTS to fail as well:

import numpy as np
import pymc as pm

with pm.Model() as m:
    sigma = pm.HalfNormal("sigma")
    mu = pm.Normal("mu")
    pm.Normal("y", mu, sigma, observed=np.ones(750,))
    pm.sample()

I would like to understand why the mass matrix behaves as it does so that I can build an intuition. Why, in one case, do I get 100% divergences and potentially huge inverse masses, while in a slightly different case I get some divergences, hit the max tree depth, and end up with one inverse mass that’s nearly zero?
I’m kind of surprised by this spread of potential problems and would like to be able to diagnose them in more nuanced models.

Let me try to expand on @ricardoV94’s correct answer.

I don’t know PyMC that well, but I’m guessing* this code produces the following density (replacing dim1 with M, dim2 with N, and obs with y to put it into standard Bayesian notation),

$\sigma_n \sim \text{normal}^+(0, 1)$ for $0 \leq n < N$

$y_{m,n} \sim \text{normal}(0, \sigma_n)$ for $0 \leq m < M$ and $0 \leq n < N$

In which case, you have $y_{m, 0} = 0$ for all $m$, which is consistent with $\sigma_0 = 0$, which is at the boundary of parameter space. This is almost always a problem. The positivity-constrained parameter $\sigma_0$ gets log transformed to an unconstrained $\log \sigma_0$, which then gets driven to $-\infty$. The half-normal prior is also consistent with $\sigma_0 = 0$; in fact that’s the mode of the distribution. The posterior exists in theory, but it’s going to be very hard to sample no matter what you do, especially if $M$ is large.
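A quick back-of-the-envelope calculation (mine, under the model written above) shows why $\log \sigma_0$ gets driven to $-\infty$: with all $y_{m,0} = 0$, the contribution of column 0 to the posterior is

$p(\sigma_0 \mid y) \propto \sigma_0^{-M} \exp(-\sigma_0^2 / 2),$

and on the unconstrained scale $\phi = \log \sigma_0$ (including the $+\phi$ Jacobian term) the log density is

$\log p(\phi \mid y) = (1 - M)\,\phi - \tfrac{1}{2} e^{2\phi} + \text{const},$

whose gradient $(1 - M) - e^{2\phi}$ is negative everywhere once $M > 1$. The sampler is therefore pushed towards ever smaller $\phi$, which is exactly what inflates the adapted inverse mass for that coordinate.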

It might help to put a zero-avoiding prior on $\sigma$, like a lognormal, which assigns zero density at zero (its log density goes to negative infinity there).
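In PyMC terms, that swap could look something like the following sketch of the first model (reusing coords and observed from above; the mu=0, sigma=1 hyperparameters are just placeholders, not a recommendation):

with pm.Model(coords=coords) as model_lognormal:
    # Zero-avoiding prior: the LogNormal density vanishes at sigma = 0,
    # unlike the HalfNormal, whose mode sits exactly at 0.
    sigma = pm.LogNormal("sigma", mu=0, sigma=1, dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=0, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )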


* I’m curious how PyMC does the matching—is it name or value based or something else?

Matching is done positionally, following numpy broadcasting rules.

Dims are used just as labels / to initialize the shape of the RVs: Distribution Dimensionality — PyMC 5.21.0 documentation
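As a small illustration of the positional matching (my example, not from the docs): sigma has shape (7,) from dims=["dim2"] and observed has shape (150, 7), so they line up exactly the way numpy broadcasting aligns trailing axes:

import numpy as np

# Trailing axes are aligned, so each column of `observed` gets its own
# sigma entry: column 0 pairs with sigma[0], column 1 with sigma[1], ...
print(np.broadcast_shapes((150, 7), (7,)))  # -> (150, 7)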