`HalfNormal` scale of normal observations can cause 100% divergences and strange mass matrices

In a big model that I’m working with, I started to run into 100% divergences when I added some observations. It took me a long while to identify the problem, and in the process I came across something that I found very confusing and, at least in my opinion, worth sharing here.

Consider this simple model that I’ll sample using nutpie directly:

import numpy as np
import nutpie
import pymc as pm
from matplotlib import pyplot as plt

n1 = 150
n2 = 7
coords = {
    "dim1": range(n1),
    "dim2": range(n2),
}
# Synthetic observations; the first column is set exactly to 0,
# which is what triggers the divergences.
observed = np.random.normal(size=(n1, n2))
observed[:, 0] = 0
with pm.Model(coords=coords) as model:
    sigma = pm.HalfNormal("sigma", dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=0, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )
compiled = nutpie.compile_pymc_model(model)

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=True,
    store_mass_matrix=True,
)

This model has a single random variable, sigma, with shape dim2. The observations for the first entry of dim2 are all set to 0. Doing this causes 100% divergent samples.
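For anyone reproducing this, a quick way to confirm the divergence rate is to count the diverging flags in the sample stats. This is just a sketch: it assumes the default InferenceData return (return_raw_trace=False, as in the edit further down) and the usual ArviZ naming of the stat:

# Sketch: count divergences per chain from an InferenceData return
# (nutpie.sample called without return_raw_trace=True).
idata = nutpie.sample(compiled, tune=500, draws=100, chains=4, seed=1234)
print(idata.sample_stats.diverging.sum(dim="draw").values)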

With nutpie’s raw trace, I can grab a bunch of sample stats like the inverse mass matrix or the gradients.

# Collect the per-draw sampler stats from the raw trace of the first chain
t1 = trace[0][1].to_pandas()
keys = t1[0].keys()
stats = {key: [] for key in keys}
for row in t1:
    for key, val in row.items():
        stats[key].append(val)
# Stack each stat's list of per-draw values into a single array
for key, val in stats.items():
    stats[key] = np.stack(val)

And if you plot the inverse mass matrix like this:

plt.semilogy(stats["mass_matrix_inv"])
plt.ylabel("Inverse mass matrix diag")
plt.xlabel("draw")

you get

The step sizes all tend to 0:

Just to confirm: the inverse mass matrix entry that is orders of magnitude bigger than the rest corresponds to the sigma whose observations were all set exactly equal to 0:

> compiled._coords["unconstrained_parameter"]
Index(['sigma_log___0', 'sigma_log___1', 'sigma_log___2', 'sigma_log___3',
       'sigma_log___4', 'sigma_log___5', 'sigma_log___6'],
      dtype='object')

I can understand why having huge inverse mass matrix diagonal entries causes 100% divergences (any small momentum change produces a huge displacement and diverges, just like pushing a very light particle around), but I don’t understand why the mass matrix entry behaves this way for the HalfNormal in this case. It would be awesome if anyone here had some insight into the cause.
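To spell out the first part: the leapfrog position update is

$q_{t+\epsilon} = q_t + \epsilon \, M^{-1} p_{t+\epsilon/2},$

so a diagonal entry of $M^{-1}$ that is orders of magnitude too large converts even a tiny momentum into an enormous jump along that coordinate, and the Hamiltonian error blows up almost immediately.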

— Edit —

I just realized that I could get the plots working with inference data objects instead by doing this:

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=False,
    store_mass_matrix=True,
    save_warmup=True,
)

trace.warmup_sample_stats.mass_matrix_inv.isel(chain=0).plot(x="draw", hue="unconstrained_parameter")

which produces this

To add to the mass matrix strangeness, if you add a mu as a random variable like this:

import numpy as np
import nutpie
import pymc as pm
from matplotlib import pyplot as plt

n1 = 150
n2 = 7
coords = {
    "dim1": range(n1),
    "dim2": range(n2),
}
observed = np.random.normal(size=(n1, n2))
observed[:, 0] = 0
with pm.Model(coords=coords) as model:
    mu = pm.Normal("mu", dims=["dim2"])
    sigma = pm.HalfNormal("sigma", dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=mu, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )
compiled = nutpie.compile_pymc_model(model)

trace = nutpie.sample(
    compiled,
    tune=500,
    draws=100,
    chains=4,
    seed=1234,
    return_raw_trace=False,
    store_mass_matrix=True,
    save_warmup=True,
)

You still get many divergences (not 100%), but the problem here seems to be the mass matrix of mu for the set of observations that are equal to 0:

So in the first model, pinning the mean value to 0 makes the sampler learn a very low mass for the corresponding sigma, but when the mean is free to be learnt, the mean ends up getting a huge mass. That makes the step size go towards zero and also produces divergent samples, but it’s less prone to divergences than low masses are.
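My tentative reading of what the adaptation is doing (nutpie’s scheme differs from Stan’s in the details, so take this as a rough sketch): diagonal mass matrix adaptation essentially sets

$M^{-1}_{ii} \approx \widehat{\mathrm{Var}}(\theta_i)$

from the warmup draws of the unconstrained parameters $\theta$. In the first model $\log \sigma_0$ keeps drifting towards $-\infty$, so its estimated variance, and with it the inverse mass, blows up (a nearly massless coordinate). In the second model $\mu_0$ collapses onto an extremely narrow posterior as $\sigma_0$ shrinks, so its estimated variance goes to almost zero and that coordinate ends up with a huge mass instead.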

I don’t understand the surprise? The posterior is basically consistent with a sigma of zero, which is near-impossible to sample?

I can get PyMC NUTS to fail as well:

import numpy as np
import pymc as pm

with pm.Model() as m:
    sigma = pm.HalfNormal("sigma")
    mu = pm.Normal("mu")
    pm.Normal("y", mu, sigma, observed=np.ones(750,))
    pm.sample()

I would like to understand why the mass matrix behaves as it does so that I can build an intuition. Why, in one case, do I get 100% divergences and potentially huge inverse masses, while in a slightly different case I get some divergences, hit the max tree depth, and end up with one inverse mass that’s nearly zero?
I’m kind of surprised by this spread of potential problems and would like to be able to diagnose them in more nuanced models.

Let me try to expand on @ricardoV94’s correct answer.

I don’t know PyMC that well, but I’m guessing* this code produces the following density (replacing dim1 with M, dim2 with N, and obs with y to put it into standard Bayesian notation),

$\sigma_n \sim \text{normal}^+(0, 1)$ for $0 \leq n < N$

$y_{m,n} \sim \text{normal}(0, \sigma_n)$ for $0 \leq m < M$ and $0 \leq n < N$

In which case, you have $y_{m, 0} = 0$ for all $m$, which is consistent with $\sigma_0 = 0$, which is at the boundary of parameter space. This is almost always a problem. The positivity-constrained parameter $\sigma_0$ gets log transformed to an unconstrained $\log \sigma_0$, which then gets driven to $-\infty$. The half-normal prior is also consistent with $\sigma_0 = 0$; in fact that’s the mode of the distribution. The posterior exists in theory, but it’s going to be very hard to sample no matter what you do, especially if $M$ is large.
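A quick back-of-the-envelope calculation (mine, under the model written above) shows why $\log \sigma_0$ gets driven to $-\infty$: with all $y_{m,0} = 0$, the contribution of column 0 to the posterior is

$p(\sigma_0 \mid y) \propto \sigma_0^{-M} \exp(-\sigma_0^2 / 2),$

and on the unconstrained scale $\phi = \log \sigma_0$ (including the $+\phi$ Jacobian term) the log density is

$\log p(\phi \mid y) = (1 - M)\,\phi - \tfrac{1}{2} e^{2\phi} + \text{const},$

whose gradient $(1 - M) - e^{2\phi}$ is negative everywhere once $M > 1$. The sampler is therefore pushed towards ever smaller $\phi$, which is exactly what inflates the adapted inverse mass for that coordinate.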

It might help to put a zero-avoiding prior on $\sigma$, like a lognormal, which assigns zero density at zero (its log density goes to negative infinity there).
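In PyMC terms, that swap could look something like the following sketch of the first model (reusing coords and observed from above; the mu=0, sigma=1 hyperparameters are just placeholders, not a recommendation):

with pm.Model(coords=coords) as model_lognormal:
    # Zero-avoiding prior: the LogNormal density vanishes at sigma = 0,
    # unlike the HalfNormal, whose mode sits exactly at 0.
    sigma = pm.LogNormal("sigma", mu=0, sigma=1, dims=["dim2"])
    obs = pm.Normal(
        "obs", mu=0, sigma=sigma, dims=["dim1", "dim2"], observed=observed
    )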


* I’m curious how PyMC does the matching—is it name or value based or something else?

Matching is done positionally, following numpy broadcasting rules.

Dims are used just as labels / to initialize the shape of the RVs: Distribution Dimensionality — PyMC 5.21.0 documentation
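As a small illustration of the positional matching (my example, not from the docs): sigma has shape (7,) from dims=["dim2"] and observed has shape (150, 7), so they line up exactly the way numpy broadcasting aligns trailing axes:

import numpy as np

# Trailing axes are aligned, so each column of `observed` gets its own
# sigma entry: column 0 pairs with sigma[0], column 1 with sigma[1], ...
print(np.broadcast_shapes((150, 7), (7,)))  # -> (150, 7)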