Why is sampling sometimes slow?

When I sample the posterior of my rather complex model (via NUTS), I see strange behavior: typically one chain samples quite slowly, while the other three sample quickly, at the speed I have come to expect.

Why is this occurring? Is the slow chain stuck in a region with difficult geometry? If so, how can I discover where the geometry is difficult, and perhaps refactor the model?

When this happens to me, it’s because the model is not fitting well. I see that you have at least one divergence in the bottom chain, and the bottom three chains are sampling very slowly, too. I don’t know whether that is actually too slow, since I don’t know the complexity of your model or the size of your dataset, but two draws per second seems slow to me.

So, why is it occurring? The model may not fit the data well.

Every model is different, so it’s hard to give general advice. Since there are differences across chains, it could be that one chain is being initialized in a region of parameter space that’s problematic for some reason. That can be remedied by using more informative priors to rule out “bad” (“anti-physical”) solutions.

My step 1 would be to do some prior predictive simulations, and make sure that the model’s initial state isn’t producing absolute garbage.
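
If it helps, a minimal prior predictive check in PyMC looks roughly like this; the model below is just a stand-in for your actual model, and the variable names and data are made up:

import arviz as az
import pymc as pm

with pm.Model() as model:
    # Placeholder model: swap in your own priors and likelihood.
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    y = pm.Normal("y", mu=mu, sigma=sigma, observed=[1.2, 0.4, -0.3])

    # Simulate from the priors only; no MCMC involved.
    prior = pm.sample_prior_predictive()

# Check that the simulated data are on a physically sensible scale.
print(az.summary(prior.prior_predictive))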

Beyond that, there are the usual “geometry” issues, related to the identifiability of parameters, often of combinations of parameters. This is a much more difficult issue that definitely doesn’t have a one-line suggestion.

I agree with @Nomadiq, who seems to be hinting at Andrew Gelman’s “folk theorem of statistical computing”: When you have computational problems, often there’s a problem with your model.

Thanks. Will start by re-examining prior predictives.

This behavior is usually caused by one of the chains finding a region of high curvature during warmup and then setting a very small step size in order to sample that region. That in turn requires a very large number of leapfrog steps to hit a U-turn in lower-curvature areas of the posterior. If you remove the slow chain(s), you induce bias in the estimates, pulling them away from the regions of high curvature. Whether you can live with that amount of bias is another question, but it’s hard to estimate how much bias there is without sampling.
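
One quick diagnostic, assuming you sampled with PyMC and kept the returned InferenceData (called idata here), is to compare the adapted step size and divergence count per chain; a chain that settled on a much smaller step size will take far more leapfrog steps per draw. The stat names "step_size" and "diverging" are what recent PyMC/ArviZ versions expose, so treat this as a sketch:

# idata is the InferenceData returned by pm.sample()
step_size = idata.sample_stats["step_size"]   # dims: (chain, draw)
diverging = idata.sample_stats["diverging"]   # dims: (chain, draw)

for chain in step_size.chain.values:
    median_step = float(step_size.sel(chain=chain).median())
    n_div = int(diverging.sel(chain=chain).sum())
    print(f"chain {chain}: median step size = {median_step:.4g}, divergences = {n_div}")

# A chain whose step size is much smaller than the others' is the one
# paying for a high-curvature region it found during warmup.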

You can develop intuition with a multivariate normal that has a different scale for each variable and no correlation. The sampler has to set a small step size to deal with the smallest scale and then take a number of steps proportional to the largest scale divided by the smallest scale to traverse the largest-scale parameter. This is usually formalized through the notion of the condition number, which is the ratio of the largest eigenvalue to the smallest eigenvalue of the Hessian of the negative log density. In the multivariate normal example, this is the ratio of the largest scale to the smallest scale, squared. The leapfrog algorithm requires step sizes smaller than 2 / \sqrt{\lambda^\textrm{max}}, where \lambda^\textrm{max} is the largest eigenvalue, though that’s right at the boundary of stability for the algorithm, and step sizes about half that maximum stable value tend to work better. These estimates are the basis for the step size tuning in Matt Hoffman and Pavel Sountsov’s parallel samplers.
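
To make the arithmetic concrete, here is a small NumPy sketch (the scales are made-up examples) of the axis-aligned case: the Hessian of the negative log density of a normal with covariance diag(sigma^2) is diag(1 / sigma^2), so the condition number is (sigma_max / sigma_min)^2 and the maximum stable leapfrog step size is 2 * sigma_min.

import numpy as np

sigmas = np.array([0.01, 0.1, 1.0, 10.0])   # per-dimension scales (example values)

# Eigenvalues of the Hessian of the negative log density of normal(0, diag(sigmas**2)).
eigenvalues = 1.0 / sigmas**2
lambda_max = eigenvalues.max()
lambda_min = eigenvalues.min()

condition_number = lambda_max / lambda_min     # = (sigma_max / sigma_min)**2
max_stable_step = 2.0 / np.sqrt(lambda_max)    # = 2 * sigma_min

print(f"condition number: {condition_number:.0f}")            # ~1,000,000
print(f"max stable leapfrog step size: {max_stable_step:.3g}")  # ~0.02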


@bob-carpenter: If I understand your explanation—and I may not—you imply that differences in scale for RVs should be avoided. For example, the following model may give sampling problems:

import pymc as pm

with pm.Model():
    x = pm.Normal("x", mu=0, sigma=1.0)
    y = pm.Normal("y", mu=1_000_000, sigma=1_000_000_000)

Is that correct? Is there a need to noncenter even when there is nothing hierarchical?

It’s more of a problem when the bad conditioning comes from correlation than simply from varying scales. Consider these two ways to get thin cigars out of 2D normals with bad conditioning:

  1. A highly correlated 2D normal, e.g., normal(0, [[1, 0.99], [0.99, 1]]).
  2. A differently scaled 2D normal, e.g., normal(0, [[4, 0], [0, 0.01]]).

Both give you long narrow regions with mismatched scales. The axis-aligned version (2) is easier to adjust for with a diagonal mass matrix, which is what PyMC is going to use by default. So (2) shouldn’t be a problem, but (1) will be.
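
A quick way to see the difference is to look at what a diagonal mass matrix can and cannot fix. Rescaling each coordinate by its marginal standard deviation, which is roughly what diagonal mass-matrix adaptation does, turns case (2) into an identity covariance but leaves case (1) badly conditioned. A rough NumPy sketch:

import numpy as np

corr = np.array([[1.0, 0.99], [0.99, 1.0]])   # case (1): correlated
scaled = np.array([[4.0, 0.0], [0.0, 0.01]])  # case (2): different scales

def condition_after_diag_rescaling(cov):
    # Rescale each coordinate by its marginal standard deviation,
    # roughly what a diagonal mass matrix does. The condition number
    # of the covariance equals that of the precision (Hessian).
    d = np.sqrt(np.diag(cov))
    rescaled = cov / np.outer(d, d)
    return np.linalg.cond(cov), np.linalg.cond(rescaled)

print(condition_after_diag_rescaling(corr))    # ~(199, 199): still badly conditioned
print(condition_after_diag_rescaling(scaled))  # ~(400, 1): fixed by rescaling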

It’s mainly (1) that you are adjusting with non-centered parameterizations.
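
For completeness, here is a minimal sketch of what a non-centered parameterization looks like in PyMC for a toy hierarchical term (the names and shapes are illustrative, not from the thread):

import pymc as pm

with pm.Model() as noncentered:
    mu = pm.Normal("mu", 0.0, 1.0)
    tau = pm.HalfNormal("tau", 1.0)

    # Centered version would be: theta = pm.Normal("theta", mu=mu, sigma=tau, shape=8)
    # Non-centered: sample standardized offsets, then shift and scale.
    theta_raw = pm.Normal("theta_raw", 0.0, 1.0, shape=8)
    theta = pm.Deterministic("theta", mu + tau * theta_raw)

Sampling theta_raw instead of theta removes the prior correlation between theta and tau, which is exactly the cigar-shaped geometry of case (1).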
