Don't Use Metropolis

Just wrote a new short post on why you should not use the Metropolis sampler. I compare a model sampled with the No-U-Turn Sampler (NUTS) to the same model sampled with Metropolis, and show how you might notice that the sampler did a bad job.

https://colindcarroll.com/2018/01/01/bad-traces-or-dont-use-metropolis/
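To make the failure mode concrete here in the thread (this is a hypothetical illustration in plain NumPy, not the code from the linked post): random-walk Metropolis with an isotropic proposal struggles on a strongly correlated Gaussian, and the lag-1 autocorrelation of the trace makes the "bad trace" visible numerically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Strongly correlated 2-D Gaussian target -- the kind of geometry
# where random-walk Metropolis produces bad traces.
cov = np.array([[1.0, 0.99], [0.99, 1.0]])
prec = np.linalg.inv(cov)

def logp(x):
    """Unnormalized log density of the correlated Gaussian."""
    return -0.5 * x @ prec @ x

def metropolis(n_samples, step=0.3):
    """Plain random-walk Metropolis with an isotropic Gaussian proposal."""
    x = np.zeros(2)
    trace = np.empty((n_samples, 2))
    accepted = 0
    for i in range(n_samples):
        proposal = x + step * rng.standard_normal(2)
        if np.log(rng.uniform()) < logp(proposal) - logp(x):
            x = proposal
            accepted += 1
        trace[i] = x
    return trace, accepted / n_samples

trace, acc_rate = metropolis(5000)

# Lag-1 autocorrelation of the first coordinate: values near 1 mean the
# chain barely moves between draws -- successive samples are nearly copies.
x0 = trace[:, 0] - trace[:, 0].mean()
lag1 = (x0[:-1] @ x0[1:]) / (x0 @ x0)
print(f"acceptance rate: {acc_rate:.2f}, lag-1 autocorrelation: {lag1:.2f}")
```

A gradient-based sampler like NUTS moves along the correlated ridge instead of proposing blindly, so its draws decorrelate far faster on the same target.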


Great post; a couple of comments below. I’m not an expert, but IIRC everyone agrees that NUTS and HMC beat Metropolis, at least for continuous variables. To me, the important follow-up questions are: 1. What about discrete and/or mixed models? 2. How much better is NUTS than HMC?

Yep, I think those are the more interesting questions. I see a lot of questions in issues and on Stack Overflow where users manually assign Metropolis as a step method, and this post was mostly meant as evidence that you should not do that unless you have a good reason to. One of my favorite textbook series is “Counterexamples in X”, and this was an attempt at a concrete example of where Metropolis fails.

I think your two questions would be particularly interesting if you could find examples that were surprising:

  1. PyMC3 will automatically use NUTS for anything continuous and Metropolis for anything discrete, and I do not know of any situation where using Metropolis for all the variables would do better.

  2. You would only use HMC if you wanted to tune a few extra parameters by hand. I would be interested to see a situation in which hand-tuned HMC performs well enough to justify the hand tuning.
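For anyone following along, the parameters you would tune by hand are the leapfrog step size and the number of leapfrog steps, which NUTS chooses automatically. A minimal sketch of that kind of hand-tuned HMC (hypothetical NumPy code, not PyMC3’s implementation) on a standard normal target:

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(x):
    """Standard normal target: logp = -x**2 / 2 (up to a constant)."""
    return -0.5 * x @ x

def grad_logp(x):
    return -x

def hmc_step(x, step_size=0.2, n_leapfrog=20):
    """One HMC transition: the two keyword arguments are the hand-tuned knobs."""
    p = rng.standard_normal(x.shape)              # resample momentum
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step_size * grad_logp(x_new)   # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step_size * p_new                # full step for position
        p_new += step_size * grad_logp(x_new)     # full step for momentum
    x_new += step_size * p_new
    p_new += 0.5 * step_size * grad_logp(x_new)   # final half step
    # Metropolis accept/reject on the joint (position, momentum) energy
    h_old = -logp(x) + 0.5 * p @ p
    h_new = -logp(x_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < h_old - h_new:
        return x_new, True
    return x, False

x = np.zeros(2)
draws, n_accept = [], 0
for _ in range(2000):
    x, accepted = hmc_step(x)
    n_accept += accepted
    draws.append(x)
draws = np.array(draws)
print(f"acceptance: {n_accept / 2000:.2f}, sample mean: {draws.mean(axis=0)}")
```

On a harder target, a badly chosen step size makes the integrator unstable (acceptance collapses) and a badly chosen trajectory length wastes gradient evaluations, which is exactly the tuning burden NUTS removes.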

We might also want to revisit the comparison between HMC and NUTS when #2677 is merged. Right now HMC doesn’t have mass-matrix adaptation or dual averaging for the step size. From some profiling, it looks like in some cases there is noticeable overhead in all the bookkeeping NUTS requires, and that wouldn’t be a problem for HMC.

Your comment matches my own experience in molecular simulation, where the improved sampling of NUTS-like methods just never seemed to compensate for their greater expense. Of course, the log-likelihoods are quite different, as are the relative costs of likelihood and gradient calculations.

A bit unrelated, but I just tried version 3.9 and noticed that the chains look very noisy in a model with Metropolis(); the exact same model run with pymc3 version 3.8 gave much ‘smoother’ chains.
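Worth noting that for Metropolis, a “smoother” trace is often a *worse* one: the chain is rejecting proposals and repeating values, which shows up as high autocorrelation rather than good mixing. One way to check which trace actually mixes better is a crude integrated autocorrelation time estimate (a hypothetical sketch, with synthetic chains standing in for the two versions’ output):

```python
import numpy as np

def autocorr_time(chain, max_lag=200):
    """Crude integrated autocorrelation time: larger means worse mixing."""
    x = chain - chain.mean()
    var = x @ x / len(x)
    tau = 1.0
    for lag in range(1, max_lag):
        rho = (x[:-lag] @ x[lag:]) / (len(x) * var)
        if rho < 0.05:            # truncate once correlation is negligible
            break
        tau += 2.0 * rho
    return tau

rng = np.random.default_rng(2)

# A "noisy" chain: i.i.d. draws, the ideal case.
fast = rng.standard_normal(5000)

# A "smooth" chain: sticky AR(1) process mimicking a chain that rarely moves.
slow = np.empty(5000)
slow[0] = 0.0
for i in range(1, 5000):
    slow[i] = 0.95 * slow[i - 1] + np.sqrt(1 - 0.95**2) * rng.standard_normal()

print(f"noisy chain tau:  {autocorr_time(fast):.1f}")
print(f"smooth chain tau: {autocorr_time(slow):.1f}")
```

If the 3.9 traces have a *lower* autocorrelation time than the 3.8 ones despite looking noisier, the new behavior may actually be an improvement; if not, it could be worth opening an issue with the model.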