Don't Use Metropolis

Just wrote a new short post on why you should not use the Metropolis sampler. I compare a model sampled with the No U-Turn sampler to the same model sampled with Metropolis, and show how you might notice the sampler did a bad job.

https://colindcarroll.com/2018/01/01/bad-traces-or-dont-use-metropolis/
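
For anyone who wants to try this themselves, the comparison in the post boils down to something like the sketch below (the toy model here is a stand-in for illustration, not the model from the post):

```python
import arviz as az
import pymc3 as pm

# Toy stand-in model; the model in the post is more interesting, but any
# continuous model shows the same qualitative difference.
with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("obs", mu, 1.0, observed=[0.1, -0.3, 0.7])

    nuts_trace = pm.sample()                     # NUTS assigned automatically
    met_trace = pm.sample(step=pm.Metropolis())  # force Metropolis

# Poor mixing shows up as slowly wandering, highly autocorrelated traces.
az.plot_trace(met_trace)
az.plot_trace(nuts_trace)
```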


Great post! A couple of comments below. I’m not an expert, but as far as I can tell everyone agrees that NUTS and HMC beat Metropolis, at least in the case of continuous variables. To me, the important follow-up questions are: 1. What about discrete and/or mixed models? 2. How much better is NUTS than HMC?

Yep, I think those are more interesting questions. I see a lot of questions in issues and on Stack Overflow where users manually assign Metropolis as a step method, and this post was mostly meant as evidence that you should not do that unless you have a great reason to. One of my favorite textbook series is “Counterexamples in X”, and this was an attempt to give a concrete example of where Metropolis fails.

I think your two questions would be particularly interesting if you could find examples that were surprising:

  1. PyMC3 will automatically use NUTS for anything continuous and Metropolis for anything discrete, and I do not know of any situation where using Metropolis for all the variables would do better (both assignments are sketched after this list).

  2. You would only use HMC if you wanted to tune a few extra sampler parameters by hand. I would be interested to see a situation in which hand-tuned HMC does enough better to justify the hand tuning.
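
To make (1) concrete, here is a minimal sketch of both assignments with a toy mixed model (the model and values are made up for illustration):

```python
import pymc3 as pm

# Toy mixed model: one discrete and one continuous variable.
with pm.Model() as model:
    k = pm.Bernoulli("k", 0.5)    # discrete
    x = pm.Normal("x", 0.0, 1.0)  # continuous

    # Default behaviour: NUTS is assigned to x, a Metropolis variant to k.
    auto_trace = pm.sample()

    # The pattern to avoid: forcing Metropolis onto everything,
    # including the continuous variable.
    manual_trace = pm.sample(step=pm.Metropolis([k, x]))
```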

We might also want to revisit the comparison between HMC and NUTS when #2677 is merged. Right now HMC doesn’t have mass matrix adaptation or dual averaging for the step size. From some profiling, it looks like in some cases there is noticeable overhead from all the bookkeeping NUTS requires, which wouldn’t be a problem for HMC.
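
For anyone following along, “hand tuning” HMC means something like the following sketch (the parameter values are placeholders, not recommendations, and `model` is whatever model you are sampling):

```python
import pymc3 as pm

with model:
    # Plain HMC: the step size scale and trajectory length are fixed by
    # hand (these values are made up for illustration).
    step = pm.HamiltonianMC(path_length=2.0, step_scale=0.25)
    hmc_trace = pm.sample(step=step)

    # NUTS: chooses the trajectory length per iteration and, during
    # tuning, adapts the step size and mass matrix for you.
    nuts_trace = pm.sample()
```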

Your comment matches my own experience in molecular simulation, where the improved sampling of NUTS-like methods just never seemed to compensate for their greater expense. Of course, the log-likelihoods are quite different, as are the relative costs of likelihood and gradient calculations.

A bit unrelated, but I just tried version 3.9 and noticed that the chains look very noisy in a model using Metropolis(); the exact same model sampled with pymc3 3.8 gave much ‘smoother’ chains.

To be fair, our implementation of Metropolis was unnecessarily bad for batched distributions such as the state_rate in that model. I wonder how Metropolis would fare after the change in https://github.com/pymc-devs/pymc/pull/5823

Ok, you got me interested, and I got the notebook running again, but you’ll have to trust me on the results. I ran this on Colab, using pymc 4.0.1; running on Colab means the sampling was done sequentially.
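
For reference, the rerun boils down to something like this (a sketch: `model` is assumed to be the one from the original notebook, and the draws/chains are inferred from the 40k total mentioned below):

```python
import pymc as pm  # 4.0.1

with model:
    # cores=1 because of Colab, so the 4 chains run sequentially.
    met_idata = pm.sample(draws=10_000, chains=4, cores=1,
                          step=pm.Metropolis())
    nuts_idata = pm.sample(draws=10_000, chains=4, cores=1)
```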

Timing

This is the median time per chain, as reported by pymc:

  • Metropolis: 1:02 / chain
  • NUTS: 1:02 / chain

ESS

Cast to int for readability. Note that pymc now uses an updated ESS calculation, so the ESS can be greater than the 40k total draws here (see the sketch after the results).

  • Metropolis:
{'pooled_rate': array(1389),
 'state_rate': array([3586, 1681, 1761,  659, 1673, 2489, 3077, 4167,  890, 1029, 3397,
        2259, 1145, 1633, 2183, 1945, 1591, 1371, 2564, 1779, 2071, 1168,
        1571, 1514, 1194, 2385, 2533, 2251, 3503, 1544, 1843,  955, 1252,
        3506, 1097, 1731, 1945,  992, 3119, 1115, 2939, 1133,  792, 2446,
        4205, 1501, 1647, 1992, 1449, 3325])}
  • NUTS:
{'pooled_rate': array(79228),
 'state_rate': array([104168, 101727, 105899, 105546,  97935, 106397,  97709,  98964,
         92776, 101145, 101708,  89374, 101525,  97536,  99377,  99805,
        100772,  97263,  90646,  94253, 103632, 104125,  99385,  99157,
         99894, 100388,  98847, 101918,  96794, 100890,  92974, 104452,
         94207,  96887, 104427, 104683,  89838, 104355,  98145, 104004,
        103054,  94856,  95218, 104233,  97423, 103515,  98367,  97032,
        102729, 105320])}
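
These numbers come from something like the following (a sketch, reusing `met_idata` and `nuts_idata` from the sampling sketch above):

```python
import arviz as az

# ArviZ's updated (rank-normalized) ESS can exceed the raw number of
# draws for antithetic chains, which is why NUTS comes in above 40k.
for name, idata in [("Metropolis", met_idata), ("NUTS", nuts_idata)]:
    ess = az.ess(idata)
    print(name, {v: ess[v].values.astype(int) for v in ess.data_vars})
```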

As a side note, Metropolis sampling emits this warning:

/usr/local/lib/python3.7/dist-packages/pymc/step_methods/metropolis.py:289: RuntimeWarning: overflow encountered in exp
  "accept": np.mean(np.exp(self.accept_rate_iter)),

Totally convincing! Is Metropolis doing better than before, though?

Yes: wall time is slower, but that’s comparing sequential Colab execution to (I think!) a 2016 MacBook Pro, and the ESS/s is also better (a rough calculation follows the old numbers below). Here are the old Metropolis results:

{'pooled_rate': 3.0,
     'state_rate': array([ 13.,   2.,   6.,   5.,  29.,   8.,   7.,   6.,   2.,  45.,  48.,
             12.,   6.,  28.,   7.,   2.,   2.,   6.,  17.,   2.,   4.,  17.,
             10.,  11.,  12.,  13.,   2.,  12.,   2.,   7.,   5.,   4.,  35.,
             31.,   2.,  21.,   5.,   2.,  25.,   2.,   9.,   2.,  26.,  24.,
              2.,   2.,   2.,   8.,   2.,  10.,   2.])}

The old NUTS results were 4,000 for everything (but I would guess that ran into the old ESS implementation, which didn’t allow ESS to exceed the number of samples, so who knows).
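
For concreteness, here is the back-of-the-envelope ESS/s for the new run (assuming 4 sequential chains at ~62 s each; the old run’s wall time isn’t recorded here, so the old numbers can’t be put on the same footing):

```python
# Rough ESS-per-second for the new run: 4 chains * ~62 s each,
# run sequentially, gives ~248 s of sampling per method.
total_seconds = 4 * 62

print(1389 / total_seconds)   # Metropolis pooled_rate: ~5.6 ESS/s
print(79228 / total_seconds)  # NUTS pooled_rate:       ~319 ESS/s
```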
