Metropolis: ideal balance between slow-mixing and high rejection rate?

I’m using PyMC’s Metropolis to sample from a 200-dimensional probability distribution and I want it to converge faster by setting the scale of the proposal distribution “just right”. Here’s what I did:

  1. Through lots of trial and error I got an estimate of the variance of the target distribution.
  2. I used it to derive a proposal distribution with balanced scales across dimensions (so the proposal distribution mimics the proportions of the target distribution).
  3. I calculated the radius of the typical set of a 200-dimensional normal pdf (roughly sqrt(200), which turns out to be ~14 sigmas)
  4. and used it to normalize the scale of the proposal distribution, to compensate for the curse of dimensionality pushing my drawn samples too far out, where the probability is almost zero.
  5. After some more trial and error I figured that the proposal distribution’s scale should be further reduced by a factor of 2-5 to get a rejection rate of about 40-50% (see the sketch after this list).
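
For reference, here is a minimal sketch of how a setup like steps 1-5 could be expressed with PyMC3's Metropolis. The model, the estimated variances (est_var) and the global shrink factor are placeholders for illustration, not the actual model from this post:

```python
import numpy as np
import pymc3 as pm

d = 200
est_var = np.ones(d)  # placeholder: per-dimension variances estimated by trial and error (step 1)

with pm.Model() as model:
    # stand-in for the real 200-dimensional model
    x = pm.Normal("x", mu=0.0, sd=1.0, shape=d)

    # Proposal scales proportional to the target's scales (step 2), shrunk by
    # a global factor to account for the ~sqrt(d) typical-set radius and the
    # extra ad hoc factor of 2-5 (steps 3-5).
    step = pm.Metropolis(
        S=np.sqrt(est_var),           # per-dimension proposal standard deviations
        scaling=2.4 / np.sqrt(d),     # roughly the classic 2.38/sqrt(d) rule of thumb; tune by hand
        tune=False,                   # keep the proposal fixed (no adaptation)
    )
    trace = pm.sample(20000, step=step, chains=7)
```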

Here’s a plot of the traces of 7 independent chains (projected onto the first two principal components of the whole set of samples):
[figure: metropolisMixing]

which clearly shows that convergence was not achieved, as the 7 chains have not mixed. But it also seems that the sampler is exploring the space, just not very fast.
Report:

number of samples = 20000
number of chains =  7
100%|██████████| 20000/20000 [55:56<00:00,  5.96it/s]
The gelman-rubin statistic is larger than 1.4 for some parameters.
The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
44% rejections.

I don’t know whether 44% rejections is too high or too low. I just have the feeling that there may be a way to make this converge faster.

Also, adaptive methods give weird results, so I fixed the proposal distribution and its scale.

Here are my questions:
Is there a recommended rejection rate?
What do you guys think about the sample traces? Are they acceptable (qualitatively speaking)?

PS: I ran Metropolis again with exactly the same parameters, except that I removed the ad hoc scaling factor of item 5. I get these traces:
[figure: metropolisMixing2]
Excuse that the axis scales can’t be compared between the two images. And this report:

starting metropolis
number of samples = 20000
number of chains =  7
100%|██████████| 20000/20000 [54:42<00:00,  6.09it/s]
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
75% rejections.

So now I get a much higher rejection rate, and it seems to me that the solution space was poorly explored compared to the results above.

Any specific reason you aren’t using NUTS to take care of some of this poor exploration?

Random walk Metropolis is likely inefficient no matter how you tune it, due to concentration of measure. There are a lot of resources demonstrating this, e.g. this blog post from @colcarroll https://colindcarroll.com/2018/01/01/bad-traces-or-dont-use-metropolis/ and this recent talk from @AustinRochford http://nbviewer.jupyter.org/gist/AustinRochford/ab807b6d3ca64e903f911f3dd33a7044
There are a few other posts you can find on this discourse as well.

If for some reason you cannot use NUTS (e.g., no gradient), try DEMetropolis by @michaelosthege. It handles correlation in parameter space much better.
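
For concreteness, a minimal sketch of that swap, assuming a PyMC3 version that ships pm.DEMetropolis and reusing the placeholder model from the earlier sketch; the draw and chain counts are only illustrative:

```python
import pymc3 as pm

with model:  # the 200-dimensional model (placeholder from the earlier sketch)
    # DEMetropolis builds proposals from differences between other chains,
    # so it adapts to correlations in parameter space; it needs a population
    # of chains to work with.
    step = pm.DEMetropolis()
    trace = pm.sample(20000, step=step, chains=32)
```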

In this case you are in pretty much the worst situation for Metropolis: you get poor mixing regardless of the acceptance rate. A high acceptance rate gives a highly correlated trace, and with a low acceptance rate the trace will look stuck.

I can’t provide a gradient. I can’t recall exactly what it was, but there was a problem with NUTS and the way the project was written at that time. The code has seen some changes and things run more smoothly now, but there is still no gradient.

10⁵ samples in 8 chains with Metropolis results in these far-from-converged traces (again with a constant, ad hoc proposal distribution):
[figure: metropolisFail10e5Figure_1]

I got DEMetropolis running for a few trials and it looks promising. I will compare it with Metropolis with tuning enabled.

This whole thing reveals how much I don’t know, so I’m back to reading Gelman’s book and also the DEMetropolis paper.


I would suggest you give SMC a shot. For a 200d problem I would suggest using at least 10000 chains.
If you have a computer with many cores this will still take a while, but you will surely get somewhere, simply because of the resampling and the annealing. It will certainly be better than Metropolis.
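
For orientation only, a rough sketch of an SMC run; the SMC interface has changed between PyMC3 versions, so treat pm.sample_smc and its arguments as an assumption to check against your installed version (draws plays the role of the number of particles/chains here):

```python
import pymc3 as pm

with model:  # the 200-dimensional model (placeholder from the earlier sketch)
    # Sequential Monte Carlo: a population of particles is annealed from the
    # prior to the posterior, with resampling between tempering stages.
    trace = pm.sample_smc(draws=10000)
```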

Reading the SMC docs here. So I need to evaluate the likelihood; is it enough to know the likelihood up to a constant factor?

Also, the suggestion to use many chains made me realise there is a cores argument for the sampler. I thought that DEMetropolis would split the chains across the cores so that there would be information interchange within each core; is this correct? I also thought that this would load up cores physical CPU cores to 100%, but it turns out that even with cores=1 all the physical cores are loaded to full capacity. This leaves me confused: is DEM deriving a chain’s jumps from all the remaining chains, or just from the ones assigned to the corresponding core (assuming the chains are split across cores)?

Another thing: so far I had my 200 variables defined inside the model as two separate random vectors (for reasons that I can’t recall now; it had to do with something I encountered when I first defined the model). When I look at the traces returned by Metropolis it appears that the rejections/acceptances don’t coincide; in other words, at a given iteration the algorithm might reject the proposal for one variable but accept the proposal for the other. Does this mean that Metropolis does an iteration on one of the variables while leaving the other one fixed? That doesn’t make sense to me.

I have now defined the model so that all my variables are inside one long vector. Now DEM gives me some hope that things are working better and convergence is not just an unachievable dream. Take a look at this:

[figures: DEMtraceXY, DEMtraceX]
(chains = 32, cores = 1)

So, no convergence for this short run, but at least it’s going somewhere! Traces of each chain are: first, not stuck (yay!); second, bunching together (not too much mixing though); third, slowly expanding and drifting.

Thanks @junpenglao, @hvasbath, @narendramukherjee for the feedback, this forum/community is of great help to me. I’ll try SMC as the next option if this leads to another dead end.


Don’t worry about the model likelihood. It is automatically added in the new version of SMC; the docs are a little outdated in this respect. As for the CPU load, could it be that your numpy uses internal parallelization?
No, that is the typical behavior of this Metropolis implementation: it updates single variables rather than all of them in each step. This is actually an advantage in your case, since you are dealing with such a high-dimensional problem: it increases the likelihood of accepting a step, because you make shorter steps in the n-dimensional space. But if you don’t like this behavior you can disable it by passing the blocked=True argument when instantiating Metropolis (see the sketch below).
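
To make the difference concrete, here is a small sketch (PyMC3 assumed; the toy variables are made up for illustration) contrasting the default per-variable compound step with a single blocked step over both vectors:

```python
import pymc3 as pm

with pm.Model() as toy_model:
    a = pm.Normal("a", mu=0.0, sd=1.0, shape=100)
    b = pm.Normal("b", mu=0.0, sd=1.0, shape=100)

    # Default for Metropolis (blocked=False): one Metropolis step per free
    # variable, combined into a CompoundStep, so `a` and `b` get separate
    # accept/reject decisions at each iteration.
    step_per_variable = pm.Metropolis([a, b])

    # blocked=True: a single step proposes and accepts/rejects both
    # variables jointly.
    step_joint = pm.Metropolis([a, b], blocked=True)
```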

Great! I’ll check it out.

My numerical implementation of the up-to-a-constant-factor probability is not parallelized. And I’m pretty sure numpy is not doing any multithreading on its own (as far as I know).

If your numpy was compiled locally against, for example, OpenBLAS, it will be internally parallel. This is most noticeable in dot products or similar operations.
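
One quick way to check this is to cap the BLAS thread count through environment variables before numpy is imported, and then watch the CPU load during a large matrix product; the sketch below assumes an OpenBLAS or OpenMP-based backend:

```python
import os

# These must be set before numpy is imported anywhere in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS builds
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP-based backends (e.g. MKL)

import numpy as np

np.__config__.show()        # prints which BLAS/LAPACK numpy was built against
a = np.random.rand(3000, 3000)
b = a @ a                   # with the caps above, this product should stay on one core
```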