I was wondering why the Metropolis routine implemented in PyMC3 does not have adaptive schemes. (I know Metropolis is not the recommended default sampler, but I am using it because I have an external model with 4 parameters that I wrap in a Theano as_op, which has no gradient.)
I came across a few references (DRAM: Efficient adaptive MCMC) and a MATLAB toolbox that uses such schemes and performs really well in my cases (it converges in a few dozen steps, and 2000-5000 samples are then enough), whereas PyMC3 fails me (the estimated parameters are barely within the 2 sigma range after hundreds of thousands of samples).
As the author of the aforementioned toolbox puts it, “the covariance matrix of the proposal distribution can be adapted during the simulation according to adaptive schemes described in the references”, and I am pretty sure this is why these methods converge really fast and sample well where the standard Metropolis algorithm implemented in PyMC3 fails me (I use the same external model and Gaussian priors in both, of course).
These methods have been available for over 10 years, so I was wondering why they have not been implemented in PyMC. I am no expert in the field; I am just trying to use modern tools for parameter estimation, so I may be wrong in my interpretation of what is recommended and what is available in PyMC. But since PyMC is often on the cutting edge of research, sometimes implementing methods very soon after a paper is published, I am curious.
You can try DEMetropolis, which is supposed to also work well with correlated posteriors.
Thanks for the reply. I indeed saw that DEMetropolis was implemented in 3.3, but some weird bugs in 3.3 prompted me to stay on 3.2 a few months back. Maybe I’ll give it a try again. (It is also not adaptive, but another variation of MCMC mixed with differential evolution; still, if it works, why not.)
I didn’t realize that this method is also from 2006, like the DRAM I mentioned, so I guess nobody was interested in implementing such methods until recently. Is lack of time/interest from the developers also the answer to my question (the question being why adaptive methods are not yet implemented)? I just wanted to make sure there isn’t some deeper mathematical reason that makes them not good in the end, or something along those lines.
In high dimensions, the gradient is almost our only hope. So methods like DRAM are probably useful only for some very specific models, and we did not see a big need to implement them.
Of course, contributions are always welcome. It seems both DR and AM could be implemented on top of the current Metropolis sampler (put the AM in the tuning stage, and the DR in the sampler’s astep method). So I encourage you to start a PR/project, and we will be happy to help/review.
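To make that division of labour concrete, here is a toy 1-D sketch; the class and method names are illustrative and mirror the discussion, not PyMC3's actual internals. AM-style scale adaptation runs only during the tuning stage (so the final chain stays Markovian), and the DR retry would slot into astep:

```python
import numpy as np

class AdaptiveMetropolisSketch:
    """Toy 1-D sampler: AM-style adaptation in the tuning stage,
    with a placeholder for a DR retry inside astep."""

    def __init__(self, log_p, scale=1.0, rng=None):
        self.log_p = log_p
        self.scale = scale
        self.history = []
        self.rng = rng or np.random.default_rng()

    def tune(self):
        # AM part: derive the proposal scale from the past sample path.
        # Adapting only while tuning keeps the post-tuning chain valid.
        if len(self.history) > 10:
            self.scale = 2.38 * np.std(self.history) + 1e-6

    def astep(self, x, tuning=False):
        if tuning:
            self.tune()
        y = x + self.scale * self.rng.standard_normal()
        if np.log(self.rng.random()) < self.log_p(y) - self.log_p(x):
            x = y
        # (The DR part would go here: on rejection, propose a narrower
        #  second-stage candidate with the acceptance probability from
        #  the DRAM paper, which preserves reversibility.)
        self.history.append(x)
        return x

# Usage on a wide Gaussian: the scale adapts from 1.0 toward the
# target's spread during the first 1000 (tuning) steps.
log_p = lambda x: -0.5 * (x / 5.0) ** 2
rng = np.random.default_rng(4)
sampler = AdaptiveMetropolisSketch(log_p, rng=rng)
x = 0.0
for i in range(2000):
    x = sampler.astep(x, tuning=i < 1000)
```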
Thank you for the clarifications. Well, in my case I will never have a gradient, so I have to make do with what I have. I’ll probably end up having more parameters too, so we’ll see. I saw another thread with someone sampling 200 parameters using Metropolis. Hopefully I’ll never have that many…
When you say almost our only hope, what other hopes are there? SMC?
And for the PR, I don’t think I’ll be able to do it as of now, but maybe in the future when I understand more.
SMC is surely one solution. But personally, when a gradient is not available, you need to study your system and model very carefully, design a specific inference scheme, and carefully verify your results. You don’t want to end up with 5 million samples and an effective sample size of just 66.
Thanks for that input. I’ll be sure to check what I get with some common diagnostic tools such as the ones provided by PyMC.
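For instance, a crude effective-sample-size estimate can be computed directly from the chain's autocorrelations; PyMC/ArviZ's `ess` is the careful version of this idea, while the sketch below (function name mine) just truncates the autocorrelation sum at the first non-positive term:

```python
import numpy as np

def effective_sample_size(x):
    """Crude ESS estimate: n / (1 + 2 * sum of positive autocorrelations),
    truncating at the first non-positive lag."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    acov = np.correlate(x, x, mode="full")[n - 1:] / n  # autocovariances
    rho = acov / acov[0]                                # autocorrelations
    tau = 1.0
    for k in range(1, n):
        if rho[k] <= 0:
            break
        tau += 2 * rho[k]
    return n / tau

# A sticky chain (strong autocorrelation) has far fewer effective samples:
rng = np.random.default_rng(1)
ar = np.zeros(5000)
for t in range(1, 5000):
    ar[t] = 0.99 * ar[t - 1] + rng.standard_normal()  # slow-mixing AR(1)

iid = rng.standard_normal(5000)  # independent draws for comparison
print(effective_sample_size(ar) < effective_sample_size(iid))  # True
```

This is exactly the "5 million samples, ESS of 66" failure mode: a chain that moves but barely decorrelates.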
I am not sure what you mean by adaptive here. In my field, adaptation in the Metropolis sampler means tuning the step size based on the acceptance ratio, which is exactly what is implemented in the current Metropolis as well as in SMC. What does adaptation mean in your field?
Well, in the paper I linked, you can see that the adaptation I refer to is described as DRAM, which is AM + DR, with:
The intuition behind the AM [Adaptive Metropolis] is that, on-line tuning the proposal distribution in a MH can be based on the past sample path of the chain. Due to this form of adaptation the resulting sampler is neither Markovian nor reversible. In Haario et al. (2001) the authors prove, from first principles, that, under some regularity conditions on the way adaptation is performed and if the target distribution is bounded on a bounded support, the AM retains the desired stationary distribution.
For instance, the sigma of the Gaussian proposal distribution is tweaked during the chain.
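The AM covariance update itself is simple to sketch: the proposal covariance is the scaled, regularized empirical covariance of the past sample path, with the paper's recommended scaling s_d = 2.4²/d. A minimal NumPy sketch (function name and toy data are mine):

```python
import numpy as np

def am_proposal_cov(history, d, s_d=None, eps=1e-6):
    """AM (Haario et al., 2001): proposal covariance from the past
    sample path, C_t = s_d * (cov(X_0..X_t) + eps * I_d)."""
    if s_d is None:
        s_d = 2.4 ** 2 / d  # scaling recommended in the paper
    hist = np.asarray(history)
    # eps * I keeps the covariance positive definite early in the chain
    return s_d * (np.cov(hist, rowvar=False) + eps * np.eye(d))

# After the chain has explored a correlated region, the proposal
# inherits that correlation instead of staying isotropic:
rng = np.random.default_rng(2)
true_cov = np.array([[1.0, 0.9], [0.9, 1.0]])
fake_history = rng.multivariate_normal([0, 0], true_cov, size=1000)
C = am_proposal_cov(fake_history, d=2)
print(C[0, 1] > 0)  # True: off-diagonal picked up the correlation
```

In practice the covariance is updated recursively rather than recomputed from scratch, but the quantity being tracked is the same.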
Delayed rejection (DR) is a way of modifying the standard Metropolis-Hastings algorithm (MH) (Tierney, 1994) to improve efficiency of the resulting MCMC estimators relative to Peskun (1973) and Tierney (1998) asymptotic variance ordering. The basic idea is that, upon rejection in a MH, instead of advancing time and retaining the same position, a second stage move is proposed. The acceptance probability of the second stage candidate is computed so that reversibility of the Markov chain relative to the distribution of interest, π, is preserved. The process of delaying rejection can be iterated for a fixed or random number of stages. Higher stage proposals are allowed to depend on the candidates so far proposed and rejected. Thus DR allows partial local adaptation of the proposal within each time step of the Markov chain still retaining the Markovian property and reversibility.
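A one-retry version of this can be sketched in 1-D for a symmetric Gaussian first-stage proposal (function names are mine; the second-stage proposal density cancels in the ratio because it is symmetric around the current point):

```python
import numpy as np

def norm_logpdf(x, mu, sigma):
    # Gaussian log-density up to an additive constant (constants cancel)
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)

def dr_step(x, log_p, sigma1=2.0, sigma2=0.5, rng=None):
    """One delayed-rejection step: on rejection of the wide first-stage
    candidate y1, a narrower second-stage move y2 is tried, with an
    acceptance probability chosen to preserve reversibility."""
    rng = rng or np.random.default_rng()
    y1 = x + sigma1 * rng.standard_normal()
    a1 = min(1.0, np.exp(log_p(y1) - log_p(x)))
    if rng.random() < a1:
        return y1
    # second stage (reached only when a1 < 1)
    y2 = x + sigma2 * rng.standard_normal()
    a1_rev = min(1.0, np.exp(log_p(y1) - log_p(y2)))
    if a1_rev >= 1.0:
        return x  # numerator of the DR ratio is zero: reject
    log_num = log_p(y2) + norm_logpdf(y1, y2, sigma1) + np.log(1 - a1_rev)
    log_den = log_p(x) + norm_logpdf(y1, x, sigma1) + np.log(1 - a1)
    if np.log(rng.random()) < log_num - log_den:
        return y2
    return x

# Usage on a standard normal target:
log_p = lambda x: -0.5 * x * x
rng = np.random.default_rng(3)
chain = [0.0]
for _ in range(2000):
    chain.append(dr_step(chain[-1], log_p, rng=rng))
```

The narrower second stage is what gets a stuck chain moving: when the wide proposal keeps being rejected, the retry still makes local progress without shrinking the first-stage proposal permanently.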
So what they propose with DRAM is:
The Adaptive Metropolis (AM) algorithm is the global adaptive strategy we will combine with the local adaptive strategy provided by the DR.
and in my case, the toolbox using this concept that I linked earlier works better than the current standard Metropolis implementation in PyMC. Upon close inspection of the code, I believe it is mostly the AM part tweaking the sigma of the proposal distribution that gets me ‘unstuck’, whereas I stay stuck with PyMC.
Actually, I also just tried DEMetropolis on my problem and it doesn’t really give me better results either, i.e. I barely get the parameter values within the 2 sigma range after hundreds of thousands of samples.
I should still try SMC to check, but if you have other suggestions I’m all ears.
I see, thanks for the clarification! Yes indeed, that is missing, simply because people are too used to having the gradient and simply blasting everything with NUTS. I have the same problem as you do: there is also no gradient in my applications, and even the maths is not worked out yet, so it will take a few more decades, I guess. Which is why I went with SMC, and it worked much, much better than Metropolis at least.