map : Use the MAP as starting point. This is discouraged.
My questions are
Is pm.sample( …,start=start, …) equivalent to pm.sample( …,init= ‘map’, …)?
Why is it discouraged?
I tried with and without start= pm.find_MAP(), and the convergence with the later one is improved a lot. But then I saw the documentation that it’s not suggested… I am wondering if there was just luck or something I missed or problems I didn’t know…
The short answer is that in high dimension, the MAP is usually not in the typical set (where most of the volume of the posterior parameter space). There are quite a lot of resource explaining this (e.g., concentration of measure in high dimension etc).
It depends on how you measure convergence. In some low dimensional cases, it might be faster as the MAP is indeed in the typical set. However, in most (high dimension) cases, if using find_MAP() you get a nice trace it usually means there are latent problems with your model (One scenario could be that there are multimode, and using MAP as start the trace is stucked in just one of them).
In general, the default works better in a sense that, if it is slow or not converging you can be pretty sure that there are either problem of your model, or places could be optimized.
Thanks a lot for the explanation.
I would like to ask about the documentation
For almost all continuous models, NUTS should be preferred. There are hard-to-sample models for which NUTS will be very slow causing many users to use Metropolis instead. This practice, however, is rarely successful. NUTS is fast on simple models but can be slow if the model is very complex or it is badly initialized. In the case of a complex model that is hard for NUTS, Metropolis, while faster, will have a very low effective sample size or not converge properly at all. A better approach is to instead try to improve initialization of NUTS, or reparameterize the model.
There are many parameters in NUTS, for example, the step size, the mass matrix, the parameter for dual averaging for step size… Some of these parameters will be set if you are calling the step method directly as above, which is what we refer to as initialization. As a result, in general you should not initialized the step method as above.
Instead, what we recommend is to use the init=... kwarg in pm.sample() to do the initialization. The main thing to point out is that, although internally it is calling the function pm.sampling.init_nuts( init=‘jitter+adapt_diag’,…), it is returning not only the start point but more importantly a scheme for tunning the mass matrix of NUTS during some tunning steps.
There is always a problem if the modes are well separated. But I think in practice, usually the modes are not as well separate and a good initialization could give appropriate scaling to the NUTS kernel - the random velocity is then have enough energy to go from one mode to the other.
Is n_init suggested to be a great number because of some reason? Because it takes quite a long time when I have 800 datapoints 5 nodes in input layer , 5 nodes hidden layer… I was wondering if I can freely choose a number or if I should take something into consideration when I change it.
Thank a lot.