map : Use the MAP as starting point. This is discouraged.
My questions are:
Is pm.sample(..., start=start, ...) equivalent to pm.sample(..., init='map', ...)?
Why is it discouraged?
I tried with and without start=pm.find_MAP(), and convergence improved a lot when using find_MAP(). But then I saw in the documentation that it's not suggested… I am wondering whether it was just luck, or whether there is something I missed or problems I don't know about…
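For reference, a minimal sketch of the two calls being compared, using a made-up toy model that is not from the thread (PyMC3-era API; newer versions use import pymc as pm and have renamed or removed some of these options):

```python
import numpy as np
import pymc3 as pm

y = np.random.default_rng(0).normal(loc=1.0, size=100)  # toy data, purely for illustration

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("obs", mu, sigma, observed=y)

    # Variant 1: compute the MAP yourself and pass it as the start point
    start = pm.find_MAP()
    trace_start = pm.sample(start=start)

    # Variant 2: ask pm.sample() to start from the MAP via the init kwarg
    trace_map = pm.sample(init="map")
```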
The short answer is that in high dimensions the MAP is usually not in the typical set (the region where most of the volume of the posterior parameter space lives). There are quite a lot of resources explaining this (e.g., concentration of measure in high dimensions).
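A quick numpy sketch of that effect, assuming the simplest possible case of a d-dimensional standard normal (where the MAP is the origin): as d grows, draws concentrate on a thin shell of radius about sqrt(d), far away from the mode.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))
    r = np.linalg.norm(x, axis=1)  # distance of each draw from the MAP (the origin)
    print(f"d={d:5d}  mean distance from mode = {r.mean():7.2f}  (sqrt(d) = {np.sqrt(d):.2f})")
# As d grows, essentially no posterior mass sits near the mode:
# the typical set is a thin shell at radius ~ sqrt(d).
```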
It depends on how you measure convergence. In some low-dimensional cases it might be faster, as the MAP is indeed in the typical set. However, in most (high-dimensional) cases, if using find_MAP() gives you a nice trace it usually means there are latent problems with your model (one scenario is that there are multiple modes, and with the MAP as the start point the trace gets stuck in just one of them).
In general, the default works better in the sense that, if sampling is slow or not converging, you can be pretty sure there is either a problem with your model or something that could be optimized.
Thanks a lot for the explanation.
I would like to ask about the documentation:
For almost all continuous models, NUTS should be preferred. There are hard-to-sample models for which NUTS will be very slow causing many users to use Metropolis instead. This practice, however, is rarely successful. NUTS is fast on simple models but can be slow if the model is very complex or it is badly initialized. In the case of a complex model that is hard for NUTS, Metropolis, while faster, will have a very low effective sample size or not converge properly at all. A better approach is to instead try to improve initialization of NUTS, or reparameterize the model.
There are many parameters in NUTS, for example the step size, the mass matrix, the parameters of the dual averaging for the step size… Some of these parameters will be set if you are calling the step method directly as above, which is what we refer to as initialization. As a result, in general you should not initialize the step method as above.
Instead, what we recommend is to use the init=... kwarg in pm.sample() to do the initialization. The main thing to point out is that, although internally it is calling the function pm.sampling.init_nuts(init='jitter+adapt_diag', …), it returns not only the start point but, more importantly, a scheme for tuning the mass matrix of NUTS during the tuning steps.
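A rough sketch of the two usages being contrasted, with a toy model made up for illustration (PyMC3-era API):

```python
import numpy as np
import pymc3 as pm

y = np.random.default_rng(1).normal(size=100)  # toy data, purely for illustration

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    pm.Normal("obs", mu, 1.0, observed=y)

    # Constructing the step method yourself fixes its parameters up front,
    # bypassing the tuning scheme that init_nuts would otherwise set up:
    step = pm.NUTS()
    trace_manual = pm.sample(step=step)

    # Recommended: let pm.sample() handle initialization. Internally it calls
    # pm.sampling.init_nuts(init="jitter+adapt_diag", ...), which returns start
    # points plus a mass-matrix adaptation scheme used during the tuning steps.
    trace = pm.sample(init="jitter+adapt_diag", tune=1000)
```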
There is always a problem if the modes are well separated. But I think that in practice the modes are usually not that well separated, and a good initialization can give appropriate scaling to the NUTS kernel, so that the random velocity has enough energy to go from one mode to the other.
Is n_init suggested to be a large number for some reason? It takes quite a long time when I have 800 data points, 5 nodes in the input layer, and 5 nodes in the hidden layer… I was wondering if I can freely choose a number, or if I should take something into consideration when I change it.
Thanks a lot.
So I take it you are using advi for initialization? Usually it stops well before 200000 iterations (we have a simple heuristic to check whether it has converged and stop early if so).
Oh, you are using init=nuts? In that case, yes, you can reduce n_init. But I don't think it usually works too well. The default init=jitter+adapt_diag is usually recommended.
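For what it's worth, a small sketch of where n_init enters, with a toy model and arbitrarily chosen numbers (in this PyMC3-era API, n_init mainly matters for the ADVI-based inits, and the optimizer typically stops well before n_init via its convergence check):

```python
import numpy as np
import pymc3 as pm

y = np.random.default_rng(2).normal(size=800)  # roughly the data size mentioned above

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    pm.Normal("obs", mu, 1.0, observed=y)

    # n_init caps the number of initializer iterations; ADVI usually stops earlier.
    trace_advi = pm.sample(init="advi+adapt_diag", n_init=50_000)

    # The recommended default does not depend on n_init at all:
    trace = pm.sample(init="jitter+adapt_diag")
```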
I came here to see what PyMC's current suggestion was, because I just wrote a post on Andrew Gelman's blog where I tried initializing at the mode for a 10K-dimensional standard normal. Summary: it's a bad idea, but NUTS can compensate with its default warmup.
First, there may not be a mode. In hierarchical models you don't get modes unless you have specialized zero-avoiding priors: the joint density keeps growing as the group-level scale shrinks toward zero, so there is no maximum to initialize at.
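To illustrate the "no mode" point, here is a small numpy/scipy sketch of a toy centered hierarchical normal (invented for this example, with a flat prior on the group scale tau): the joint log density along the ridge theta_j = 0 increases without bound as tau shrinks.

```python
import numpy as np
from scipy import stats

# Centered hierarchical model: y_j ~ Normal(theta_j, 1), theta_j ~ Normal(0, tau),
# with a flat prior on tau, so the posterior is proportional to this joint density.
rng = np.random.default_rng(3)
y = rng.normal(size=8)

def joint_logp(theta, tau, y):
    return (stats.norm.logpdf(y, loc=theta, scale=1.0).sum()
            + stats.norm.logpdf(theta, loc=0.0, scale=tau).sum())

for tau in (1.0, 0.1, 0.01, 0.001):
    print(f"tau={tau:6.3f}  joint log density = {joint_logp(np.zeros_like(y), tau, y):8.2f}")
# The log density grows without bound as tau -> 0: there is no mode,
# so "initialize at the MAP" is not even well defined here.
```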
To add a little more detail to what @junpenglao said, the problem isn't just that the mode is atypical. We start with random inits in the tails all the time. The difference is that the mode has the maximum density, so any proposal is by definition a move to a lower-density state. The density usually drops off very fast moving away from the mode in high dimensions, making the acceptance rate basically zero for any reasonably large step size. This is a problem for HMC, for the original NUTS, and for the multinomial form of NUTS introduced by Michael Betancourt and currently used in Stan.
Luckily, the default step size tuning in NUTS will compensate by starting with a small step size in Phase I of warmup and finishing with a reasonable step size in Phase III of warmup before sampling. (This assumes you do the warmup similarly to how we do it in Stan, or have another means of ongoing step size adaptation.)
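To make the acceptance-rate argument concrete, here is a small numpy sketch using a plain random-walk Metropolis proposal from the mode of a d-dimensional standard normal. This is a simplification (HMC proposals behave differently in detail), but the density drop-off it measures is the same issue.

```python
import numpy as np

# Start at the mode of a d-dimensional standard normal and propose x' = eps * z
# with z ~ N(0, I). The log-density change is -0.5 * eps**2 * ||z||**2.
rng = np.random.default_rng(4)
d, n = 10_000, 1_000
for eps in (1.0, 0.1, 0.01):
    z = rng.standard_normal((n, d))
    delta_logp = -0.5 * eps**2 * (z**2).sum(axis=1)   # log p(proposal) - log p(mode)
    accept = np.exp(np.minimum(0.0, delta_logp)).mean()
    print(f"eps={eps:5.2f}  mean acceptance probability = {accept:.3g}")
# Only a tiny step size gives a usable acceptance rate from the mode, which is
# why warmup has to start small and adapt the step size upward.
```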