About starting sampling with find_MAP

I have been looking for solutions for my model because it didn’t give good convergence. While searching for information I saw some people using

start = pm.find_MAP()
train_trace = pm.sample(train_samp, tune=train_tune, start=start, njobs=4)

and in the documentation it states

map : Use the MAP as starting point. This is discouraged.

My questions are

  1. Is pm.sample(…, start=start, …) equivalent to pm.sample(…, init='map', …)?
  2. Why is it discouraged?
    I tried with and without start=pm.find_MAP(), and the convergence with find_MAP is improved a lot. But then I saw in the documentation that it’s not suggested… I am wondering if it was just luck, or something I missed, or problems I didn’t know about…

Yes

The short answer is that in high dimensions, the MAP is usually not in the typical set (the region where most of the volume of the posterior sits). There are quite a lot of resources explaining this (e.g., concentration of measure in high dimensions, etc.).

It depends on how you measure convergence. In some low-dimensional cases it might be faster, as the MAP is indeed in the typical set. However, in most (high-dimensional) cases, if you get a nice trace using find_MAP() it usually means there are latent problems with your model (one scenario is that there are multiple modes, and using the MAP as the start the trace gets stuck in just one of them).
In general, the default works better in the sense that, if sampling is slow or not converging, you can be pretty sure there is either a problem with your model, or something that could be optimized.
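
For a rough numerical illustration of that point (my own sketch in NumPy, nothing PyMC-specific): draws from a d-dimensional standard normal concentrate at radius about sqrt(d) from the mode at the origin, so in high dimensions essentially none of the posterior mass sits near the MAP.

import numpy as np

# Sketch: samples from a d-dimensional standard normal sit at radius ~ sqrt(d),
# far from the mode at the origin, once d is large.
rng = np.random.default_rng(0)
for d in (1, 10, 100, 1000):
    x = rng.standard_normal((10_000, d))
    radii = np.linalg.norm(x, axis=1)
    print(f"d={d:5d}  mean radius={radii.mean():7.2f}  sqrt(d)={np.sqrt(d):7.2f}")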

Thanks a lot for the explanation. :smiley:
I would like to ask about the documentation

For almost all continuous models, NUTS should be preferred. There are hard-to-sample models for which NUTS will be very slow causing many users to use Metropolis instead. This practice, however, is rarely successful. NUTS is fast on simple models but can be slow if the model is very complex or it is badly initialized. In the case of a complex model that is hard for NUTS, Metropolis, while faster, will have a very low effective sample size or not converge properly at all. A better approach is to instead try to improve initialization of NUTS, or reparameterize the model.

Does initialization of NUTS mean something like

start = pm.sampling.init_nuts(init='nuts', …)
pm.sample(…, start=start, …)

But wouldn’t it also be a problem if the model is multimodal?

The initialization of NUTS means something like:

step = pm.NUTS(...)  # initial parameters

There are many parameters in NUTS, for example the step size, the mass matrix, the parameters for the dual averaging of the step size… Some of these parameters will be set if you call the step method directly as above, which is what we refer to as initialization. As a result, in general you should not initialize the step method as above.

Instead, what we recommend is to use the init=... kwarg in pm.sample() to do the initialization. The main thing to point out is that, although internally it is calling the function pm.sampling.init_nuts(init='jitter+adapt_diag', …), it returns not only the start point but, more importantly, a scheme for tuning the mass matrix of NUTS during the tuning steps.
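
For example, a minimal sketch of that recommended pattern (the toy model and data here are placeholders, assuming the PyMC3 API discussed in this thread):

import numpy as np
import pymc3 as pm

data = np.random.randn(100)  # placeholder data

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sd=1.0)
    pm.Normal("obs", mu=mu, sd=1.0, observed=data)

    # Let pm.sample() handle the NUTS initialization; "jitter+adapt_diag"
    # is the default and also tunes the mass matrix during the tuning steps.
    trace = pm.sample(draws=1000, tune=1000, init="jitter+adapt_diag")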

There is always a problem if the modes are well separated. But I think in practice the modes are usually not that well separated, and a good initialization can give appropriate scaling to the NUTS kernel, so the random velocity then has enough energy to go from one mode to the other.

Thank you for the reply!! :slight_smile:

In the documentation

pymc3.sampling.sample(draws=500, step=None, init='auto', n_init=200000, …)

Is n_init set to such a large number for a particular reason? It takes quite a long time when I have 800 data points, 5 nodes in the input layer, and 5 nodes in the hidden layer… I was wondering if I can freely choose a number, or if I should take something into consideration when I change it.
Thanks a lot. :slight_smile:

So I take it you are using advi for initialization? Usually it stops well before 200000 (we have a simple heuristic to check whether it has converged and stop early if so).

No, I was using nuts. I thought it would take the value of draws, but it seems to take 200000 if no value is passed into the function.

Oh, you are using init='nuts'? In that case, yes, you can reduce n_init. But I don’t think it usually works very well. The default init='jitter+adapt_diag' is usually recommended.
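
For instance (just a sketch, assuming the API above; model stands for whatever model you have already defined):

with model:
    # If you keep init="nuts", you can cut n_init well below the 200000 default.
    trace = pm.sample(draws=1000, init="nuts", n_init=20000)

    # Usually better: the default initialization, which tunes the mass matrix
    # during the tuning steps instead.
    trace = pm.sample(draws=1000, tune=1000, init="jitter+adapt_diag")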

I came here to see what PyMC’s current suggestion was, because I just wrote a blog post on Andrew Gelman’s blog about initializing at the mode, where I tried it for a 10K-dimensional standard normal. Summary: it’s a bad idea, but NUTS can compensate with its default warmup.

First, there may not be a mode. In hierarchical models, you don’t get modes unless you have specialized zero-avoiding priors.

To add a little more detail to what @junpenglao said, the problem isn’t just that the mode is atypical; we start with random inits in the tails all the time. The difference is that the mode has the maximum density, so any proposal is by definition a lower-density state. The density usually drops off very fast moving away from the mode in high dimensions, making the acceptance rate basically zero for any reasonably large step size. This is a problem for HMC, for the original NUTS, and for the multinomial form of NUTS introduced by Michael Betancourt and currently used in Stan.
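
A quick back-of-the-envelope sketch of that drop-off (mine, not from the blog post): for a 10,000-dimensional standard normal, even a small random step away from the mode costs on the order of d * eps^2 / 2 nats of log density, so a symmetric proposal started at the mode is essentially never accepted unless the step size is tiny.

import numpy as np

def logp(x):
    # Unnormalized log density of a standard normal.
    return -0.5 * np.dot(x, x)

rng = np.random.default_rng(1)
d = 10_000
mode = np.zeros(d)
proposal = mode + 0.1 * rng.standard_normal(d)  # a modest step away from the mode

# Log-density drop is roughly -d * (0.1 ** 2) / 2 = -50, i.e. an acceptance
# probability of about exp(-50) for a symmetric proposal started at the mode.
print(logp(proposal) - logp(mode))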

Luckily, the default step size tuning in NUTS will compensate by starting with a small step size in Phase I of warmup and then finishing with a reasonable step size in Phase III of warmup before sampling. (This is assuming you do the warmup similarly to how we do it in Stan or have another means of ongoing stepsize adaptation).
