Frequently Asked Questions

  • Is find_MAP still encouraged in practice? i notice a lot of code samples stopped using find_MAP (asked by @kpei)

find_MAP is pretty much discouraged now in most situations. The MAP is not a representative point in high dimensions, because the MAP is usually not in the typical set (Betancourt’s paper is the best reference Moreover, it can make life difficult for NUTS, as the gradients are small at the MAP.

  • Is NUTS sampler with ADVI always encouraged? I know that it only supports continuous (asked by @kpei)

Use NUTS when you can, it scales reasonably in higher dimensions. How to properly initialize it is a bit of an open question right now, but advi is probably the best we have at the moment: it gives a very good scaling matrix for NUTS. And in general, the estimation from ADVI is quite close to the posterior from MCMC (you can even get better result using FullRankADVI)
[EDIT]: We now have a new default for initialising NUTS, which is close to what STAN does by adapting the mass matrix during tuning.


  • I see a flat posterior distribution using ADVI, what is happening?

This is usually an indication that the gradient is zero for that random variable. It happens if some of the manipulation of the random variables breaks the smooth geometry property. Even if all your random variables are continuous, some operation will break the gradient, for example:

. casting a variable to int: c = a.astype('int32')
. using switch a = pm.math.switch(tau >= X, a1, a2)

Sometimes it even cause unexpected consequences, for example, poor initialization for NUTS (e.g., see here).

One of the workaround is reparametrization, for example, instead of using switch:

tau = pm.Uniform('tau', lower=0, upper=100)
a = pm.math.switch(tau >= X, a1, a2)

Appoximated the switch point with a Sigmoid function

tau = pm.Uniform('tau', lower=0, upper=100)
weight = tt.nnet.sigmoid(2 * (X - tau))
a = weight * a1 + (1 - weight) * a2

Sampling time is very long (minutes per sample)
Geometric variable not being properly Sampled
  • Help my model is slow! It takes hours to sample!

anwsered by @aseyboldt (see Possible Performance Issue Using Sparse Matrices in Bipartite Noisy-Or Model)
The first thing I do in cases like that is usually to try and find out why it is slow. There are two basic reasons for the slowness. Either the gradient evaluations take a lot of time or we need a lot of gradient evaluations for each sample. If it is the first, the problem is the shape of the posterior, so you need to look into reparametrizations that reduce funnels or correlations. If it is the second, you can use theano’s profiling tools to figure out where the problem is.

You can access the profile using something along those lines:

func = model.logp_dlogp_function(profile=True)
x = np.random.randn(func.size)
%timeit func(x)


You can access the number of grad evals using trace.tree_size. There are also a couple of other diagnostics that you could look at, there is a list in the doc for pymc3.NUTS.

Also check out this reply, which includes some more information: PyMC3 slows rapidly with increasing numbers of parameters

Improving speed with large dimensional problem
Mixed multivariate Gauss distribution
  • Getting NaN or Inf during NUTS or VI (e.g., 'Bad initial energy: inf. The model might be misspecified., NaN occurred in optimization.)

How to debug Bad initial energy: nan?
Getting 'Bad initial energy: inf' when trying to sample simple model
Model specification problem?
How to use method
  • NaN occurred in optimization at the first iteration in VI (i.e., seeing something like 0%| | 0/5000 [00:00<?, ?it/s])

Check the initial approximation parameters: approx.approx.params[0].eval(). If it returns nan, that is what causing the error.

Setting the value manually to zero could solve this. Below is an example if you are doing ADVI:

mu_approx = approx.approx.params[0]
rho_approx = approx.approx.params[1]


NaN occurred in optimization with ADVI
NaN occurred in optimization (error in DL with variational minibatches)
  • How can I get the Model logp?

How to calculate log posterior of a GP over a trace in pymc3

This comes up a couple time:

How to do sample_ppc after fitting the model that contains pm.Minibatch?

Answer: you cannot do that directly. But there is a workaround.


My model was also very slow. I got the following error: “g++ not detected ! Theano will be unable to
execute optimized C-implementations (for both CPU and GPU) and will default to
Python implementations.”
What solved the problem for me was to install libpython (conda install mingw libpython).

  • The input of DensityDist is not just the observed, but the dictionary including potentially the parameter and the observed: