Frequently Asked Questions

  • Is find_MAP still encouraged in practice? I notice a lot of code samples stopped using find_MAP (asked by @kpei)

find_MAP is now discouraged in most situations. The MAP is not a representative point in high dimensions, because it usually lies outside the typical set (Betancourt’s paper is the best reference). Moreover, it can make life difficult for NUTS, since the gradients are small at the MAP.

  • Is the NUTS sampler with ADVI always encouraged? I know that it only supports continuous variables (asked by @kpei)

Use NUTS when you can; it scales reasonably to higher dimensions. How to properly initialize it is still a bit of an open question, but ADVI is probably the best we have at the moment: it gives a very good scaling matrix for NUTS. In general, the estimation from ADVI is quite close to the posterior from MCMC (you can get even better results using FullRankADVI).
[EDIT]: We now have a new default for initialising NUTS, which is close to what Stan does: adapting the mass matrix during tuning.

  • I see a flat posterior distribution using ADVI, what is happening?

This is usually an indication that the gradient is zero for that random variable. It happens when some manipulation of the random variables breaks the smooth geometry. Even if all your random variables are continuous, some operations will break the gradient, for example:

  • casting a variable to int: c = a.astype('int32')
  • using switch: a = pm.math.switch(tau >= X, a1, a2)

Sometimes this even causes unexpected consequences, for example poor initialization for NUTS (e.g., see here).
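A quick way to see the first case (a plain-NumPy sketch, using finite differences to stand in for the gradient): the int cast turns a smooth function into a step function whose derivative is zero almost everywhere:

```python
import numpy as np

def f_smooth(a):
    return a * 2.0

def f_cast(a):
    # mimics c = a.astype('int32'): piecewise constant, so flat gradient
    return float(np.float64(a * 2.0).astype("int32"))

eps = 1e-6
a0 = 1.3
grad_smooth = (f_smooth(a0 + eps) - f_smooth(a0 - eps)) / (2 * eps)
grad_cast = (f_cast(a0 + eps) - f_cast(a0 - eps)) / (2 * eps)
print(grad_smooth, grad_cast)  # ~2.0 and 0.0
```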

One workaround is reparametrization. For example, instead of using switch:

tau = pm.Uniform('tau', lower=0, upper=100)
a = pm.math.switch(tau >= X, a1, a2)

approximate the switch point with a sigmoid function:

tau = pm.Uniform('tau', lower=0, upper=100)
weight = tt.nnet.sigmoid(2 * (X - tau))
a = weight * a1 + (1 - weight) * a2
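To see why the sigmoid version behaves like the switch, the weighting can be checked in plain NumPy (an illustrative sketch; here the hard switch assigns a1 where X >= tau, matching the direction of the sigmoid weight, and the factor 2 controls the steepness):

```python
import numpy as np

tau = 50.0
X = np.linspace(0.0, 100.0, 101)
a1, a2 = 1.0, 5.0

# hard, non-differentiable version: a1 where X >= tau, else a2
hard = np.where(X >= tau, a1, a2)

# smooth version: the sigmoid weight moves continuously from 0 to 1
# around X = tau, so gradients with respect to tau are well defined
weight = 1.0 / (1.0 + np.exp(-2.0 * (X - tau)))
smooth = weight * a1 + (1 - weight) * a2

# the two agree closely away from the switch point
print(np.abs(hard - smooth)[np.abs(X - tau) > 5].max())
```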

  • Help, my model is slow! It takes hours to sample!

Answered by @aseyboldt (see Possible Performance Issue Using Sparse Matrices in Bipartite Noisy-Or Model):

The first thing I do in cases like that is to try to find out why it is slow. There are two basic reasons for the slowness: either each gradient evaluation takes a lot of time, or we need a lot of gradient evaluations for each sample. If it is the first, you can use theano’s profiling tools to figure out where the problem is. If it is the second, the problem is the shape of the posterior, so you need to look into reparametrizations that reduce funnels or correlations.

You can access the profile using something along those lines:

import numpy as np

func = model.logp_dlogp_function(profile=True)
x = np.random.randn(func.size)
%timeit func(x)


You can access the number of grad evals using trace.tree_size. There are also a couple of other diagnostics that you could look at, there is a list in the doc for pymc3.NUTS.

Also check out this reply, which includes some more information: PyMC3 slows rapidly with increasing numbers of parameters

  • Getting NaN or Inf during NUTS or VI (e.g., 'Bad initial energy: inf. The model might be misspecified.' or 'NaN occurred in optimization.')
  • NaN occurred in optimization at the first iteration in VI (i.e., seeing something like 0%| | 0/5000 [00:00<?, ?it/s])

Check the initial approximation parameters: approx.approx.params[0].eval(). If it returns nan, that is what is causing the error.

Setting the values manually to zero could solve this. Below is an example if you are doing ADVI:

import numpy as np

mu_approx = approx.approx.params[0]
rho_approx = approx.approx.params[1]
# the params are shared variables, so their values can be reset in place
mu_approx.set_value(np.zeros_like(mu_approx.eval()))
rho_approx.set_value(np.zeros_like(rho_approx.eval()))

  • How can I get the Model logp?

This comes up a couple of times:

How to do sample_ppc after fitting the model that contains pm.Minibatch?

Answer: you cannot do that directly. But there is a workaround.


My model was also very slow. I got the following error: “g++ not detected ! Theano will be unable to execute optimized C-implementations (for both CPU and GPU) and will default to Python implementations.”
What solved the problem for me was to install libpython (conda install mingw libpython).

  • The input of DensityDist is not just the observed data, but a dictionary that can include both the parameters and the observed:

NOTE: This is no longer correct for PyMC > 4.0

  • How can I marginalize a model with latent discrete variables?
  • I am getting shape errors when doing posterior predictive sampling after calling set_data

By default, observed variables take their shape from the shape of the observed data. This is a convenience, so users don’t have to specify the shape manually (via the shape argument or dims: see Distribution Dimensionality — PyMC 5.5.0 documentation).

import pymc as pm

with pm.Model() as m:
  x = pm.MutableData("x", [0.0, 1.0, 2.0])
  data = pm.ConstantData("data", [0.0, 2.0, 4.0])

  beta = pm.Normal("beta")
  mu = beta * x
  y = pm.Normal("y", mu, observed=data)
  # This is equivalent to
  # y = pm.Normal("y", mu, observed=data, shape=data.shape)

  idata = pm.sample()

with m:
  pm.set_data({"x": [0.0, 1.0, 2.0, 3.0]})
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (3,) and arg 1 with shape (4,).

This happens because y has a fixed shape of (3,).
The error can be obtained directly from numpy like this:

import numpy as np

np.random.normal([0.0, 1.0, 2.0, 3.0], size=(3,))  # raises the same ValueError

The recommended fix is to specify how the shape of y depends on the parameter we intend to change:

import pymc as pm

with pm.Model() as m:
  x = pm.MutableData("x", [0.0, 1.0, 2.0])
  data = pm.ConstantData("data", [0.0, 2.0, 4.0])

  beta = pm.Normal("beta")
  mu = beta * x
  y = pm.Normal("y", mu, observed=data, shape=mu.shape)

  idata = pm.sample()

with m:
  pm.set_data({"x": [0.0, 1.0, 2.0, 3.0]})

Examples available in the docs: pymc.set_data — PyMC 5.5.0 documentation

  • Multi-core sampling is very slow / using more cores than specified.

PyMC graphs can invoke computations from other libraries (such as scipy, numpy, or PyTensor) that are natively multi-threaded by default. This can cause the number of cores used to be larger than the number specified in sample, and it may slow down sampling altogether. Calculating Deterministics on the main thread can also use cores beyond those needed by the individual chains.
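One common way to cap those native thread pools (an illustrative sketch, not taken from the threads below) is to set the relevant environment variables before numpy/PyTensor are imported:

```python
import os

# Cap BLAS/OpenMP thread pools; these must be set before importing
# numpy or PyTensor, or they may have no effect
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy as np  # imported only after the caps are in place
```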

See more details and tips about how to control this behavior in the following threads:
