How NUTS evades the curse of dimensionality

When I run my model in NUTS with 20 or 2000 data points, there’s no difference in sampling speed, unlike with, e.g., emcee. I’ve read somewhere that NUTS is not affected by the curse of dimensionality. Is there a good video or an article explaining why this is so?

This talk by Michael Betancourt is really good at explaining the problem: https://www.youtube.com/watch?v=pHsuIaPbNbY. This blog post by Andrew Gelman is also very good: http://andrewgelman.com/2017/03/15/ensemble-methods-doomed-fail-high-dimensions/.

In addition to the things @twiecki mentioned:
The number of parameters or data points usually isn’t a good measure of how hard the problem is. The best measure of performance we have is effective samples per second, which is influenced by these factors:

  • The number of effective samples per sample
  • The number of gradient evaluations per sample
  • The time to compute the gradient (plus some overhead, which matters especially if the gradient itself is very fast)
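
Roughly, the time per sample is the number of gradient evaluations per sample times the time per gradient evaluation (plus overhead), so effective samples per second ≈ (effective samples per sample) / (gradient evaluations per sample × time per gradient evaluation).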

If you want to have a look at those for a particular trace, you can get the effective sample size with pm.effective_n, the number of gradient evaluations from trace.tree_size, and the time it takes to evaluate the gradient with the snippet below:

import numpy as np

logp_dlogp = model.logp_dlogp_function()  # compiled logp and gradient for `model`
logp_dlogp.set_extra_vars({})
x = np.random.randn(logp_dlogp.size)  # a random point in the transformed parameter space
%timeit logp_dlogp(x)  # IPython magic, so run this in IPython/Jupyter
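
Putting those three numbers together, a rough back-of-the-envelope estimate of effective samples per second could look like the sketch below. This assumes `model` and `trace` come from an earlier pm.sample run with NUTS, uses a plain timing loop instead of %timeit so it works outside IPython, and is only an illustration, not part of the original answer:

import time
import numpy as np
import pymc3 as pm

# Factor 1: effective samples per sample (take the worst variable to be conservative)
ess = pm.effective_n(trace)
min_ess = min(np.min(np.asarray(v)) for v in ess.values())
ess_per_sample = min_ess / (len(trace) * trace.nchains)

# Factor 2: average number of gradient evaluations per sample
grad_evals_per_sample = np.mean(trace.tree_size)

# Factor 3: time per gradient evaluation (crude timing loop instead of %timeit)
logp_dlogp = model.logp_dlogp_function()
logp_dlogp.set_extra_vars({})
x = np.random.randn(logp_dlogp.size)
start = time.perf_counter()
for _ in range(1000):
    logp_dlogp(x)
grad_time = (time.perf_counter() - start) / 1000

# Rough estimate of effective samples per second, ignoring overhead outside the gradient
ess_per_second = ess_per_sample / (grad_evals_per_sample * grad_time)
print(ess_per_second)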

Now, let’s say you increase the number of data points without changing the model or the number of parameters (in many models the number of parameters depends on the number of data points, but let’s ignore that here). In many cases the posterior will then get tighter and look more and more like a normal distribution. Unless this introduces correlations, the number of effective samples per sample might increase, and the number of gradient evaluations per sample might decrease, since the sampler no longer has to follow the long tails of the distribution in long trajectories where the step size isn’t well adapted. The time to compute the gradient will increase, however (although maybe not as much as you would expect, since especially for small datasets a large part of the gradient cost is just overhead).
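
To see this concretely, you could run the same model with 20 and 2000 observations and compare the gradient evaluations per sample and the effective sample size. The sketch below is just a toy illustration (a hypothetical normal-mean model, assuming a PyMC3 version where pm.effective_n is available), not something from the original discussion:

import numpy as np
import pymc3 as pm

def run(n):
    data = np.random.randn(n)
    with pm.Model():
        mu = pm.Normal("mu", mu=0.0, sigma=10.0)
        pm.Normal("obs", mu=mu, sigma=1.0, observed=data)
        trace = pm.sample(1000, tune=1000, chains=2, progressbar=False)
    # Average tree size (gradient evals per sample) and effective sample size for mu
    return np.mean(trace.tree_size), pm.effective_n(trace)["mu"]

# Larger n gives a tighter, more normal posterior: tree size stays small (or shrinks)
# and ESS per sample stays high, while each gradient evaluation gets more expensive.
print(run(20))
print(run(2000))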

In short: the sampling speed doesn’t depend so much on the number of data points or parameters as on the shape of the posterior.
