PyTorch backend for PyMC4


Read quite a bit recently about automatic differentiation

“Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?” :

  1. You write a subroutine to compute a function f({\bf x}). (e.g. in C++ or Fortran). You know f to be differentiable, but don’t feel like writing a subroutine to compute \nabla f.
  2. You point some autodiff software at your subroutine. It produces a subroutine to compute the gradient.
  3. That new subroutine has the same complexity as the original function!
    It does not depend on the dimensionality of \bf x.
  4. It also does not suffer from round-off errors!

“Automatic Differentiation Variational Inference” :

we develop automatic differentiation variational inference (ADVI). Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models-no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use.

Have more if you like,


Another tangent about probabilistic programming with functional programming techniques.

Posted some papers on the Figaro

The rationale is to show how function composition can be used to create composable MCMC algorithms.

(video bookmark)

(video bookmark)
key insights about composing handers: sequential monte carlo (SMC) handler + MH handler => particle MCMC handler

Again, the intent is to suggest a language/DSL first, framework last approach to make the most out of this “crisis” caused by Theano going away.


And this today: Uber AI Labs Open Sources Pyro, a Deep Probabilistic Programming Language


I was just about to share that here!


I’ve started running through the pyro docs examples, and oh boy, it looks powerful but the interface is seriously non-intuitive!

Then the thought came to mind: what if PyMC4 was a wrap-around pyro? Like Keras is for Theano/TF? Perhaps offloading the hard math part to the budding pyro community and defining the best interface for probabilistic programming? I’m quite under-informed on the kind of effort that’s needed for this, so this is just a thought, I guess…? I think I’ll get to bump into Colin tonight at Boston Bayesians, so I’ll try to get his thoughts…


@ericmjl That’s a really interesting thought. We have considered the same with Edward / BayesFlow. Essentially both of those packages are aimed at researchers giving a lot of flexiblity at the cost of intuitive syntax. These can be viewed as a middle-layer on top of the graph engine. PyMC3 always shined at being beginner friendly with easy syntax, so can be seen as targeting the top level.

Not sure the existing syntax could work with pyro, however, as the model creation needs to be rerun I think.


One benefit of dynamic graph would be on models with non-parametric priors such as CRP and IBP. I don’t see how these models can be sampled with static graphs.


I like this idea, but for now Pyro doesn’t implement MCMC. To my knowledge, Pyro is for Bayesian deep learning so it only has SVI.


They are adding it:


There are a few teams also in progress to implement something similar to tensorflow.contrib.distribution in PyTorch. You can find their discussion of the design here:

[edit:] Seems it already made some progress:


What are the current running candidates for the pymc4 backend ? Pyro and Pytorch ?

I personally have moved away from tensorflow to pytorch because of its intuitive api design. With theano going out of development, pytorch in my knowledge stands out as the best library for creating computational graphs and running automatic differentiation.


@shkr: Thanks for your perspective. We haven’t really constrained the space too much. Options are: MXNet, TensorFlow, Edward, PyTorch, and Pyro. I’m listing packages as potential backends too as we could use those to build pymc3 API on top of.

Is this something you’d be interested in exploring, perhaps as part of GSoC (in case you’re a student)?


I am not a student, so can’t participate in GSoc. But I do find time for open source work, so I can contribute via PRs/Issues as needed.


My two cents as one of the users:

I want a working inference button. A natural continuation of thought that has lead to NUTS which is the workhorse of PyMC3 is Riemannian HMC and I’m itching to try it out. Despite many papers coming out, as far as I know no user-friendly packages exist for that yet, only “research-quality” code. I believe that whoever is first to provide a friendly package for RHMC is going to win quite some user base.

From my brief exposure it seems that people at STAN have been working on thist at least since 2014 but haven’t rolled it out yet, likely due to the fact that they have to write their own C++ for higher derivatives.

If we don’t want to block ourselves from being able to implement RHMC in the future I think we should pick a backend where higher order differentiation is a first class citizen. From brief googling none of the listed packages can fully claim this (although PyTorch seems to be moving in that direction). Autograd, claiming just that, was mentioned elsewhere, but was said to be slow. Are there some benchmarks?

From the above story with STAN I conclude that the more exotic syntax the backend will have the slower the development will be, both due to individual developer speed and inability of less advanced user to contribute (although the latter might be desired:)).


We had a GSoC project to implement RHMC last year but it turned out to be much more difficult than anticipated. Numerical stability becomes a huge issue. I think the reason that STAN hasn’t rolled it out is that it doesn’t work all that well in practice (but this is just my impression, I could be wrong). As such, my interest in RHMC is waning and I think other methods like L2HMC are interesting directions.


Here is a recent discussion about RHMC on Stan forums


The google tangent seems like the right level of abstraction for pymc’s backend. I haven’t used it much so don’t have any real world experience, found about it recently and just tried some examples.


Hi all. I’m a lurker in this community and just noticed this nice discourse tool. This look like a great forum.

I’ve been looking at the Machine Learning frameworks and the state of array computing in Python and it looks like we have had some wonderful spreading of capability over the past several years.

Meanwhile, a few of us have been working on the foundations so that a better system could emerge at some point. We have made a lot of progress in Numba, Dask, and more recently XND. There is also the interesting work around CuPy, Bohrium, and other GPU arrays. In addition, the general automatic differentiation story has been reifying.

Of course at the same time, Tensorflow, MXNet, PyTorch, and a few others have also defined ndarray concepts of their own — typically interfacing with NumPy and it’s more rigid type-system and writing a new function system outside of NumPy’s “ufunc” system.

In my mind, it is a good time for a new low-level “ufunc” system for array objects in Python as well as a refactoring of the capabilities of NumPy into lower-level components that can be re-used by things like PyTorch, Pyro, and other high-level systems.

I can see an opportunity, for PyMC4 to be a direct “customer” of the work we are planning to do on combining Numba + CuPy + Tangent along with XND to provide a more flexible array-container concept.

XND is not quite ready for even alpha-level consumption (we have docs to write and more ufuncs to build – but the bones are all there now if you want to take a peak: (the name comes from ex uno plures “from one comes many” which is a play on inverting e pluribus unum). The name emphasizes the idea that plures is about refactoring the capabilities of NumPy into C-libraries with Python interfaces that can then be re-used by many other systems. But, we will be rolling out the idea in a few months under the “xnd” brand which is the generic container that generalizes the NumPy container to things like variable-length arrays and is straightforward to extend with many-other kinds of data-types.

XND is Stefan Krah’s work product but he and I have been collaborating on the architecture for the past 2 years when I finally had an epiphany of how I felt the future of array computing should look like. And that future meant an expansion of the buffer protocol to multiple languages and a refactoring of NumPy into its core components (dtype system, ufunc system, array container)

We plan to make an early alpha release by May and then have a beta release by the end of the summer.

To summarize (and throw in one more idea) I have three separate suggestions:

  1. PyMC4 could support NumPy + CuPy + Tangent as a framework to build on for the future with NumPy/CuPy arrays as the array object.

  2. PyMC4 could also start to provide optional support for XND for data-types and features that are not otherwise available.

  3. PyMC4 could support Dask for creating parallel workflows (if you look at how distributed Tensorflow is architected, for example, it looks very similar to Dask, except Dask’s Python API is arguably much cleaner). If other things from PyTorch are missing, then perhaps Chainer could be used as it uses NumPy for the array object and does not introduce another array concept (fortunately this time around we have an extended buffer protocol and so you can still share data between competing arrays).

Thanks for reading this far. We have a plures gitter channel if you want to drop by and say hi:


I actually posted on Matthew Rocklin’s blog about using Dask instead of joblib, and he seemed skeptical that it would help ( Maybe you’re envisioning it at a different part of the computation?


Hi, here are some of my thoughts. (If anyone disagrees or find my ideas bogus, please correct me)

Some steps in Bayesian computing currently can be parallelized some cannot. It depends on the model if those steps are bottleneck or not.

  1. evaluation of the likelihood + transformation (these depend on the model, in some cases, they can use sparse methods and/or parallel methods)
  2. AD (reverse / forward mode; if one comes with a parallel and scalable method, that would be great; some clever optimization here are possible)
  3. Sampling method (e.g. NUTS cannot be parallelized (?), some variational methods can be).

Anybody knows if there is faster AD than that Stan has?