Hi, I’ve looked at the paper. I don’t think we use a lot of complicated things when computing gradients; they come out of the box from `theano`.

What about this backend?

dynet also supports both static and dynamic graphs.

Just to keep the discussion alive.

Some summary of the discussion elsewhere:

PyTorch (backed by Facebook)

- Pros:
  - Dynamic graph
  - Easy GPU acceleration
  - Seems to be growing fast and is well supported
  - Use of standard debugging tools
  - Numpy-like syntax (probably easier for end users)
  - Higher-order derivatives

- Cons:
  - Not well suited for tricks we did in PyMC3 (array_ordering).
  - MCMC can be hard to implement because of PyTorch’s design. For now we collect all variables and compute the joint logp; that is tricky if you have a dynamic graph. It would require us to change how models are defined (to something that looks similar to PyMC2).
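To make the con above concrete, here is a minimal sketch (all names invented, not PyMC internals) of the "collect all variables, then compute the joint logp" pattern. It works because the variable list is fixed at model-definition time; with a dynamic graph the set of variables can change on every execution, so there is no single fixed list to collect.

```python
import math

# Hypothetical sketch: a declarative model registers its free variables once,
# then builds one joint log-probability over that fixed collection.

class Normal:
    def __init__(self, name, mu, sigma):
        self.name, self.mu, self.sigma = name, mu, sigma

    def logp(self, value):
        z = (value - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma) - 0.5 * math.log(2 * math.pi)

class Model:
    def __init__(self):
        self.free_vars = []          # collected once, at definition time

    def var(self, dist):
        self.free_vars.append(dist)
        return dist

    def joint_logp(self, values):
        # static-graph assumption: the variable list never changes,
        # so summing over it is well defined
        return sum(d.logp(values[d.name]) for d in self.free_vars)

m = Model()
m.var(Normal("mu", 0.0, 10.0))
m.var(Normal("x", 1.0, 1.0))
print(m.joint_logp({"mu": 0.5, "x": 1.2}))
```

If the model body could branch at run time and create different variables per execution, `free_vars` would no longer be a stable description of the model, which is the difficulty described above.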

Some remarks:

If we go for PyTorch we should focus on variational inference for deep learning; it’s rather well suited for that. For example, if we ever wanted to do RL, a dynamic graph seems like the way to go.

I’m not sure automatic graph optimisations are possible. There seems to be some work towards a tracing JIT, but that appears to only eliminate runtime overhead right now (if I understand it correctly).

Basing a Bayesian framework on a dynamic computational graph creates one major challenge: keeping a clean user interface in which to express models. In return, one gets the inherent flexibility and ease of use of a dynamic computational graph.

It becomes orders of magnitude easier to implement complicated algorithms. It took me less than an hour to implement NUTS from scratch in PyTorch, while I still have not seen any NUTS implementation in TF, even though two Bayesian frameworks already exist on top of TF. This certainly extends to more cutting-edge VI applications as well as to more flexible Bayesian models.

One added benefit of PyTorch is that they are aiming to support an interface where PyTorch is a clean drop-in replacement for numpy, i.e. replace `import numpy as np` with `import torch as np` and it should just work. PyTorch is not alone in having numpy as a guideline for its interface. This means it would be possible to support multiple backends as long as their syntax is close enough to numpy/pytorch, i.e. support pytorch, chainer, and mxnet (gluon) as different backends.
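A small illustration of what "numpy as a guideline" buys: code written against the generic array API below runs under numpy as-is, and the claim above is that swapping the import line for a numpy-compatible backend would leave it working. The torch spelling in the comment is the hypothetical drop-in from the discussion, not something verified here.

```python
import numpy as np
# Per the discussion, the same code should also run after swapping the import:
#   import torch as np   # hypothetical drop-in replacement

def log_sum_exp(x):
    # numerically stable logsumexp using only operations
    # common to numpy-style array APIs: max, exp, sum, log
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

x = np.array([1.0, 2.0, 3.0])
print(log_sum_exp(x))
```

The catch, raised later in this thread, is that the frameworks differ in subtle ways (broadcasting corner cases, in-place semantics, dtype promotion), so "close enough to numpy" is doing a lot of work in that sentence.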

I think getting a good user interface with a dynamic graph framework would be a real problem. And we couldn’t really expect graph optimizations / compilation at all (like with nnvm compile through mxnet or XLA in tf).

I don’t really see the advantage of implementing NUTS inside the framework; we don’t need any additional gradients, so we can do it outside of it easily. That is, as long as we don’t have a lot of overhead from moving memory in and out of the framework (which seems to be the case for TF at least, and to a much lesser extent with mxnet). The dynamic graph might be quite interesting for VI, however.

I don’t think having multiple backends is feasible. There are a lot of subtle differences between the frameworks that make this very difficult.

The fact that mxnet combines both a static and a dynamic framework makes it very interesting in my opinion. The static framework allows us to keep the declarative model definition, and the dynamic one allows us to play with algorithms that are difficult to implement statically.

I suggest this is a very productive mindset: thinking `programming model` first and `framework` / `backend` second allows us to design at a finer grain than “pros/cons”.

In a separate but related context, I am also looking to decide between `TensorFlow` and other frameworks from the point of view of `automatic differentiation` capabilities.

Can someone point to a spec on how `PyMC4` depends on `Theano` from a functional and design point of view? (What is the interface?)

This interface can be refined and evolved until it is what we want, and then implemented for different backends as suggested above.

I plan to follow a similar approach to isolate the calculus engine from `TensorFlow` (if at all possible). And the `gradient computation` and `statistical inference` are related.

Thoughts?

I think from the point of view of automatic differentiation they are more or less the same. Some frameworks make it easier to take second derivatives (e.g., chainer and TF).

If you meant PyMC3, you can have a look at pymc3/model.py. Theano is used mostly for building the logp function and its derivative. It is also heavily used in VI, with lots of tricks.

Reading this great article https://alexey.radul.name/ideas/2013/introduction-to-automatic-differentiation/ made me think that the implementation choices can make a serious difference.

Seems that “Hooks for custom derivatives” is related to what `TensorFlow` does here: https://www.tensorflow.org/extend/adding_an_op.

Anyway, sorry to discuss this here.

My point is that it would be good to have a DSL capturing the syntax and semantics of this functionality so that different implementations can be used while the DSL is stable.

See the general idea of #DenotationalDesign here: *“Lambda Jam 2015 - Conal Elliott - Denotational Design: From Meanings To Programs”* https://youtu.be/bmKYiUOEo2A?t=777

Curious if project stakeholders find this technique applicable.

I meant PyMC3, indeed, thanks.

So having a `TensorFlow` or `PyTorch` backend is really about implementing this interface? Or is it more about an intrusive change to the overall programming model?

What is VI please?

Thanks for the pointer - I have not put much thought into automatic differentiation before.

For VI I meant the variational inference module in PyMC3 https://github.com/pymc-devs/pymc3/tree/master/pymc3/variational

Depending on the backend, some would mean a complete change of API. But in general, most changes would be major, as a lot of Theano magic is involved in various parts of the project.

I’ve read quite a bit recently about `automatic differentiation`:

*“Automatic Differentiation: The most criminally underused tool in the potential machine learning toolbox?”* : https://justindomke.wordpress.com/2009/02/17/automatic-differentiation-the-most-criminally-underused-tool-in-the-potential-machine-learning-toolbox/

- You write a subroutine to compute a function f(x) (e.g. in C++ or Fortran). You know f to be differentiable, but don’t feel like writing a subroutine to compute ∇f.
- You point some autodiff software at your subroutine. It produces a subroutine to compute the gradient.
- That new subroutine has the same complexity as the original function! It does not depend on the dimensionality of x.
- It also does not suffer from round-off errors!
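Those claims are easy to demonstrate with the smallest possible autodiff implementation: forward-mode with dual numbers. The derivative is carried along with the value (same asymptotic cost as evaluating f) and is exact up to floating point, with none of the round-off of finite differences. This is a toy sketch, not how the production frameworks (which use reverse mode) are built.

```python
import math

# Minimal forward-mode automatic differentiation via dual numbers.
# Each Dual carries a value and its derivative with respect to the input.

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule, applied mechanically
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

def sin(x):
    # chain rule for an elementary function
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def f(x):
    return x * x + 3 * x + sin(x)    # an ordinary subroutine

d = f(Dual(2.0, 1.0))                # seed dx/dx = 1
print(d.val, d.dot)                  # f(2) and f'(2) in a single pass
```

One pass through `f` produces both the value and the exact derivative, which is the "same complexity, no round-off" point from the quote. (The "independent of dimensionality" claim is specifically about reverse mode for gradients; forward mode costs one pass per input dimension.)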

*“Automatic Differentiation Variational Inference”* : https://arxiv.org/abs/1603.00788

we develop automatic differentiation variational inference (ADVI). Using our method, the scientist only provides a probabilistic model and a dataset, nothing else. ADVI automatically derives an efficient variational inference algorithm, freeing the scientist to refine and explore many models. ADVI supports a broad class of models; no conjugacy assumptions are required. We study ADVI across ten different models and apply it to a dataset with millions of observations. ADVI is integrated into Stan, a probabilistic programming system; it is available for immediate use.
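The core mechanism that autodiff enables here is the reparameterization trick: write the draw from q as z = m + s·ε with ε ~ N(0, 1), so the ELBO gradient flows through the sample. Below is a deliberately tiny sketch in plain Python (not the ADVI algorithm from the paper, which also handles constrained supports via transforms) for a toy target whose exact posterior is N(3, 1).

```python
import math, random

# Toy reparameterization-gradient VI: fit q(z) = N(m, s) to an
# unnormalized target p(z) proportional to N(3, 1) by stochastic
# gradient ascent on the ELBO.

rng = random.Random(0)

def dlogp(z):                        # d/dz log p(z) for p = N(3, 1)
    return -(z - 3.0)

m, log_s = 0.0, 0.0                  # variational parameters
lr = 0.05
for _ in range(3000):
    eps = rng.gauss(0.0, 1.0)
    s = math.exp(log_s)
    z = m + s * eps                  # reparameterized draw
    g = dlogp(z)
    m += lr * g                                # pathwise gradient wrt m
    log_s += lr * (g * s * eps + 1.0)          # pathwise term + entropy term

print(m, math.exp(log_s))            # should approach the exact (3, 1)
```

In a real framework `dlogp` would come from autodiff rather than being hand-written, which is exactly why ADVI only needs the model's log density.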

I have more if you like.

Another tangent about `probabilistic programming` with `functional programming` techniques.

Posted some papers on the `Figaro`

The rationale is to show how function composition can be used to create *composable MCMC algorithms*.

(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1103

(video bookmark) https://youtu.be/erGWMzzSUCg?list=PLnqUlCo055hX6SsmMr1AmW6quMjvdMPvK&t=1626

Key insight about composing handlers: a sequential Monte Carlo (SMC) handler + an MH handler => a particle MCMC handler.
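A stripped-down version of the composition idea, far simpler than the SMC + MH example from the talk: if an MCMC kernel is just a function from state to state, kernels compose with ordinary function composition. Everything below is illustrative, not the Figaro design.

```python
import math, random

# An MCMC kernel as a plain function state -> state; composition of
# kernels (each leaving the target invariant) is again a valid kernel.

def mh_kernel(logp, scale, rng):
    def step(x):
        prop = x + rng.gauss(0.0, scale)           # random-walk proposal
        if math.log(rng.random()) < logp(prop) - logp(x):
            return prop                            # accept
        return x                                   # reject
    return step

def compose(*kernels):
    def sweep(x):
        for k in kernels:
            x = k(x)
        return x
    return sweep

rng = random.Random(0)
logp = lambda x: -0.5 * x * x                      # standard normal target
# one sweep = a small-step kernel followed by a large-step kernel
sweep = compose(mh_kernel(logp, 0.5, rng), mh_kernel(logp, 2.0, rng))

x, samples = 0.0, []
for _ in range(2000):
    x = sweep(x)
    samples.append(x)
```

The talk's point is that with effect handlers this scales to composing whole *inference strategies* (SMC wrapped around MH giving particle MCMC), not just transition kernels.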

Again, the intent is to suggest a language/DSL-first, framework-last approach, to make the most out of this “crisis” caused by `Theano` going away.

I was just about to share that here!

I’ve started running through the pyro docs examples, and oh boy, it looks powerful but the interface is seriously non-intuitive!

Then the thought came to mind: what if PyMC4 was a wrapper around pyro, like Keras is for Theano/TF? Perhaps offloading the hard math to the budding pyro community and focusing on defining the best interface for probabilistic programming? I’m quite under-informed on the kind of effort that’s needed for this, so this is just a thought, I guess…? I think I’ll get to bump into Colin tonight at Boston Bayesians, so I’ll try to get his thoughts…

@ericmjl That’s a really interesting thought. We have considered the same with Edward / BayesFlow. Essentially both of those packages are aimed at researchers, giving a lot of flexibility at the cost of intuitive syntax. They can be viewed as a middle layer on top of the graph engine. PyMC3 has always shone at being beginner-friendly with easy syntax, so it can be seen as targeting the top level.

I’m not sure the existing syntax could work with pyro, however, as I think the model-creation code needs to be rerun.

One benefit of a dynamic graph would be models with non-parametric priors such as the CRP and IBP. I don’t see how these models can be sampled with static graphs.
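To see why a Chinese restaurant process is awkward in a static graph: the number of clusters, and hence the shape of everything downstream, is only known at run time. Sampling the seating process itself is a few lines of dynamic Python (a sketch of the prior draw only, not an inference algorithm):

```python
import random

# Draw table assignments from a Chinese restaurant process: customer i
# joins table k with probability proportional to counts[k], or opens a
# new table with probability proportional to alpha.  The data structures
# grow as the sample is drawn, which a static graph cannot express.

def crp_assignments(n, alpha, rng=None):
    rng = rng or random.Random(0)
    counts, labels = [], []
    for i in range(n):
        r = rng.uniform(0.0, i + alpha)
        acc, table = 0.0, len(counts)      # default: open a new table
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                table = k                  # join existing table k
                break
        if table == len(counts):
            counts.append(0)
        counts[table] += 1
        labels.append(table)
    return labels

print(crp_assignments(10, alpha=1.0))
```

Static-graph workarounds exist (truncated stick-breaking approximations with a fixed maximum number of components), but they change the model; the dynamic version samples the process as defined.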

I like this idea, but for now Pyro doesn’t implement MCMC. To my knowledge, Pyro is aimed at Bayesian deep learning, so it only has SVI.