We have had some thoughts about a new computational backend on top of PyTorch, to get dynamic graphs and flexible modeling. The essential thing needed is to answer the question "Why do we need something new?". Let's discuss it here.
For me it's the possibility to have undirected graphs and Markov blankets; both are not possible, or very difficult, with a static graph.
Thinking it over, it came to my mind that PyTorch is not suitable for Bayesian reasoning, at least with our current ideology. We can't have symbolic inputs there, which are a core part of PyMC3.
@ferrine Can you elaborate a bit?
Why do you need symbolic computation for Bayesian reasoning? If there is information/uncertainty flowing from one node to the other, it shouldn't matter whether the graph is defined beforehand or on the fly.
I think the problem is how to do something like this in PyTorch:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt
with pm.Model() as model:
    a = pm.Uniform('a', 0, np.pi / 2)
    b = pm.Deterministic('b', tt.arctan(a))
```
This is executed only once, so how would we later know that we need to recompute tt.arctan(a)? We'd have to change the whole model definition as far as I can see. I think PyMC2 did something like that; that's why you had to define functions all the time (not sure though, I haven't used it much).
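To make the re-execution point concrete, here is a rough sketch, not an existing or proposed API, of what a PyMC2-style, function-based model could look like in PyTorch: the body is re-run on every call, so the deterministic transform is recomputed automatically.

```python
# Hypothetical sketch only: wrapping the model body in a function that is
# re-executed per evaluation, so torch.atan(a) is rebuilt on every call.
import math
import torch

def logp(a_value):
    a = a_value.clone().requires_grad_(True)          # free variable a ~ Uniform(0, pi/2)
    b = torch.atan(a)                                 # deterministic transform, recomputed each call
    lp = -math.log(math.pi / 2)                       # Uniform(0, pi/2) log-density (constant on its support)
    lp = lp + torch.distributions.Normal(0.0, 1.0).log_prob(b)  # some likelihood term on b
    return lp, a

lp, a = logp(torch.tensor(0.5))
lp.backward()                                         # gradient flows through the freshly built graph
print(lp.item(), a.grad.item())
```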
PyMC2 variables always knew who their parents and children were, so they knew what did and did not need to be recomputed any time a variable changed value.
Hi, author of https://github.com/stepelu/ptstat here; one of your core developers pointed me to this discussion.
From my cursory look at PyMC3, it seems that it relies on a static graph approach, while PyTorch adopts a dynamic one, like Chainer.
Out of curiosity, is the approach for gradients (when needed) something along the lines of the "Gradient Estimation Using Stochastic Computation Graphs" paper?
I’d like as much as possible to avoid re-inventing the wheel (also given I already developed a scientific library for another language), so I welcome integration / cooperation on this topic.
Hi, I've looked at the paper. I don't think we use a lot of complicated things when computing gradients, as they come out of the box from Theano.
What about this backend?
DyNet also supports both static and dynamic graphs.
Just to keep the discussion alive.
Some summary of the discussion elsewhere:
PyTorch (backed by Facebook)
- Pros:
  - Dynamic graph
  - Easy GPU acceleration
  - Seems to be growing fast and well supported
  - Use of standard debugging tools
  - NumPy-like syntax (probably easier for end-users)
  - Higher-order derivatives
- Cons:
  - Not well suited for tricks we did in PyMC3 (array_ordering).
  - MCMC can be hard to implement because of PyTorch's design. For now we collect all variables and compute the joint logp, which is tricky if you have a dynamic graph. It would require us to change how models are defined (to something that looks similar to PyMC2); see the sketch below this list.
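To illustrate that last con, here is a rough sketch (the Model class and its methods are hypothetical, not a proposal) of how log-probability terms would have to be collected eagerly, since with a dynamic graph there is no pre-built graph to walk when computing the joint logp.

```python
# Hypothetical sketch: each random variable registers its log-prob term as the
# model body runs, so the joint logp is just the sum of what was collected.
import torch
from torch.distributions import Normal, Uniform

class Model:
    def __init__(self):
        self.logp_terms = []

    def register(self, dist, value):
        self.logp_terms.append(dist.log_prob(value).sum())
        return value

    def joint_logp(self):
        return sum(self.logp_terms)

m = Model()
a = m.register(Uniform(0.0, 1.0), torch.tensor(0.3, requires_grad=True))
b = m.register(Normal(a, 1.0), torch.tensor(0.1))
lp = m.joint_logp()
lp.backward()
print(lp.item(), a.grad)
```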
Some remarks:
If we go for PyTorch we should focus on variational inference for deep learning; it's rather well suited for that. For example, if we ever wanted to do RL, it seems like a dynamic graph would be the way to go.
I'm not sure automatic graph optimisations are possible. There seems to be some work towards a tracing JIT, but that seems to only eliminate runtime overhead right now (if I understand it correctly).
Basing a Bayesian framework on a dynamic computational graph creates one major challenge: providing a clean user interface to express models in. To counter this, one gets the inherent flexibility and ease of use of a dynamic computational graph.
It becomes orders of magnitude easier to implement complicated algorithms. It took me less than an hour to implement NUTS from scratch in PyTorch, while I still have not seen any NUTS implementation in TF, despite two Bayesian frameworks already existing on top of TF. This certainly extends to more cutting-edge VI applications as well as just more flexible Bayesian models.
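This is not the actual implementation mentioned above, just a minimal leapfrog integrator sketch that illustrates the point: gradients of an arbitrary Python log-density come from a single backward() call, which is what makes sampler prototyping so quick.

```python
import torch

def grad_logp(logp_fn, q):
    # gradient of an arbitrary Python log-density at q via autograd
    q = q.detach().requires_grad_(True)
    logp_fn(q).backward()
    return q.grad

def leapfrog(logp_fn, q, p, step_size, n_steps):
    # standard leapfrog: half momentum step, alternating full steps, half momentum step
    p = p + 0.5 * step_size * grad_logp(logp_fn, q)
    for _ in range(n_steps):
        q = q.detach() + step_size * p
        p = p + step_size * grad_logp(logp_fn, q)
    p = p - 0.5 * step_size * grad_logp(logp_fn, q)
    return q, p

logp = lambda q: -0.5 * (q ** 2).sum()   # standard normal target
q, p = leapfrog(logp, torch.zeros(2), torch.randn(2), step_size=0.1, n_steps=10)
print(q, p)
```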
One added benefit of PyTorch is that they are aiming to support an interface where PyTorch is a clean drop-in replacement for NumPy, i.e. replace import numpy as np with import torch as np and it should just work. PyTorch is not alone in having NumPy as a guideline for its interface. This means that it would be possible to support multiple backends as long as their syntax is close enough to NumPy/PyTorch, i.e. support PyTorch, Chainer, and MXNet (Gluon) as different backends.
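As a toy illustration of the NumPy-like claim (restricted to operations whose names happen to match across the two libraries), the same log-density function can be evaluated with either backend:

```python
import math
import numpy as np
import torch

def normal_logp(x, mu, sigma, lib):
    # `lib` is either the numpy or the torch module; only shared names are used
    return -0.5 * ((x - mu) / sigma) ** 2 - lib.log(sigma) - 0.5 * math.log(2 * math.pi)

print(normal_logp(np.array(1.0), np.array(0.0), np.array(1.0), np))
print(normal_logp(torch.tensor(1.0), torch.tensor(0.0), torch.tensor(1.0), torch))
```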
I think getting a good user interface with a dynamic graph framework would be a real problem. And we couldn't really expect graph optimizations / compilation at all (like with NNVM compilation through MXNet or XLA in TF).
I don't really see the advantage of implementing NUTS in the framework; we don't need any additional gradients, so we can do it outside of it easily. That is, as long as we don't have a lot of overhead from moving memory in and out of the framework (which seems to be the case for TF at least, and to a much lesser extent with MXNet). The dynamic graph might be quite interesting for VI, however.
I don’t think having multiple backends is feasible. There are a lot of subtle differences between the frameworks that make this very difficult.
The fact that MXNet has both a static and a dynamic framework combined makes it very interesting in my opinion. The static framework allows us to keep the declarative model definition, and the dynamic one allows us to play with algorithms that are difficult to implement statically.
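As a small sketch of MXNet's imperative side (the NDArray API with autograd recording), which coexists with the declarative mxnet.sym graph definition:

```python
# Minimal illustration: a log-density built imperatively and differentiated
# with MXNet's autograd, alongside the symbolic API that stays available.
from mxnet import autograd, nd

x = nd.array([0.5, -1.0])
x.attach_grad()
with autograd.record():
    logp = -0.5 * (x ** 2).sum()
logp.backward()
print(logp.asscalar(), x.grad)
```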
I suggest this is a very productive mindset: thinking programming model first and framework / backend second allows us to design at a finer grain than "pros/cons".
In a separate but related context, I am also looking to decide between TensorFlow and other frameworks from the point of view of automatic differentiation capabilities.
Can someone point to a spec on how PyMC4 depends on Theano from a functional and design point of view? (What is the interface?)
This interface can be refined and evolved until it is what we want, and then implemented for different backends as suggested above.
I plan to follow a similar approach to isolate the calculus engine from TensorFlow (if at all possible). And the gradient computation and statistical inference are related.
Thoughts?
I think from the point of view of automatic differentiation they are more or less the same. Some frameworks allow you to take second derivatives more easily (e.g., Chainer and TF).
If you meant PyMC3, you can have a look at pymc3/model.py. Theano is used mostly for building the logp function and its derivative. Theano is also heavily used in VI with lots of tricks.
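For reference, a minimal sketch (not PyMC3's actual internals) of what "building the logp function and its derivative" with Theano amounts to:

```python
import numpy as np
import theano
import theano.tensor as tt

x = tt.dvector('x')                   # free parameters
logp = -0.5 * tt.sum(x ** 2)          # e.g. a standard normal log-density (up to a constant)
dlogp = tt.grad(logp, x)              # symbolic derivative

f_logp = theano.function([x], logp)   # compiled callables handed to the samplers
f_dlogp = theano.function([x], dlogp)

print(f_logp(np.zeros(3)), f_dlogp(np.ones(3)))
```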
Reading this great article Introduction to Automatic Differentiation made me think that the implementation choices can make a serious difference.
Seems that "Hooks for custom derivatives" is related to what TensorFlow does here: https://www.tensorflow.org/extend/adding_an_op.
Anyway, sorry to discuss this here.
My point is that it would be good to have a DSL capturing the syntax and semantics of this functionality so that different implementations can be used while the DSL is stable.
See the general idea of #DenotationalDesign here: "Lambda Jam 2015 - Conal Elliott - Denotational Design: From Meanings To Programs" https://youtu.be/bmKYiUOEo2A?t=777
Curious if project stakeholders find this technique applicable.
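Purely as a hypothetical illustration of the DSL / stable-interface idea above (none of these names exist in PyMC3; they only sketch the shape of a backend-agnostic calculus engine):

```python
import abc
import torch

class CalculusBackend(abc.ABC):
    """Hypothetical stable interface; concrete names are illustrative only."""

    @abc.abstractmethod
    def logp(self, logp_fn, point):
        """Evaluate a joint log-density at `point`."""

    @abc.abstractmethod
    def grad_logp(self, logp_fn, point):
        """Evaluate the gradient of the joint log-density at `point`."""

class TorchBackend(CalculusBackend):
    def logp(self, logp_fn, point):
        return logp_fn(torch.as_tensor(point)).item()

    def grad_logp(self, logp_fn, point):
        q = torch.as_tensor(point).clone().requires_grad_(True)
        logp_fn(q).backward()
        return q.grad.numpy()

backend = TorchBackend()
target = lambda q: -0.5 * (q ** 2).sum()
print(backend.logp(target, [1.0, 2.0]), backend.grad_logp(target, [1.0, 2.0]))
```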
I meant PyMC3, indeed, thanks.
So having a TensorFlow or PyTorch backend is really about implementing this interface? Or is it more about a more intrusive change to the overall programming model?
What is VI please?
Thanks for the pointer - I have not put much thought into automatic differentiation before.
For VI I meant the variational inference module in PyMC3 https://github.com/pymc-devs/pymc3/tree/master/pymc3/variational
Depending on the backend, some would mean a complete change of API. But in general, most changes would be major, as there is a lot of Theano magic involved in various parts of the project.