Theano static computational graph optimization vs PyTorch/TF

I was reading this post: https://pymc-devs.medium.com/the-future-of-pymc3-or-theano-is-dead-long-live-theano-d8005f8a0e9b and was wondering if someone could give me a few examples or use cases, with explanations, where Theano-PyMC’s static computation graph has clear advantages over TF or PyTorch.

I’m interested both in the overall static-vs-dynamic computation graph tradeoffs and in the implementation mechanisms where Theano brings clear advantages.


The great @brandonwillard will have very interesting thoughts on this!

In the meantime, you can already read his comments on this HN thread (granted, most of the comments aren’t interesting, but some are instructive, especially Brandon’s).

I also just interviewed Brandon about symbolic computation on my podcast, and @_eigenfoo published a very interesting article about tensor libraries and their differences.

Hope this helps :vulcan_salute:


The C and JAX transpilation isn’t possible without “static” graphs (i.e. single graphs that represent the entirety of a computation), and neither are graph rewrites (e.g. “optimizations”) that use information about parent and/or child nodes. Converting operations to their in-place counterparts is one example of the latter.
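
To make that concrete, here’s a minimal Theano-PyMC sketch (assuming a working Theano install): the whole computation is one graph object, `theano.function` hands that graph to the rewrite pipeline and the C backend, and `debugprint` shows the rewritten result, in-place replacements included.

```python
import theano
import theano.tensor as tt

x = tt.dvector("x")
y = tt.exp(x) + 1  # builds graph nodes; nothing is computed yet

# Compiling hands the entire graph to the rewrite pipeline and C backend.
f = theano.function([x], y)

# Inspect the optimized graph; rewrites such as elemwise fusion and
# in-place replacements show up in this printout.
theano.printing.debugprint(f)
```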

Here’s a walkthrough of some simplifications that are not present in TensorFlow and how they improve the accuracy of the resulting computations.

There are considerably more of these in Theano-PyMC than in TensorFlow, and, more importantly, they’re programmable in Theano-PyMC, unlike in most (perhaps all) of the other tensor libraries.
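
As a small sketch of the accuracy point: under Theano’s default FAST_RUN mode, a stabilization rewrite replaces log(1 + x) with the numerically stable log1p(x), which the naive computation gets wrong for tiny x.

```python
import numpy as np
import theano
import theano.tensor as tt

x = tt.dscalar("x")
f = theano.function([x], tt.log(1 + x))  # stabilized to log1p(x) during compilation

print(f(1e-20))           # ~1e-20, the correct answer
print(np.log(1 + 1e-20))  # 0.0; the naive computation loses the small term
```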

Most simplifications like these aren’t possible with a “dynamic” graph (i.e. a result computed from (sub-)graphs that are partially constructed and/or discarded on the fly). For example, how can you simplify x / x if you don’t have some representation of that operation (i.e. a graph)?
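
In Theano-PyMC the answer is that the division does exist as a node in the global graph, so a canonicalization rewrite can cancel it. A quick sketch:

```python
import theano
import theano.tensor as tt

x = tt.dvector("x")
f = theano.function([x], x / x)

# The canonicalization pass cancels the division entirely; the printed
# graph just fills the output with ones, and no division node remains.
theano.printing.debugprint(f)
```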

By the way, if the distinction between “dynamic” and “static” graphs is ever only about how you represent the same information, and not about whether such information is available in the first place, then the distinction is arguably inconsequential, or at least pointless outside the context of a specific library.
For instance, if all the operations and their inputs are tracked in a way that allows one to produce a “static” graph, but that information simply isn’t represented in a standard “graph” format, then the library does have a static graph; it has just been obfuscated, and there should be a good reason for doing that. More likely than not, any such reason would again be library-specific (e.g. the standard graph classes are poorly designed and can’t be efficiently created and/or discarded).
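
PyTorch’s tracer is a good illustration: the “eager” operations are tracked well enough that a static graph can be produced on demand, e.g. via `torch.jit.trace`. A sketch (the recovered graph is not what PyTorch uses during ordinary eager execution):

```python
import torch

def f(x):
    return x / x  # eager: each call just produces a value

x = torch.ones(3)
traced = torch.jit.trace(f, x)  # re-runs f and records a static graph
print(traced.graph)             # the otherwise-hidden static representation
```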

Notice that the TensorFlow example in the walkthrough linked above was only possible after forcing TensorFlow to create a “static” graph, so even that sort of inspection requires one.
If you want to know anything about how things are computed, you need all the computational steps in one place, and a “dynamic” graph doesn’t provide that.
If you’re only tracking some of that information during the process of “dynamic” graph creation, then you only have a scope-limited/locally “static” graph. In other words, the “dynamic” graphs made available to you have limited memory; they “forget” the operations that were performed “dynamically” and give you no recourse to change that.
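
For reference, the “forcing” in TensorFlow 2 looks something like the following sketch: only after tracing a concrete function is there a complete graph to walk.

```python
import tensorflow as tf

@tf.function
def f(x):
    return x / x

concrete = f.get_concrete_function(tf.TensorSpec([None], tf.float32))

# Walk the recovered static graph; note that the division node is still
# present, i.e. no x / x cancellation was performed.
for op in concrete.graph.get_operations():
    print(op.name, op.type)
```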

What if you had a custom low-level operator that efficiently implemented an operation that consisted of many “eagerly”/“dynamically” computed operations used in tandem? You would never be able to automatically apply such an optimization without a static graph, because you wouldn’t even know when it could be applied, or–for that matter–have anything to which it could be applied!

More specifically, let’s say that such an operator is only practical in the context of a large graph in which said operation is called repeatedly (e.g. in a loop or an effectively unrolled loop). How would you be able to assess that without one global picture of the operations being performed?
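
This is what Theano-PyMC’s programmable rewrites are for. Here’s a sketch of a local optimization (the name `local_softplus_fusion` is mine, and it mirrors a rewrite Theano already ships) that replaces the pattern log(1 + exp(x)) with the fused, stable softplus(x); a custom low-level `Op` could be substituted by the same mechanism:

```python
import theano.tensor as tt
from theano.gof.opt import local_optimizer
from theano.tensor.opt import register_stabilize


@register_stabilize          # run during the stabilization pass
@local_optimizer([tt.log])   # only visit `log` nodes
def local_softplus_fusion(node):
    """Replace log(1 + exp(x)) with the fused softplus(x)."""
    (arg,) = node.inputs
    add_node = arg.owner
    if add_node is None or add_node.op != tt.add or len(add_node.inputs) != 2:
        return False
    lhs, rhs = add_node.inputs
    # Accept the constant 1 on either side of the addition.
    for one, expx in ((lhs, rhs), (rhs, lhs)):
        if expx.owner is not None and expx.owner.op == tt.exp:
            try:
                if tt.get_scalar_constant_value(one) == 1:
                    return [tt.nnet.softplus(expx.owner.inputs[0])]
            except tt.NotScalarConstantError:
                pass
    return False
```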

As I implied earlier, there’s also a relationship between “eagerly” computed graphs and these “dynamic” graphs that’s worth addressing.
Eagerly computed graphs are simply a shift of responsibility: they offload computations to the user level and, as a result, they

- needlessly blur the lines between NumPy and the construction of a graph,
- complicate efforts to script operations (e.g. if I’m iteratively constructing and/or evaluating graphs, I now need logic for both “eagerly” and non-“eagerly” computed results, and I have to swallow the local costs of numerous “eager” computations), and
- per the above, restrict optimizations to only the non-“eagerly” computed operations.
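
The blurring is easy to see side by side; a sketch: in NumPy (or an eager library) a line of code computes a value on the spot, while in Theano-PyMC the same line only extends the graph, and evaluation is a separate, explicit step.

```python
import numpy as np
import theano
import theano.tensor as tt

# Eager / NumPy: each line computes a concrete value immediately.
a = np.ones(3)
b = np.log(a + 1)  # done; no record remains of how b was computed

# Symbolic: each line only extends one global graph; nothing runs yet.
x = tt.dvector("x")
y = tt.log(x + 1)  # just a node, still rewritable (e.g. into log1p)

f = theano.function([x], y)  # all rewriting and compilation happens here, once
print(f(a))
```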


That’s what one calls a good bet :sunglasses:
