I joined here (from the Stan forums) to say that @bob-carpenter and the other founding Stan members are incredible. The apocryphal story (i.e., I'm making this up) is that they looked at Theano, couldn't understand the docs, so they said NUTS to that, we'll just build our own AD system from scratch, our own PPL, our own compiler, and this newfangled multinomial HMC thing.
Since the topic is about user experience, I hope my comments are relevant.
I spend most of my time writing ODE systems for infectious disease modelling. I appreciate that there are several research papers out there that provide workflows for this kind of analysis in Stan (especially after COVID). Those workflows are quite useful and relevant to me too. When I started modelling, though, I found model specification in Stan a bit daunting, because my model structures are quite hard to formalise as equations in Stan. If those workflows had been available at the beginning, I probably would have made different decisions.
PyMC was easier for me to pick up. The process from ideas to model structures to code is faster too. I think that's because I was more familiar with Python, so that's that.
On a side note, I saw this tutorial on NumPyro in epidemiology. It looks very interesting. After I finish my thesis, I will have a look at it. My feeling is that NumPyro is also very Python-centric (?)
I’d like to add that PyMC, as a PPL, includes some very useful and powerful “helper” (?) functions to build models and understand my problem. Thanks to the PyMC developers.
Thanks—this is really great. I’d love to hear from more users like this.
This is the common refrain I’ve heard from PyMC users—they really really don’t want to learn another language on top of Python. For reasons I don’t understand, the R users are less resistant. Maybe because the syntax is more similar? I would have expected more resistance because R is also implicitly typed. If I had it to do over again, I would have followed NumPy’s structure and not tried to differentiate arrays and linear algebra types the way that Boost (C++ matrix library) does. The advantage of doing it the way Boost does is that you can infer the types of the result. A vector times a row vector is a matrix and a row vector times a vector is a scalar. Back when I was a professor I worked on a mix of PL (programming languages) and NLP (natural language processing), largely around typing.
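The typing point can be made concrete: in a library with distinct vector/row-vector/matrix types the result type of a product is inferable statically, whereas NumPy has a single array type whose shapes only settle at runtime. A small sketch (the arrays are made up for illustration):

```python
import numpy as np

# NumPy has one array type, so "vector vs row vector" is just shape.
col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): a column vector
row = np.array([[4.0, 5.0, 6.0]])      # shape (1, 3): a row vector

outer = col @ row  # (3, 1) @ (1, 3) -> (3, 3) matrix (outer product)
inner = row @ col  # (1, 3) @ (3, 1) -> (1, 1) array, not a true scalar

print(outer.shape)  # (3, 3)
print(inner.shape)  # (1, 1)
```

Note that the row-vector-times-vector result comes back as a 1x1 array rather than a scalar, which is exactly the kind of conflation discussed elsewhere in this thread.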
Could you elaborate on what the helper functions do?
Memory is never very reliable, but my recollection is that we didn’t want to use Theano for two reasons.
a. It was Python embedded, and we didn’t want a Python dependency.
b. We thought it was doing purely symbolic autodiff, which we thought was a dead end for efficiency (it’s hard to generate performant derivative code symbolically—reverse mode autodiff automatically creates a dynamic programming solution for shared variables).
For (b), this was mainly me, Matt Hoffman, and Daniel Lee. The big decision point for us was whether to work dynamically (like PyTorch) or statically (like TensorFlow). We went with dynamic because it was going to be much easier to launch a math library and more flexible for users. In retrospect, given the architecture of GPUs, the TensorFlow approach adopted by JAX makes a lot of sense.
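The point in (b), that reverse-mode autodiff automatically creates a dynamic-programming solution for shared subexpressions, can be sketched in a few lines of Python. This is a toy illustration, not any real library's implementation:

```python
class Var:
    """Toy reverse-mode AD node: stores a value plus links to parents."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_partial)
        self.adjoint = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=[(self, other.value), (other, self.value)])

def grad(output, wrt):
    """Reverse sweep: each node's adjoint is *accumulated* (+=), so a
    subexpression used in several places is processed once, with its
    contributions summed. That accumulation is the dynamic-programming
    step that naive symbolic differentiation misses."""
    # Topological order via depth-first search.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += node.adjoint * local
    return wrt.adjoint

x = Var(3.0)
s = x * x          # shared subexpression
y = s * s          # y = x**4, so dy/dx = 4 * x**3 = 108
print(grad(y, x))  # 108.0
```

The shared node `s` gets its adjoint summed from both of its uses before its own gradient is pushed back to `x`, so the work for `s` is never duplicated.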
Also, it was really hard to find this stuff. Autodiff was super niche when we started in 2010, mostly being used by the applied math community to do sensitivity analysis on solvers. We didn't find ADMB until much later, for example—that was a really groundbreaking autodiff (AD) model building (MB) system consigned to obscurity in the fisheries and wildlife community. Sort of the way emcee is largely used by the astrophysics community (it's pure Python and gradient-free, with really clever auto-tuning that is similar to what Matt Hoffman's doing these days).
Theano was definitely doing reverse-mode (and, more patchily, forward-mode) AD all along, and worrying about avoiding duplicate computations between the forward pass and the autodiff pass. It also had higher-order AD from the beginning.
TensorFlow 1.x was greatly inspired by Theano.
I've recently worked with Stan again a little, after mostly using PyMC for years, and the thing I missed most was (perhaps surprisingly) support for named dimensions. It might sound like a pretty minor detail, but to me it really makes a difference in practice.
And that is even though PyMC's support is hardly perfect: the random variables remember their dimension names, but intermediate variables don't, and just use NumPy broadcasting rules.
If I were to redesign something, I think that would be pretty high on my list of things to support.
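To illustrate why named dimensions help, here is a toy labeled-array wrapper. The `Labeled` class and its alignment rule are invented for illustration; PyMC (via model coords/dims) and xarray do this far more completely:

```python
import numpy as np

class Labeled:
    """Toy labeled array: a NumPy array plus dimension names.
    (Invented for illustration, not PyMC's or xarray's actual API.)"""
    def __init__(self, data, dims):
        data = np.asarray(data)
        assert data.ndim == len(dims)
        self.data, self.dims = data, tuple(dims)

    def __add__(self, other):
        # Align by *name* instead of by position: permute the other
        # operand's axes to match our dimension order before adding.
        perm = [other.dims.index(d) for d in self.dims]
        return Labeled(self.data + other.data.transpose(perm), self.dims)

a = Labeled(np.ones((2, 3)), dims=("city", "year"))
b = Labeled(np.ones((3, 2)), dims=("year", "city"))
c = a + b                    # aligned by name, no silent mis-broadcast
print(c.dims, c.data.shape)  # ('city', 'year') (2, 3)
```

With positional NumPy arrays, adding the transposed operand above would either error out or silently broadcast into something unintended; names make the alignment explicit.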
The NumPy broadcasting rules are actually a bit of a pain to work with in PyTensor too. If you want to follow them strictly, you can't tell until runtime what, for instance, a multiplication of two arrays will actually do. Depending on whether a shape is one, it might broadcast, error out, or do elementwise multiplication. But which one happens has implications for the derivatives, because those aren't the same across the cases.
If I could go back in time, I'd love to tell the NumPy devs not to use a dimension length of 1 to indicate that something can be broadcast, but a dimension length of None instead. I think that would have prevented a lot of strange bugs over the years…
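A minimal NumPy demonstration of that runtime ambiguity (the arrays here are made up for illustration):

```python
import numpy as np

a = np.ones((3, 1))

b_same = np.ones((3, 1))   # matching shapes: plain elementwise multiply
b_wide = np.ones((3, 4))   # size-1 axis in `a` silently broadcasts

print((a * b_same).shape)  # (3, 1)
print((a * b_wide).shape)  # (3, 4)

# A genuine mismatch only raises at runtime:
try:
    np.ones((3, 4)) * np.ones((3, 5))
except ValueError:
    print("incompatible shapes raise ValueError")

# The gradient of `a * b` w.r.t. `a` differs across these cases: when `a`
# broadcasts, the gradient must be summed over the broadcast axis; when it
# doesn't, no reduction happens. A graph compiler can't know which case
# applies until it sees the concrete shapes.
```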
This has been really high among our users' requests. Probably because it's pretty well supported everywhere in R.
While you're at it, could you slip a note to the R developers not to conflate size-1 arrays with scalars? That is such a mess on the debugging side, and R goes a lot further into the weeds than NumPy.
It could be force of habit, as I was taught C++ and MATLAB during my college days.
Thanks a lot for providing the background knowledge about Stan (and Bayesian methods). Very good learning materials!
I like the ArviZ library in general. Within PyMC, there are two functions I find particularly useful:

- `pm.Potential`: a handy function for imposing constraints. I used it to weight my data points; I need to do evidence synthesis from multiple sources, so being able to weight data is great.
- `pm.Truncated`: when building a model, I need to look at particular regions of my variables for various reasons (e.g., experimenting with model structures/functions, sourcing priors from the literature, or sometimes just to get some results because the model is not fitting properly). I did make a point of removing all these truncated distributions in the final model.
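For readers unfamiliar with pm.Potential: it adds an arbitrary term to the model's joint log-probability. The data-weighting pattern mentioned here can be sketched in plain Python (a conceptual sketch with made-up numbers, not PyMC code):

```python
import math

def normal_logpdf(x, mu, sigma=1.0):
    # Log density of a normal distribution, written out by hand.
    return (-0.5 * math.log(2 * math.pi * sigma ** 2)
            - (x - mu) ** 2 / (2 * sigma ** 2))

def weighted_loglik(data, weights, mu):
    # Scale each observation's log-likelihood contribution by a weight.
    # Adding `weight * logp` terms to the joint log-probability is the
    # kind of thing pm.Potential enables inside a PyMC model.
    return sum(w * normal_logpdf(x, mu) for x, w in zip(data, weights))

data = [1.0, 2.0, 10.0]
full = weighted_loglik(data, [1.0, 1.0, 1.0], mu=1.5)
down = weighted_loglik(data, [1.0, 1.0, 0.1], mu=1.5)  # down-weight one source
print(full < down)  # True: the down-weighted outlier source counts for less
```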
The Stan docs recommend ArviZ for CmdStanPy. Aki Vehtari wrote some of the code for the methods he refined (R-hat, ESS, LOO) and made sure they give the same results as Stan's R package `posterior` and as our reference implementations in C++. I really like seeing the results synchronized like this across state-of-the-art implementations.