GSoC 2019 and pymc4 contribution

Hey everybody!

My name is Rasul and I’m a DS student at Skoltech, Russia. I really want to contribute to pymc4 and work on it this summer. I’ve seen a couple of issues and ideas on how development could progress further. Still, I lack knowledge of how the current architecture is going to be improved and what the global design issues are right now (and that is, I guess, the main part of the project for this GSoC).

Should I concentrate on the open issues for now, or is there a better way to get involved?

Hi, Rasul! Happy to see you there.

I think commenting with your thoughts on our current issues and ongoing PRs is the best way to get involved. As you may see, pymc4 development is currently mostly about testing new directions and discussing design choices. It is a strategic decision to start with a well-designed architecture so that pymc4 is easy to extend.

The global design issue is that tensorflow lacks theano’s “neat features” such as theano.clone, and tensorflow graph management is add-only. Moreover, tensorflow is moving toward a functional, eager-first API (like pytorch). That’s why we prioritize our efforts on functional design.

Goals are:

  1. A model is a function: a data-generating process
  2. Models don’t change; every modification returns a copy, like in Pandas
  3. Execute the model function for the first time after it is configured
  4. Make xarray a first-class citizen (used for input data as well as storage of samples)
  5. Allow creation of submodules by treating a model like an RV
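
To make goals 1 to 3 a bit more concrete, here is a rough sketch in plain Python. It is not pymc4 code: the `Model` class, its `configure` method, and the `linear` function are hypothetical names invented for illustration, and it reads goal 3 as "the function body runs only when the configured model is called".

```python
import random
from dataclasses import dataclass, field, replace
from typing import Callable, Mapping


@dataclass(frozen=True)
class Model:
    """Hypothetical immutable wrapper around a generative function."""

    fn: Callable[..., dict]
    config: Mapping = field(default_factory=dict)

    def configure(self, **kwargs) -> "Model":
        # Goal 2: return a modified copy; the original model is untouched.
        return replace(self, config={**self.config, **kwargs})

    def __call__(self, rng: random.Random) -> dict:
        # Goal 3 (as read here): the function body runs only at this point.
        return self.fn(rng, **self.config)


def linear(rng, slope=1.0, noise=0.1, n=5):
    # Goal 1: the model is just a function describing how data is generated.
    x = [rng.random() for _ in range(n)]
    y = [slope * xi + noise * rng.gauss(0.0, 1.0) for xi in x]
    return {"x": x, "y": y}


base = Model(linear)
steep = base.configure(slope=3.0, noise=0.0)  # new object, base is unchanged
sample = steep(random.Random(0))
```

Here `base.config` stays empty after `configure` is called, which is the pandas-like behaviour described in goal 2.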

Challenges are:

  1. debugging. How does one get a good error message before sampling starts? In pymc3 the graph was built at runtime and there was no such issue. Now we have delayed construction (but we can run an inspection on the very first run).
  2. inspection. How do we inspect a pymc4 model like we did in pymc3? Setting smart starting values for sampling, transforming variables, etc.
  3. reparametrization. Choosing a parametrization is a common problem in hierarchical models, e.g. centered vs non-centered. We would like to create a unified way to do that.
  4. Variables as models. Some random variables are themselves compositions of other variables (Horseshoe, WishartBartlett). The best way to deal with this is to treat ANY variable or model as a model with the same API. This direction is not yet well explored. In my opinion it is the best design idea so far.
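
For anyone unfamiliar with the centered vs non-centered choice in challenge 3, a small sketch using only the standard library may help. The function names are made up for this example and nothing here is pymc4 API. Both functions describe the same Normal(mu, sigma) distribution; they differ only in which quantity is drawn directly.

```python
import random
import statistics


def centered(rng, mu, sigma, n):
    # Centered: draw theta directly from Normal(mu, sigma).
    return [rng.gauss(mu, sigma) for _ in range(n)]


def non_centered(rng, mu, sigma, n):
    # Non-centered: draw a standard-normal offset, then shift and scale it.
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]


mu, sigma, n = 2.0, 0.5, 20_000
a = centered(random.Random(0), mu, sigma, n)
b = non_centered(random.Random(1), mu, sigma, n)

# Both sets of draws should have roughly the same mean and spread.
summary = {
    "centered": (statistics.mean(a), statistics.stdev(a)),
    "non_centered": (statistics.mean(b), statistics.stdev(b)),
}
```

In a hierarchical model the two forms are equivalent as distributions, but a sampler explores the offset in the non-centered form rather than theta itself, which is why the choice matters and why a unified way to switch between them would be useful.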

If you have any comments or need clarification from me or other developers, feel free to ask.