Reproducibility & Scalability with PyMC

Hello, PyMC community!

I wanted to share with you my first series ‘Reproducibility & Scalability’ on my personal blog. In four sub-posts I go over:

  1. How I set up my personal projects using Kedro, a framework that aims to bring software engineering best practices to data science workflows
  2. How I personally use the ModelBuilder class from pymc-extras to get a PyMC model ready for production use
  3. How I integrate MLflow into the Kedro workflow and track PyMC experiments
  4. How I set up a Ray cluster and run one PyMC model on many datasets asynchronously and in parallel

link to my blog

I also want to take a moment to express my appreciation to the PyMC community. My entire journey to knowledge-sharing was heavily inspired by @juanitorduz and his incredible blog. I have also been trying to be more involved in the open source community by contributing to the code base and by answering some questions on the discourse (admittedly sometimes giving bad advice…). I really appreciate the patience and kindness that @ricardoV94 @jessegrabowski and @bwengals have given me.

The PyMC community is truly composed of really awesome people!

Thank you all!
Sincerely,
Jonathan


Thanks for your kind words @Dekermanjian!

Your blog looks really nice! I think a more “mature” version of the model builder can be found in model_builder — Open Source Marketing Analytics Solution. Also, there is great MLflow support for PyMC models in mlflow — Open Source Marketing Analytics Solution, thanks to @williambdean (Will Dean, williambdean on GitHub).

BTW, have you tried any Ray + PyMC integration? (Just curious.)

Thank you @juanitorduz!! That is very cool. I did not know about the model_builder or the MLflow support in the PyMC-Marketing module. I will definitely check it out!

For Ray + PyMC, do you mean creating a module that handles scaling PyMC on Ray? I have thought about implementing it, but right now I would only be able to implement model parallelization, because MCMC is inherently sequential: each draw in a chain depends on the previous one. Interestingly, I was reading a paper where you can implement data parallelization with MCMC by using the Shepherding distribution and a shorter resource. I was thinking about seeing if I could implement this in PyMC and then connect everything with Ray to allow both model and data parallelization.
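
To make the "model parallelization" idea concrete, here is a minimal sketch using only the Python standard library. It is not the code from the blog: a hand-written random-walk Metropolis sampler stands in for PyMC, and `concurrent.futures.ThreadPoolExecutor` stands in for Ray tasks (threads do not give real CPU parallelism in CPython, so this only shows the submit-and-gather pattern). The names `metropolis_mean`, `fit_many`, `make_datasets`, and the `site_*` dataset labels are all hypothetical. The point is that the loop inside one chain cannot be split, while the fits for different datasets are independent and can be dispatched asynchronously.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor, as_completed


def log_post(mu, data):
    # Log posterior of mu (up to a constant) for a Normal(mu, 1)
    # likelihood with a flat prior on mu.
    return -0.5 * sum((x - mu) ** 2 for x in data)


def metropolis_mean(data, draws=2000, tune=500, step=0.5, seed=0):
    """Random-walk Metropolis for mu; returns the posterior mean estimate."""
    rng = random.Random(seed)
    mu = sum(data) / len(data)  # start the chain at the sample mean
    current = log_post(mu, data)
    kept = []
    for i in range(draws + tune):
        # Each step depends on the previous state, so this loop is
        # sequential and cannot be spread across workers.
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_post(proposal, data)
        accept_prob = math.exp(min(0.0, candidate - current))
        if rng.random() < accept_prob:
            mu, current = proposal, candidate
        if i >= tune:  # discard the tuning (burn-in) draws
            kept.append(mu)
    return sum(kept) / len(kept)


def make_datasets(seed=123, n=50):
    # Hypothetical toy data: three independent datasets with different means.
    rng = random.Random(seed)
    true_means = {"site_a": 0.0, "site_b": 2.0, "site_c": -1.0}
    return {
        name: [rng.gauss(m, 1.0) for _ in range(n)]
        for name, m in true_means.items()
    }


def fit_many(datasets, max_workers=4):
    """Fit the same model to every dataset, one independent task each."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(metropolis_mean, data, seed=i): name
            for i, (name, data) in enumerate(datasets.items())
        }
        # Collect results in whatever order the tasks finish.
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


datasets = make_datasets()
results = fit_many(datasets)
for name in sorted(results):
    print(name, round(results[name], 3))
```

With Ray, the same shape would presumably map to one remote task per dataset, with the results gathered at the end. Data parallelization is the harder part, since it would require splitting the work inside a single fit.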
