Exploring Streaming / Minibatch Inference in PyMC

Hello! I am Jhanani. I have recently been following the Streaming/Online Inference project idea for GSoC 2026.

As part of that, I have been working with PyMC and built a small prototype to understand streaming and minibatch-style inference.

The demo shows how the posterior distributions for theta_A and theta_B evolve as new data chunks arrive, which is the core idea behind streaming or minibatch inference. The posterior means gradually converge toward the true parameter values, showing that knowledge accumulates across successive chunks. Note that this example updates the priors manually, feeding each chunk's posterior counts back in as the next prior; true minibatch inference in PyMC would instead use pm.Minibatch with variational inference (ADVI) to handle large datasets efficiently. I see this conceptual demo as a foundation for exploring real minibatch/streaming inference on larger datasets and, eventually, integration with Dask.

Here is the core of the demo, distilled into a short sketch.
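The true rates, chunk size, and seed below are illustrative placeholders rather than the exact values from my run; the point is the conjugate update, where each chunk's posterior counts become the next chunk's prior.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "true" rates for the two variants (placeholder values)
true_theta = {"A": 0.65, "B": 0.40}

# Start from flat Beta(1, 1) priors on theta_A and theta_B
alpha = {"A": 1.0, "B": 1.0}
beta = {"A": 1.0, "B": 1.0}

n_chunks, chunk_size = 10, 50
for i in range(n_chunks):
    for name, p in true_theta.items():
        # A new chunk of Bernoulli observations arrives from the "stream"
        chunk = rng.binomial(1, p, size=chunk_size)
        # Conjugate update: the current posterior counts become the
        # prior for the next chunk, so knowledge accumulates over time
        alpha[name] += chunk.sum()
        beta[name] += chunk_size - chunk.sum()
        post_mean = alpha[name] / (alpha[name] + beta[name])
        print(f"chunk {i + 1:2d} | theta_{name}: posterior mean = {post_mean:.3f}")
```

With enough chunks the printed posterior means settle near the true rates, which is the convergence behaviour described above.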

I would appreciate your guidance on the following:

  1. Should I focus on implementing pm.Minibatch with ADVI first, or continue refining conceptual streaming approaches like this demo?

  2. When using minibatches in PyMC, how should the likelihood be scaled so that it correctly approximates full-data inference? (My current understanding is sketched just after this list.)

  3. For this project, should I primarily focus on variational inference (ADVI), or also explore extending MCMC methods such as NUTS?

  4. For the proposal, would a notebook demonstrating minibatch inference with simulated data be sufficient, or should I aim for a more advanced prototype?
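To make question 2 concrete, here is my current understanding of how the scaling works with today's API: a sketch using pm.Minibatch and total_size on simulated data (the model, batch size, and dataset are illustrative, and I would appreciate corrections if I have the semantics wrong).

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
N = 100_000
data = rng.normal(loc=1.5, scale=2.0, size=N)  # illustrative large dataset

with pm.Model():
    # Each optimisation step sees only batch_size points from the full array
    batch = pm.Minibatch(data, batch_size=256)

    mu = pm.Normal("mu", 0, 10)
    sigma = pm.HalfNormal("sigma", 5)

    # total_size rescales the minibatch log-likelihood by N / batch_size,
    # so its expectation matches the full-data log-likelihood
    pm.Normal("obs", mu=mu, sigma=sigma, observed=batch, total_size=N)

    approx = pm.fit(n=10_000, method="advi")  # stochastic VI over minibatches
    idata = approx.sample(1_000)
```

My reading is that this rescaling is what makes ADVI's stochastic gradient an unbiased estimate of the full-data gradient; without total_size, the minibatch likelihood would be underweighted relative to the prior.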

I think a simple notebook is a great start! No need for an advanced prototype just yet.

Thank you for the guidance. I will focus on developing a clear notebook to explore streaming-style inference in PyMC.

I wanted to clarify the direction to make sure I'm aligned with the project goals. I am considering a few approaches: focusing on minibatch-based inference, simulating streaming data with incremental updates, or incorporating tools like Dask to handle chunked data.

Which of these directions would you recommend prioritizing at this stage?

Focus on the base functionality in the library itself. This is a good chance to get to know how PyTensor works, how we do minibatching now, and how we might do it better. The Dask integration will come later.

Thank you so much for the guidance. I will try to focus on these areas.
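As a first step on the PyTensor side, here is my rough mental model of how a random minibatch slice can be expressed as a graph: a toy sketch with made-up data, not PyMC's actual Minibatch implementation.

```python
import numpy as np
import pytensor
from pytensor.tensor.random.utils import RandomStream

# A toy "full" dataset held in a shared variable, so it lives in the graph
full_data = pytensor.shared(np.arange(20, dtype="float64"), name="full_data")

srng = RandomStream(seed=123)
# Fresh random row indices are drawn each time the graph is evaluated
idx = srng.integers(low=0, high=full_data.shape[0], size=(4,))
batch = full_data[idx]

draw_batch = pytensor.function([], batch)
print(draw_batch())  # a different random subsample on each call
print(draw_batch())
```

If this is roughly the right picture, I will read through the actual pm.Minibatch source next to see where it differs.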