Hello! I'm Jhanani. I have been following the Streaming/Online Inference project idea for GSoC 2026, and I recently implemented a small prototype in PyMC to understand streaming and minibatch-style inference.
This demo illustrates how the posterior distributions for `theta_A` and `theta_B` evolve as new data chunks arrive. The posterior means gradually converge to the true parameter values, showing that knowledge accumulates effectively over successive chunks. This example manually updates the priors using the previous posterior counts; true PyMC minibatch inference would instead use `pm.Minibatch` with variational inference (ADVI) to handle large datasets efficiently. I see this conceptual demonstration as a foundation for exploring real minibatch streaming inference on larger datasets and integration with Dask.
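To make the manual prior-updating scheme concrete, here is a minimal NumPy sketch of the conjugate Beta-Bernoulli update I have in mind (my own simplified illustration, assuming Bernoulli data with flat Beta(1, 1) priors and hypothetical true rates 0.6 and 0.3, not the exact snippet below):

```python
import numpy as np

rng = np.random.default_rng(42)
true_theta_A, true_theta_B = 0.6, 0.3  # hypothetical true rates

# Flat Beta(1, 1) priors for both parameters.
alpha_A, beta_A = 1.0, 1.0
alpha_B, beta_B = 1.0, 1.0

# Process the stream in chunks; each chunk's posterior becomes
# the next chunk's prior (the Beta-Bernoulli conjugate update).
for chunk in range(10):
    data_A = rng.binomial(1, true_theta_A, size=100)
    data_B = rng.binomial(1, true_theta_B, size=100)
    alpha_A += data_A.sum()
    beta_A += len(data_A) - data_A.sum()
    alpha_B += data_B.sum()
    beta_B += len(data_B) - data_B.sum()

# Posterior means after all chunks; these should sit near the true rates.
post_mean_A = alpha_A / (alpha_A + beta_A)
post_mean_B = alpha_B / (alpha_B + beta_B)
print(post_mean_A, post_mean_B)
```

Because the Beta prior is conjugate to the Bernoulli likelihood, this chunk-by-chunk update is exactly equivalent to a single batch update on all the data, which is what makes it a useful sanity check for streaming behaviour.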
Here is the code snippet and output:
I would appreciate your guidance on the following:
- Should I focus on implementing `pm.Minibatch` with ADVI first, or continue refining conceptual streaming approaches like this demo?
- When using minibatches in PyMC, how should the likelihood be scaled to correctly approximate full-data inference?
- For this project, should I primarily focus on variational inference (ADVI), or also explore extending MCMC methods such as NUTS?
- For the proposal, would a notebook demonstrating minibatch inference with simulated data be sufficient, or should I aim for a more advanced prototype?

