Exploring Streaming / Minibatch Inference in PyMC

Hello! I am Jhanani. I have recently been following the Streaming/Online Inference project idea for GSoC 2026.

As part of that, I have been working with PyMC and built a small prototype to understand streaming and minibatch-style inference.

The demo shows how the posterior distributions for theta_A and theta_B evolve as new data chunks arrive, which is the core idea behind streaming or minibatch inference. The posterior means gradually converge toward the true parameter values, showing that knowledge accumulates across successive chunks. Note that this example updates the priors manually, feeding each chunk's posterior counts back in as the next prior; true minibatch inference in PyMC would instead use pm.Minibatch with variational inference (ADVI) to handle large datasets efficiently. I see this conceptual demo as a foundation for exploring real minibatch/streaming inference on larger datasets and, eventually, integration with Dask.

Here is the core of the demo, distilled into a short sketch.
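The true rates, chunk size, and seed below are illustrative placeholders rather than the exact values from my run; the point is the conjugate update, where each chunk's posterior counts become the next chunk's prior.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "true" rates for the two variants (placeholder values)
true_theta = {"A": 0.65, "B": 0.40}

# Start from flat Beta(1, 1) priors on theta_A and theta_B
alpha = {"A": 1.0, "B": 1.0}
beta = {"A": 1.0, "B": 1.0}

n_chunks, chunk_size = 10, 50
for i in range(n_chunks):
    for name, p in true_theta.items():
        # A new chunk of Bernoulli observations arrives from the "stream"
        chunk = rng.binomial(1, p, size=chunk_size)
        # Conjugate update: the current posterior counts become the
        # prior for the next chunk, so knowledge accumulates over time
        alpha[name] += chunk.sum()
        beta[name] += chunk_size - chunk.sum()
        post_mean = alpha[name] / (alpha[name] + beta[name])
        print(f"chunk {i + 1:2d} | theta_{name}: posterior mean = {post_mean:.3f}")
```

With enough chunks the printed posterior means settle near the true rates, which is the convergence behaviour described above.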

I would appreciate your guidance on the following:

  1. Should I focus on implementing pm.Minibatch with ADVI first, or continue refining conceptual streaming approaches like this demo?

  2. When using minibatches in PyMC, how should the likelihood be scaled so that it correctly approximates full-data inference? (My current understanding is sketched just after this list.)

  3. For this project, should I primarily focus on variational inference (ADVI), or also explore extending MCMC methods such as NUTS?

  4. For the proposal, would a notebook demonstrating minibatch inference with simulated data be sufficient, or should I aim for a more advanced prototype?
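To make question 2 concrete, here is my current understanding of how the scaling works with today's API: a sketch using pm.Minibatch and total_size on simulated data (the model, batch size, and dataset are illustrative, and I would appreciate corrections if I have the semantics wrong).

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
N = 100_000
data = rng.normal(loc=1.5, scale=2.0, size=N)  # illustrative large dataset

with pm.Model():
    # Each optimisation step sees only batch_size points from the full array
    batch = pm.Minibatch(data, batch_size=256)

    mu = pm.Normal("mu", 0, 10)
    sigma = pm.HalfNormal("sigma", 5)

    # total_size rescales the minibatch log-likelihood by N / batch_size,
    # so its expectation matches the full-data log-likelihood
    pm.Normal("obs", mu=mu, sigma=sigma, observed=batch, total_size=N)

    approx = pm.fit(n=10_000, method="advi")  # stochastic VI over minibatches
    idata = approx.sample(1_000)
```

My reading is that this rescaling is what makes ADVI's stochastic gradient an unbiased estimate of the full-data gradient; without total_size, the minibatch likelihood would be underweighted relative to the prior.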

I think a simple notebook is a great start! No need for an advanced prototype just yet.

Thank you for the guidance. I will focus on developing a clear notebook to explore streaming-style inference in PyMC.

I wanted to clarify the direction to make sure I'm aligned with the project goals. I am considering a few approaches: focusing on minibatch-based inference, simulating streaming data with incremental updates, or incorporating tools like Dask to handle chunked data.

Which of these directions would you recommend prioritizing at this stage?

Focus on the base functionality in the library itself. This is a good chance to get to know how PyTensor works, how we do minibatching now, and how we might do it better. The Dask integration will come later.

Thank you so much for the guidance. I will try to focus on these areas.
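As a first step on the PyTensor side, here is my rough mental model of how a random minibatch slice can be expressed as a graph: a toy sketch with made-up data, not PyMC's actual Minibatch implementation.

```python
import numpy as np
import pytensor
from pytensor.tensor.random.utils import RandomStream

# A toy "full" dataset held in a shared variable, so it lives in the graph
full_data = pytensor.shared(np.arange(20, dtype="float64"), name="full_data")

srng = RandomStream(seed=123)
# Fresh random row indices are drawn each time the graph is evaluated
idx = srng.integers(low=0, high=full_data.shape[0], size=(4,))
batch = full_data[idx]

draw_batch = pytensor.function([], batch)
print(draw_batch())  # a different random subsample on each call
print(draw_batch())
```

If this is roughly the right picture, I will read through the actual pm.Minibatch source next to see where it differs.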