Hey everyone,
I’m Harshith, and I’m looking to apply for GSoC 2026 with PyMC, specifically the Streaming inference project.
A bit about my background:
I’ve been contributing to a few projects in the Python ecosystem, mainly Dask and NumPy.
Dask: I’ve submitted a few PRs related to distributed computation, partition metadata correctness, serialization behavior, and docs.
NumPy: I’ve contributed enhancements and bug fixes related to dtype handling, CPU feature detection, and backend compatibility.
I also recently opened a PR in PyMC (#8116) that extends logprob support to non-overlapping switch transforms with non-zero thresholds. It adds a new measurable class, a graph rewrite, and the logprob logic for it.
Given that I’ve already worked with Dask’s chunked computation model and with NumPy, I’m really interested in working on streaming minibatch support and integrating Dask-backed data pipelines into PyMC.
Right now, PyMC’s minibatch support assumes the full dataset is already in memory, which is a problem for large or streaming datasets. What I’m proposing is a streaming minibatch adapter that works with chunked or lazy data sources such as Dask arrays or plain iterators. Since Dask already handles chunked and lazy computation, the adapter would expose a consistent minibatch interface while fetching data incrementally. That way, variational inference and Pathfinder could run on datasets larger than memory, or on data that arrives continuously, without loading everything upfront. I’ve already worked on a related issue in Dask (PR #12290).
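To make the idea a bit more concrete, here’s a rough sketch of just the buffering part in plain Python. It assumes chunks arrive from any iterable; a small generator stands in for Dask blocks here. `stream_minibatches` and `lazy_chunks` are hypothetical names I made up for illustration, not existing PyMC or Dask APIs, and the real adapter would still need to hook into PyMC’s minibatch and inference machinery.

```python
from typing import Iterable, Iterator, List, Sequence, TypeVar

Row = TypeVar("Row")


def stream_minibatches(
    chunks: Iterable[Sequence[Row]],
    batch_size: int,
    drop_last: bool = False,
) -> Iterator[List[Row]]:
    """Yield fixed-size minibatches from an iterable of chunks.

    Hypothetical helper: rows are buffered across chunk boundaries, so
    batch size does not depend on how the source happens to be chunked.
    Only the current chunk and one partial batch are held at a time.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List[Row] = []
    for chunk in chunks:  # chunks are pulled lazily, one at a time
        for row in chunk:
            buffer.append(row)
            if len(buffer) == batch_size:
                yield buffer
                buffer = []
    # Emit the trailing partial batch unless the caller wants it dropped.
    if buffer and not drop_last:
        yield buffer


def lazy_chunks(n_rows: int, chunk_size: int) -> Iterator[List[int]]:
    """Stand-in for a chunked source: produces each chunk only on demand."""
    for start in range(0, n_rows, chunk_size):
        yield list(range(start, min(start + chunk_size, n_rows)))


# 10 rows arriving in chunks of 4, consumed in minibatches of 3.
batches = list(stream_minibatches(lazy_chunks(10, 4), batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

The row-by-row loop is only there to keep the sketch short; with array-backed chunks, slicing whole blocks would be the obvious next step.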
I would really appreciate any guidance on where to start exploring the current minibatch and inference internals, and on how I can best prepare for this. Thank you!