Hey everyone,
I'm Harshith and I'm looking to apply for GSoC 2026 with PyMC, specifically the Streaming Inference project.
A bit about my background:
I've been contributing to a few projects in the Python ecosystem, mainly Dask and NumPy.
Dask: I have submitted a few PRs related to distributed computation, partition metadata correctness, serialization behavior, and docs.
NumPy: I've contributed enhancements and bug fixes related to dtype handling, CPU feature detection, and backend compatibility.
I also recently opened a PR in PyMC (#8116) extending logprob support for non-overlapping switch transforms with non-zero thresholds; essentially, I implemented a new measurable class, a graph rewrite, and the logprob logic for it.
Given that I've already worked with Dask's chunked computation model and with NumPy, I'm really interested in working on streaming minibatch support and integrating Dask-backed data pipelines into PyMC.
Right now, PyMC's minibatch support assumes everything is already in memory, which is a problem for large or streaming datasets. What I'm proposing is a streaming minibatch adapter that works with chunked or lazy data sources such as Dask arrays or plain iterators. Since Dask already handles chunked and lazy computation, the idea is to build an adapter that exposes a consistent minibatch interface while fetching data incrementally. This way, variational inference and Pathfinder could work on datasets larger than memory, or on data arriving continuously, without loading everything upfront. I've already worked on a related issue in PR #12290 in Dask.
I would really appreciate any guidance on where to start exploring the current minibatch and inference internals, and how I can best prepare for this. Thank you!
Hey there, I went through the minibatch and variational inference codebase and I think I have an idea of what a plausible solution could look like.
The main issue is in minibatch_rv.py, where the logp gets scaled by total_size / batch_size, but total_size has to be known upfront, which breaks for streaming data.
For Dask integration the idea is pretty simple: Dask arrays are lazy, so they don't load data until you call .compute(). Inside the existing fit loop in inference.py we could do something like:
```python
# assuming the array is chunked along the first axis: each delayed
# block is materialized on demand, so only one chunk is in memory at a time
for chunk in dask_array.to_delayed().ravel():
    shared_var.set_value(chunk.compute())
    step_func()
```
This way only one chunk lives in memory at a time. total_size can be read from dask_array.shape[0] if known, or set to None for infinite streams.
These are the changes I suspect are needed:
1. minibatch_rv.py — make get_scaling handle total_size=None
2. inference.py — add a streaming callback that pulls the next Dask chunk each iteration
3. opvi.py — fix symbolic_normalizing_constant to handle dynamic scaling
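As a rough sketch of change 1, the fallback for an unknown total_size could look something like this (pure-Python illustration; get_scaling_sketch is a hypothetical stand-in, not the actual PyMC function):

```python
# Hypothetical sketch of how get_scaling might treat an unknown total_size.
# When total_size is None (a stream of unknown length), fall back to a
# scale factor of 1.0 instead of computing total_size / batch_size.
def get_scaling_sketch(total_size, batch_size):
    """Return the factor used to rescale a minibatch logp to the full dataset."""
    if total_size is None:
        # unknown or infinite stream: no upfront rescaling is possible
        return 1.0
    return total_size / batch_size
```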
Rough idea:
```python
import dask.array as da
import pymc as pm

data = da.from_zarr("large_dataset.zarr")

with pm.Model():
    # seed the shared variable with the first chunk only
    obs = pm.Data("obs", data.blocks[0].compute())
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("likelihood", mu, 1, observed=obs, total_size=None)
    # streaming_data is the proposed new argument
    approx = pm.fit(method="advi", streaming_data={"obs": data})
```
Would love to know if this direction seems plausible! Thank you.
Hey there,
So I was digging into how the streaming inference problem could be approached, and I found an interesting research paper that tackles a similar problem: “Streaming Variational Bayes”.
After reading through it, I found that this could be a plausible solution to the project statement. Essentially, instead of using the ELBO with total_size (which needs the size of the dataset upfront), Streaming Variational Bayes uses each batch's posterior as the prior for the next batch, so no dataset-size assumptions are needed.
I'll be exploring more of the codebase to see whether this could be integrated without complications, since PyMC uses PyTensor to build the ELBO graph. That could mean working on opvi.py, because PyMC would need to automatically detect which prior parameters should be turned into shared variables. It could also put a little more burden on users, as they'd need to know a bit more about how PyTensor works.
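To make the SVB recursion concrete, here is a toy pure-Python sketch using an exact conjugate Normal update (illustrative only; svb_update and the sample data are my own, not PyMC API):

```python
import math

# Toy illustration of the SVB recursion: a Normal prior on the mean of
# unit-variance Normal data, updated exactly by conjugacy. Each batch's
# posterior becomes the prior for the next batch; the total dataset
# size is never needed.
def svb_update(prior_mu, prior_sigma, batch):
    n = len(batch)
    prior_prec = 1.0 / prior_sigma**2  # precision of the current prior
    post_prec = prior_prec + n         # each unit-noise point adds precision 1
    post_mu = (prior_prec * prior_mu + sum(batch)) / post_prec
    return post_mu, math.sqrt(1.0 / post_prec)

mu, sigma = 0.0, 10.0  # broad initial prior
for batch in ([1.1, 0.9], [1.2, 0.8], [1.0]):
    mu, sigma = svb_update(mu, sigma, batch)
# mu approaches the running sample mean and sigma shrinks with every batch
```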
Would love to hear any feedback @fonnesbeck could you have a look at this whenever you’re free? Thank you!
Thanks for your interest in the project! Tagging @zaxtax on this as well. It's good to see that you are familiarizing yourself with the codebase. At the moment we are waiting to hear how many slots we will be getting for this year's GSoC; in the meantime, getting started on a proposal and continuing to explore our VI implementation is a great way to prepare.
Can you send a link or reference to the paper you are referring to?
Hey there,
Really appreciate the response and the feedback!
The paper I was referring to is “Streaming Variational Bayes” by Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan.
Here's the link: [1307.6769] Streaming Variational Bayes
As for exploring the VI implementation, I have already started digging into it!
I will draft and share a blueprint, or rather a roadmap, of how the implementation and timelines would roughly look.
Again, really appreciate the feedback!
Looking forward to hearing from you and @zaxtax as well!
-Harshith
So in terms of total_size, both the streaming algorithms and something akin to scaling the data are valid approaches. I would aim for us to follow the NumPyro design and make each easy for the user to choose.
Hey there,
Really appreciate the feedback!
I will be sharing the draft shortly; still making a few changes.
Based on what you have suggested, I think it'd be worth implementing both, as they complement each other well.
For the scaling path: a NumPyro-inspired plate-style API where the user passes subsample_size and PyMC calculates the ELBO scaling factor; the catch is that you still need to know total_size.
As for SVB, we don't really need to know total_size, since it uses each batch's posterior as the prior of the next batch, so no dataset-size assumption is needed.
Both can be user-selectable, either through a method="scaling" vs method="streaming" argument, or as two separate classes.
We could also add docstrings describing what each method or class does and when it would be ideal to use each.
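A hypothetical dispatch for that user-facing choice might look like this (fit_sketch and its return strings are illustrative only, not the real pm.fit signature):

```python
# Illustrative dispatch: a single entry point routing between the two
# complementary approaches described above.
def fit_sketch(method="scaling", total_size=None):
    if method == "scaling":
        # NumPyro-style path: rescale the minibatch ELBO, needs total_size
        if total_size is None:
            raise ValueError("method='scaling' requires total_size")
        return f"ELBO scaled by total_size={total_size}"
    if method == "streaming":
        # SVB path: posterior of each batch becomes the next batch's prior
        return "SVB recursion, no total_size needed"
    raise ValueError(f"unknown method {method!r}")
```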
Looking forward to hearing your thoughts!
Based on what @zaxtax and @fonnesbeck suggested, there could be two different approaches to PyMC's dependency on total_size during variational inference, and as mentioned before, these two approaches complement each other well.
Based on that, I've made a draft of the roadmap.
Phase 1: NumPyro Scaling API (~80-90 hrs)
First, we have to fix the get_scaling() crash when total_size=None (essentially, no scaling is done for unknown dimensions).
Then, add subsample_size to create_minibatch_rv(), including auto-scaling.
Implement a pm.plate context manager (so the user never touches total_size), which feeds into create_minibatch_rv().
We also have to wire pm.plate into symbolic_normalizing_constant (currently it picks up the scaling only from variables wrapped as MinibatchRandomVariable directly).
Full tests: unit tests for all of the following:
- get_scaling() with None, partial None, and all dimensions filled
- does pm.idx have the right shape, and does it subsample correctly
- an end-to-end ADVI run with pm.plate on a dataset (checking that the posterior matches the known true parameters)
- and of course a regression test to check everything else works as intended
(Earlier I was going to add the Dask integration in Phase 1, but realized it fits better with SVB in Phase 2; I'll add it there.)
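As a pure-Python sketch of the bookkeeping pm.plate would have to do (pm.plate does not exist in PyMC yet; the Plate class here is a hypothetical illustration, not PyTensor code):

```python
import random

class Plate:
    """Hypothetical plate: records the ELBO scale factor and draws subsample indices."""

    def __init__(self, name, size, subsample_size):
        self.name = name
        self.size = size
        self.subsample_size = subsample_size
        # the minibatch logp is multiplied by this factor so it
        # estimates the full-data logp in expectation
        self.scale = size / subsample_size

    def subsample_indices(self, rng=random):
        # draw subsample_size distinct indices from the full dataset
        return rng.sample(range(self.size), self.subsample_size)

plate = Plate("data", size=1000, subsample_size=50)
idx = plate.subsample_indices()
```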
Phase 2: SVB Streaming Inference (~225 hrs)
Posterior-to-prior params: a function that extracts mu/sigma from MeanField and FullRank approximations after each batch.
Make priors shared: convert the fixed priors to pytensor.shared variables so they can work with SVB (this helps because it doesn't touch the graph structure; PyTensor builds static graphs, so you can't swap constants for new values mid-execution).
Update prior: call set_value() and also clear the node cache between batches in opvi.py (if the cache isn't cleared, varlogp, datalogp, etc. will keep returning pre-update values, which would make the ELBO computation wrong).
StreamingADVI(KLqp): a new class that inherits from KLqp in inference.py and overrides fit() with a batch loop. (Currently ADVI.fit runs n steps on a single dataset; iterating to a new batch while carrying over what was learned isn't implemented. Essentially this class connects all the other pieces: posterior-to-prior extraction, updating the prior, etc.)
Dask integration: StreamingADVI.fit() will accept Dask arrays and iterate over batches lazily (by calling .compute() inside _fit_batch, so each batch is only loaded when needed).
Finally, we'll add a way for users to switch between the two, either through a method="scaling" vs method="streaming" argument, or as two separate classes.
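The control flow of that batch loop can be sketched in plain Python; all names here (streaming_fit, fit_batch, posterior_to_prior) are hypothetical placeholders for the Phase 2 pieces, not real PyMC functions:

```python
def streaming_fit(batches, prior, fit_batch, posterior_to_prior):
    """Run one independent inference per batch, carrying the posterior
    forward as the next batch's prior (the core SVB loop)."""
    for batch in batches:
        posterior = fit_batch(batch, prior)    # e.g. ADVI on this batch alone
        prior = posterior_to_prior(posterior)  # q's params become the next prior
    return prior
```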
Important: found an issue (will find a workaround).
ADVI always fits a Gaussian approximate posterior regardless of the prior specified by the user. So say we are at batch x: after batch x we have Normal(mu, sigma) as our posterior, but if the user's original prior was HalfNormal (which is constrained to be positive), we can't just replace it with an unconstrained Normal(mu, sigma) directly.
I might have made some mistakes.
I’d love to hear feedback on how I could improve this! Thank you!
-Harshith
Hey there,
So after a little digging into the codebase, I wanted to refine and correct a few things.
1.
The HalfNormal issue I flagged is not actually an issue: when HalfNormal("sigma") is seen, a new internal variable called sigma_log is created, and since log can take any value, positive or negative, it is unconstrained.
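A quick numeric check of that point (plain Python, just to illustrate the log transform):

```python
import math

# sigma_log = log(sigma) can be any real number, while exp() maps every
# real value back to the strictly positive support of a HalfNormal.
for sigma_log in (-5.0, 0.0, 3.0):  # any real value is legal
    sigma = math.exp(sigma_log)     # always strictly positive
    assert sigma > 0

# and the round trip recovers the constrained value
assert math.isclose(math.exp(math.log(2.5)), 2.5)
```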
2.
About the posterior extraction: the posterior parameters are already shared variables in approximation.py, but we would still have to work on the prior side by making sure the prior parameters are pytensor.shared variables so they can be updated with set_value().
3.
In the roadmap I mentioned clearing the varlogp/datalogp node cache between batches; this step turns out to be redundant. After understanding how locally_cachedmethod works in opvi.py, the decorator caches the symbolic graph structure, not the evaluated values. Once we change the prior parameters to pytensor.shared, the graph will automatically pick up new values after each set_value() call, without needing cache invalidation.
4.
The one thing I missed in the roadmap is how optimizer state is handled between batches.
In adagrad_window inside updates.py, the gradient accumulators are created as fresh pytensor.shared zeros every time step_function() is compiled, so at the start of each batch the optimizer state resets automatically. This leaves a design choice: say we train on batch x and AdaGrad starts building up a picture of the gradient. Do we throw this away and recompile fit() every batch (pure SVB suggests exactly this, since each batch should be an independent inference problem, but compilation takes more time), or do we compile step_function() once and call it repeatedly across batches (faster, but not pure SVB)?
@zaxtax and @fonnesbeck following up on my earlier roadmap with some corrections after digging into the codebase more carefully. Would love feedback on whether this revised understanding is correct before I finalize the proposal.
This looks good, but I don’t think we need to focus on a bugfix section. This project should be seen as an alternative implementation of the existing way Minibatching is done. This will allow you to get more done and not worry as much about introducing breaking changes.
I would also try to focus less on implementing a specific inference algorithm as much as creating scaffolding that makes it easy to implement many different inference algorithms.
Hey there @zaxtax
Really appreciate the feedback!
Will make the changes based on what you've suggested!
I'll try focusing more on the scaffold: essentially a base class over which you can implement other algorithms.
SVB and a NumPyro-style auto-scaling method will be the first two implementations to demonstrate it works.