GSoC 2026: Streaming Variational Inference for Large Datasets

Hi — I’m Yicheng Yang, a junior at UIUC (CS + Statistics + Economics), applying for GSoC 2026. I wanted to share my thinking on the streaming variational inference direction and get some early feedback.

Background

I'm taking STAT 432 (Statistical Learning) at UIUC, which covers stochastic gradient methods, and I've been working through the Baruch Pre-MFE Numerical Linear Algebra seminar, which covers L-BFGS and quasi-Newton convergence. The practical motivation comes from my own projects: I built a real-time trading system for prediction markets that ingests continuous data streams, and I maintain clawdfolio, a portfolio analytics package that has to process financial time series that regularly exceed memory on a standard machine. I know firsthand what it feels like when an analysis pipeline hits a memory wall.

What I’ve Explored So Far

I’ve been reading through pymc/variational/ to understand the existing infrastructure:

  • The MeanField approximation and the ADVI inference loop in inference.py use self.approx.logp_nojac, which calls into the PyTensor graph over the full dataset. There’s no batching at the ELBO level — the existing pm.Minibatch approach works at the data-indexing level but assumes the full array is pre-loaded.

  • Pathfinder uses L-BFGS to walk toward the posterior mode, then approximates the inverse Hessian from the trajectory. The optimizer is deterministic — no stochastic gradient support.

  • MinibatchRV and the Minibatch class in data.py are the natural extension points. The scaling factor (N / batch_size) is already computed; the missing piece is feeding batches from an iterator rather than random-sampling from a pre-loaded array.
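To make the scaling-factor point above concrete, here is a pure-Python sketch (the `loglik` helper and the partition loop are illustrative, not PyMC code) of why scaling a batch log-likelihood by N / batch_size gives an unbiased estimate of the full-data term; over a disjoint partition of the data, the scaled estimates average to the full sum exactly:

```python
import math
import random

def loglik(x, mu=0.0, sigma=1.0):
    # Log density of a Normal(mu, sigma) at x.
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
full = sum(loglik(x) for x in data)  # full-data log-likelihood

N, batch_size = len(data), 100
estimates = []
for start in range(0, N, batch_size):
    batch = data[start:start + batch_size]
    # The N / batch_size factor rescales the batch sum to full-data scale.
    estimates.append((N / batch_size) * sum(loglik(x) for x in batch))

avg = sum(estimates) / len(estimates)
assert abs(avg - full) < 1e-6  # scaled batch estimates average to the full sum
```

The same identity holds in expectation for randomly sampled batches, which is exactly what makes a scaled batch ELBO a valid stochastic objective.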

Proposed Approach

The core idea: wrap an arbitrary Python iterator in a StreamingDataset object that handles the N-scaling problem and plugs into the existing pm.Minibatch infrastructure. From there:

  • Streaming ADVI — replace the full-data ELBO call with a scaled batch ELBO, consume data from the iterator, add a CUSUM-based convergence monitor (standard ELBO plateau checks don’t work well for streaming since the distribution can shift).

  • Streaming Pathfinder — adapt L-BFGS to stochastic gradients using the overlap-correction technique from Moritz et al. 2016 — two independent batches per step to get an unbiased curvature estimate.
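A minimal sketch of what the wrapper could look like. Everything here is hypothetical, not existing PyMC API: `StreamingDataset`, `approximate_n`, `n_estimate`, and `scale` are names I'm proposing. It consumes an arbitrary iterator, yields fixed-size batches, and tracks an estimate of the total size N, either supplied up front or grown adaptively as the stream is consumed:

```python
from itertools import islice

class StreamingDataset:
    """Hypothetical wrapper: batches an arbitrary iterator and tracks N."""

    def __init__(self, iterator, batch_size, approximate_n=None):
        self._it = iter(iterator)
        self.batch_size = batch_size
        self.approximate_n = approximate_n
        self.seen = 0  # observations consumed so far

    @property
    def n_estimate(self):
        # A known total size wins; otherwise fall back to an adaptive
        # "at least what we've seen" estimate, updated online.
        if self.approximate_n is not None:
            return self.approximate_n
        return max(self.seen, self.batch_size)

    @property
    def scale(self):
        # The N / batch_size factor that rescales a batch ELBO term
        # toward the full-data term.
        return self.n_estimate / self.batch_size

    def __iter__(self):
        while True:
            batch = list(islice(self._it, self.batch_size))
            if not batch:
                return
            self.seen += len(batch)
            yield batch, self.scale

stream = StreamingDataset(range(10), batch_size=4, approximate_n=10)
batches = list(stream)  # three batches of sizes 4, 4, 2, each with scale 2.5
```

The inference loop would then consume `(batch, scale)` pairs instead of indexing into a pre-loaded array, which is the piece the current `pm.Minibatch` path doesn't cover.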

Questions for Rob

  • On the Minibatch scaling: the current implementation assumes N is known. For streams with unknown total size, would you prefer an explicit approximate_n parameter, or an adaptive estimator that updates N online?

  • Architectural scope: is there appetite for modifying the Pathfinder optimizer directly, or would it be cleaner to keep Pathfinder untouched and build a parallel StreamingPathfinder class?

  • Stability under high gradient noise: are there known issues with the MeanField + Adam combination at very small batch sizes that I should account for?

Reply from Rob

These are great questions. I think we should follow numpyro's lead: one set of semantics when the total size is known, and a different weighting, possibly with a warning, when none is provided.

Ideally this should be something that takes optimizers as input and is fairly generic otherwise. If this involves changing Pathfinder, you are empowered to do that.

As for tuning specific to ADVI or particular optimizers, that is mostly out of scope.

Great questions! Keep them coming and since the deadline is coming up, try to get a draft to me for feedback sooner rather than later.

Thanks Rob, this is really helpful! Good to know about following numpyro's semantics for the N-scaling — I'll look at their implementation. And glad to hear modifying Pathfinder directly is on the table.

I’ll send a draft to the review email shortly.