GSoC 2026: Online Sequential Updates for Bayesian State Space Models

Hi everyone — I’m Yicheng Yang, a junior at UIUC studying Computer Science, Statistics, and Economics. I’m applying for GSoC 2026 and wanted to introduce myself and share my early thinking on the online state space models project.

Background

I’m currently taking STAT 429 (Time Series Analysis) at UIUC, where we cover state space models and Kalman filtering directly, so this project landed at a good moment for me. Outside of coursework, I maintain clawdfolio, a portfolio analytics package on PyPI, and I built a real-time prediction market trading system with async streaming data pipelines — which is where I first ran into the pain point this project addresses: updating a running model efficiently as new data arrives, without replaying history.

What I’ve Explored So Far

I’ve been going through the pymc-extras statespace module. A few things I noticed:

  • The existing filters (StandardFilter, UnivariateFilter, SquareRootFilter) are well-built and cover the numerical stability concerns I’d expect (Cholesky-based updates, univariate processing for large observation vectors).

  • The KalmanSmoother is backward-pass only — there’s no forward-only “filter state checkpoint” that a user could serialize and resume from.

  • The PyMCStateSpace base class in core/statespace.py wraps everything in a PyTensor graph, which is great for MCMC but makes incremental updates tricky since the graph assumes the full data sequence is known at compile time.

The gap I see: once you’ve run MCMC offline and want to track new observations in real time, there’s no supported path. You either re-run full MCMC (minutes) or manually extract NumPy arrays from the filter and do updates yourself, losing all the probabilistic infrastructure.

Proposed Approach

The approach I have in mind builds on top of what’s already there rather than replacing it:

  1. FilterState dataclass — captures the sufficient statistics (filtered mean, covariance, log-likelihood) at any time step; essentially a checkpoint that can be serialized and resumed.

  2. OnlineStateSpaceModel wrapper — exposes a model.update(new_obs) API that runs just the predict → update step from the existing Kalman filter code, without replaying history.

  3. Hyperparameter uncertainty via SIR — sequential importance resampling over posterior samples from the original MCMC trace, with ESS-triggered resampling.

  4. JAX-compiled fast path — for the common case (fixed hyperparameters, Gaussian observations) to get per-step latency under 10 ms.
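To make items 1 and 2 concrete, here is a minimal NumPy sketch of what I have in mind. The names (`FilterState`, `kalman_step`) and the time-invariant system matrices are my own assumptions, not existing pymc-extras API, and the update uses a plain inverse for clarity rather than the square-root forms the existing filters use:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FilterState:
    """Checkpoint of Kalman filter sufficient statistics (hypothetical sketch)."""
    mean: np.ndarray   # filtered state mean, shape (k,)
    cov: np.ndarray    # filtered state covariance, shape (k, k)
    loglike: float     # accumulated log-likelihood

def kalman_step(state, y, T, Z, Q, H):
    """One predict -> update step for a linear Gaussian SSM with frozen matrices."""
    # Predict
    m_pred = T @ state.mean
    P_pred = T @ state.cov @ T.T + Q
    # Update
    v = y - Z @ m_pred                    # innovation
    S = Z @ P_pred @ Z.T + H              # innovation covariance
    K = P_pred @ Z.T @ np.linalg.inv(S)   # Kalman gain
    mean = m_pred + K @ v
    cov = P_pred - K @ S @ K.T
    # Gaussian log-likelihood contribution of this observation
    ll = -0.5 * (len(y) * np.log(2 * np.pi)
                 + np.linalg.slogdet(S)[1]
                 + v @ np.linalg.solve(S, v))
    return FilterState(mean, cov, state.loglike + ll)
```

A serialized `FilterState` is all a user would need to resume: `kalman_step(state, new_obs, ...)` never touches the history.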

Questions for Jesse & Jonathan

  • Is there an architectural reason the PyTensor graph needs the full sequence at compile time, or is this something that could be relaxed for the online case?

  • For the SIR particle step — would you prefer this stay in pure NumPy (simpler, more debuggable) or should it go through PyTensor as well for consistency with the rest of the module?

  • Are there known numerical issues with the existing filters at longer time horizons that I should account for in the online update design?


Happy to share a full draft proposal or a notebook exploring the statespace module. Would love any feedback on whether this direction makes sense.

Hi, @YichengYang!

I’m glad to see that you are interested in the project. Going through the SSM module and looking at what is already available is certainly a good first step. We also encourage you to approach the module as a user and get hands-on experience with it. There are some really good notebooks in the `pymc-extras` repository that walk you through using the tools.

You are correct that there is no supported path when you want to track new observations in real time after running MCMC offline. We are hoping to fill that gap with this project :).

I will try, to the best of my ability, to answer your questions below:

  1. PyTensor doesn’t need the full sequence at compile time. However, posterior samples produced by MCMC are tied to the exact dataset used to generate them. This is why PyMC (and most probabilistic programming languages built on MCMC) needs to re-run sampling on all the data if the data changes. Now, as you have alluded to, there are different paths we can take to circumvent that limitation. For example, it is possible to leverage analytical solutions through exponential-family conjugacy, or we could use sequential Monte Carlo (SMC) and re-weight the particles as new data arrives. It is also reasonable to update filter states at a shorter time interval for online predictions and only resample the full posterior at a longer time interval for hyperparameter updates.

  2. We would most likely want to port this into PyTensor or leverage the SMC module that already exists.

  3. Not to my knowledge, no.
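The SMC re-weighting mentioned in answer 1 could look something like the following sketch. This is a generic sequential importance resampling step in pure NumPy, with ESS-triggered multinomial resampling; the function name and the `ess_threshold` default are illustrative, not an existing API:

```python
import numpy as np

def sir_update(log_weights, loglike_new, rng, ess_threshold=0.5):
    """Re-weight posterior draws by the likelihood of a new observation,
    resampling when the effective sample size drops too low (sketch)."""
    lw = log_weights + loglike_new      # Bayes update in log space
    lw -= lw.max()                      # stabilize before exponentiating
    w = np.exp(lw)
    w /= w.sum()
    ess = 1.0 / np.sum(w ** 2)          # effective sample size
    n = len(w)
    if ess < ess_threshold * n:
        # Multinomial resampling: draws are duplicated/dropped in
        # proportion to weight, then weights reset to uniform.
        idx = rng.choice(n, size=n, p=w)
        return idx, np.zeros(n)
    return np.arange(n), np.log(w)
```

Here each posterior draw of the hyperparameters would carry its own filter state, `loglike_new[i]` would be the one-step predictive log-likelihood of the new observation under draw `i`, and the caller re-indexes draws and filter states by the returned `idx`.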

The broad strokes of how I would proceed are:

  1. Using the existing Kalman filter code, write a harness that compiles a one-step update to the filter state, given a “frozen” model. That is, we don’t concern ourselves with learning the “deep” parameters of the model, only the latent state mean and covariance.
  2. Do some experimentation and come up with a good API for storing the KF state. Once we have that one-step update, how are we handling read/write? The KF is Markov, so we only need the last state, but where is the history going?
  3. We will also likely want a way to make forecasts at every time step, so we need machinery to compile and store that function. Then there’s a question of how many steps ahead, and how to handle the bi-temporal database this implies (what was the model’s view of timestep 134 when it was at timestep 122 vs. when it was at timestep 130?).
  4. If a user has fit a Bayesian SSM, they will have an idata object. This will need to be attached to the online KF as an input to the one-step update. That’s going to be expensive; what can be done there?
  5. How do we handle refitting the deep parameters of the model (i.e. getting a new idata) given the new data that has come in? That probably means we were storing the live data, so where is that going? How often do we retrain? The system cannot go offline during training, so we will have to spawn a separate thread to do that. When it’s done, we need to handle the change-over from the old idata to the new idata. Do we go back and replay the entire tape to get a new historical view, or just start from where we are when the new idata comes in?
  6. Also the statespace models we currently have in pymc-extras assume evenly-spaced timesteps. In an online regime this is definitely not going to be the case, so we need models and tools around continuous time.
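The bi-temporal bookkeeping question in step 3 could be sketched as follows. `ForecastLog` and `forecast` are hypothetical names, and the store is a plain in-memory dict; a real implementation would need a persistent backend:

```python
import numpy as np

class ForecastLog:
    """Bi-temporal store: what did the model predict for time t_target
    when it had only seen data up to t_now? (illustrative sketch)"""
    def __init__(self):
        self._store = {}  # (t_now, t_target) -> (mean, cov)

    def record(self, t_now, t_target, mean, cov):
        self._store[(t_now, t_target)] = (mean, cov)

    def view(self, t_now, t_target):
        """The model's view of t_target as of t_now."""
        return self._store[(t_now, t_target)]

def forecast(mean, cov, T, Q, steps):
    """h-step-ahead forecasts by iterating the transition equation."""
    out = []
    for _ in range(steps):
        mean = T @ mean
        cov = T @ cov @ T.T + Q
        out.append((mean.copy(), cov.copy()))
    return out
```

Keying on `(t_now, t_target)` pairs answers questions like “what was the view of timestep 134 as of timestep 122 vs. as of timestep 130” directly, at the cost of storage that grows with horizon times history.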

I think a realistic GSoC project would first do (6), then do (1). The rest is a huge undertaking with lots of details. I am not interested in babysitting someone with an LLM; I assure you Claude and I can do this project ourselves. The purpose is for someone to take the time to master these models and statistical techniques, and become deeply familiar with pymc/pytensor.


Number 6 is a really good point, Jesse. I think it could also be possible to have machinery that bins observations to a particular discrete time point. For example, we would predict and update at a frequency of 5 minutes (or whatever granularity makes sense for the data at hand).
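Binning irregular observations onto a fixed grid could look like the following pure-NumPy sketch. The function name and the 5-minute default bin width are illustrative, and duplicates within a bin are simply averaged, which is one of several reasonable choices:

```python
import numpy as np

def bin_observations(timestamps, values, bin_seconds=300):
    """Assign irregularly-timed observations to fixed-width bins
    (default 5 minutes), averaging duplicates within a bin."""
    ts = np.asarray(timestamps)
    vals = np.asarray(values)
    bins = (ts // bin_seconds).astype(int)        # integer bin index per obs
    unique_bins = np.unique(bins)
    binned = np.array([vals[bins == b].mean() for b in unique_bins])
    return unique_bins * bin_seconds, binned      # bin start times, bin means
```

Bins with no observations simply produce no row, so the filter would still need a missing-data path (or the continuous-time machinery) for empty intervals.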

I second the sentiment that the purpose of this project is to allow interested candidates to learn the models, statistical methods, and to become familiar with the PyTensor/PyMC ecosystem.

I didn’t address SMC, but particle samplers are also interesting, and if that’s where your passions lie, I wouldn’t object to making the project about that. This would then turn into more of a project about estimation of fully non-linear, non-Gaussian state spaces. We would need helpers related to that as well. @lucianopaz did some really great work on a particle sampler draft in the past that could be used as a nice starting point. We didn’t end up running with it at the time, but I would love to see the effort revived.

Be aware, though, that this has nothing to do with online per se. The bookkeeping even becomes harder, because now you have to track two sets of particles: one for the input parameters (the idata) and one for the hidden state that is being estimated (the particles). If non-linearity is important, it might be easier to start with an EKF via autodiff. But the question of how to represent a non-linear system in the existing PyMCStateSpace machinery is an interesting one.
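For reference, the EKF step only differs from the linear KF in that the transition and observation maps are linearized at the current estimate. A minimal sketch, using a finite-difference Jacobian as a stand-in where autodiff (e.g. `jax.jacfwd` or PyTensor gradients) would be used in practice:

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian; autodiff would supply this exactly."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for i in range(len(x)):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

def ekf_step(m, P, y, f, h, Q, R):
    """One EKF predict/update step for x' = f(x) + w, y = h(x) + v."""
    F = numerical_jacobian(f, m)          # linearize transition at m
    m_pred = f(m)
    P_pred = F @ P @ F.T + Q
    H = numerical_jacobian(h, m_pred)     # linearize observation at prediction
    v = y - h(m_pred)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    return m_pred + K @ v, P_pred - K @ S @ K.T
```

With linear `f` and `h` this reduces exactly to the standard Kalman step, which makes it easy to test against the linear infrastructure first.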

And, to be clear, the existing pm.sample_smc isn’t the same thing as a particle sampler in the statespace context (though they have the same name and the same statistical techniques). One is trying to learn the deep parameters of a model and the other is trying to estimate the mean and covariance of latent states in a dynamical system. It might be possible to roll this all together and do it jointly, but the details on how to do that are not obvious to me.

My vision for this is a `statespace.continuous_components` module that would re-implement the statespace building blocks in their ODE forms, then use a Van Loan scheme to discretize in a way that \Delta t is always an input to the model. It would end up giving something similar to what you did in the hurricane notebook, but with worse performance, because it is generalized to any SSM. The online step would then be `new_mean, new_cov = online_step(observations, time_since_last_observation, **frozen_parameters)`.
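For concreteness, the Van Loan construction computes the discrete transition matrix and process noise covariance from one matrix exponential of an augmented block matrix. A sketch, assuming a continuous-time system `dx = A x dt + noise` with spectral density `Qc` (the function name is mine):

```python
import numpy as np
from scipy.linalg import expm

def van_loan_discretize(A, Qc, dt):
    """Discretize dx = A x dt + noise (spectral density Qc) over a step dt
    via Van Loan's matrix-exponential construction."""
    n = A.shape[0]
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n] = -A
    M[:n, n:] = Qc
    M[n:, n:] = A.T
    EM = expm(M * dt)
    Ad = EM[n:, n:].T        # discrete transition matrix, exp(A * dt)
    Qd = Ad @ EM[:n, n:]     # discrete process noise covariance
    return Ad, Qd
```

As a sanity check: for a Brownian-motion component (`A = 0`), this recovers `Ad = I` and `Qd = Qc * dt`, which matches the usual local-level discretization.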

We might be able to do something smart with rewrites to get rid of the matrix exponentials in cases where we know the closed-form answer.

Yeah, that makes sense, and it will be more robust to just resample the data into a fixed frequency.

You are also correct that there is going to be quite a bit of data that needs to be tracked/stored, and we will need to explore efficient storage solutions so that this will scale. I am more concerned with storing posterior/posterior predictive samples than the live data; the live data can usually be replayed from a streaming platform like Kafka. As far as I know, our options for xarray would be NetCDF and Zarr. NetCDF files can get quite large; I think Zarr is generally more efficient?


Thanks for the detailed breakdown Jesse, this is really helpful for scoping.

I think the most natural starting point for me would be online KF updates for linear Gaussian SSMs — getting the `online_step(observations, time_since_last_observation, frozen_parameters)` API right, with proper bookkeeping for the sufficient statistics. The continuous-time components with Van Loan discretization sound like the right way to handle irregular observations, and they fit well with what I had in mind for the FilterState dataclass.

For the non-linear side, EKF via autodiff feels like the more tractable path as a second step — it reuses most of the online KF infrastructure and avoids the particle bookkeeping complexity you mentioned. I’d be happy to explore that once the linear case is solid.

I’m also curious about @lucianopaz’s particle sampler draft — would love to look at it for context even if it’s not the main focus.

On the storage question Dekermanjian raised, zarr does seem like the better fit for streaming posterior samples since it supports chunked append operations natively, unlike netcdf which is more batch-oriented.

I’ve also opened a PR on PyTensor (#1996: slogdet rewrite for block diagonal matrices) to start getting familiar with the graph rewrite infrastructure.