Examples for SMC with streaming data?

I have timeseries data that I’d like to model with pm.sample_smc() by first getting the posterior given the data at timepoint 1, then passing that posterior as inits for a next run that uses the data for timepoints 1-3, and repeating until the model has “seen” all the data. Does anyone have any existing code to facilitate this workflow?
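For concreteness, the loop I have in mind looks roughly like this (just a sketch with a toy normal model; whether the previous posterior can actually be passed in as initial particles depends on whether your PyMC version’s sample_smc exposes a start argument, which I haven’t verified):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
full_series = rng.normal(loc=1.5, scale=0.5, size=20)  # stand-in for the real timeseries

posteriors = []
for t in range(1, len(full_series) + 1):
    seen = full_series[:t]  # data up to and including timepoint t
    with pm.Model():
        mu = pm.Normal("mu", 0.0, 2.0)
        sigma = pm.HalfNormal("sigma", 1.0)
        pm.Normal("obs", mu, sigma, observed=seen)
        # Re-fit on the expanded window; feeding the previous posterior back in
        # as initial particles would go here if the installed PyMC release's
        # sample_smc supports a `start` argument.
        idata = pm.sample_smc(draws=1000, progressbar=False)
    posteriors.append(idata)
```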


No, but I’ve wanted to test that for a while. How fast is it?

It would be good to avoid recompiling the same functions on every call (which you can do if you set the data as mutable).
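Something like this, I imagine (a sketch assuming pm.MutableData / pm.set_data, so the model graph is defined once and only the data container changes between calls; I haven’t measured how much recompilation sample_smc actually avoids this way):

```python
import numpy as np
import pymc as pm

full_series = np.random.default_rng(0).normal(1.5, 0.5, size=20)

with pm.Model() as model:
    # mutable container so the observed data can grow without redefining the model
    obs_data = pm.MutableData("obs_data", full_series[:1])
    mu = pm.Normal("mu", 0.0, 2.0)
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("obs", mu, sigma, observed=obs_data)

for t in range(1, len(full_series) + 1):
    with model:
        pm.set_data({"obs_data": full_series[:t]})  # expand the window in place
        idata = pm.sample_smc(draws=1000, progressbar=False)
```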


Working on my own code now; is there a straightforward way to extract the unconstrained representation of the draws from a trace object?

Never mind, figured out the key was idata_kwargs={'include_transformed': True}.
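In case it helps anyone else, the call looks something like this (a quick sketch; I believe the unconstrained draws then appear in the posterior group under the transformed names, e.g. sigma_log__ for a HalfNormal):

```python
import pymc as pm

with pm.Model():
    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("y", 0.0, sigma, observed=[0.1, -0.2, 0.3])
    # include_transformed is forwarded to pm.to_inference_data, so the
    # unconstrained (transformed) variables are kept in the returned idata
    idata = pm.sample_smc(draws=500, idata_kwargs={"include_transformed": True})

unconstrained_sigma = idata.posterior["sigma_log__"]
```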


Sneak peek at progress:

Photoplethysmography data; the uncertainty is sufficiently low/accurate that the green true-value line largely hides the estimate’s uncertainty ribbon :muscle:


That’s really cool!


So the sampling approach described above involves sampling an increasing volume of data, but a true streaming SMC would never see the same data twice. A naive implementation of true SSMC would start by sampling the model given the data at the first timepoint, then use the resulting posterior as inits for a second sampling run given the data at the second timepoint, and so on. However, while the first sampling run should include the priors when accumulating the logp, I think the subsequent runs should omit the priors’ contribution to the logp (correct me if I’m wrong!), and I’m not quite sure how easy that is to achieve in PyMC. I’ve tried defining my own distribution classes with a logp method that returns zero for use on the second-and-onwards sampling runs, but I suspect I’m doing something wrong, as the performance in recovering the data-generating parameters is terrible. Thoughts?

Can you say more about why? Also, in iterative updating/online/streaming settings, it can get pretty confusing to talk about “the priors”, because the original priors from before any data arrived quickly stop being “the priors” at all.

Precisely this:

That is, as I understand things, sampling of the model given the first data point (talking timeseries data here) should be informed by the priors (priors in the “I haven’t seen any data yet, just using domain expertise” sense), which yields a posterior (posterior_0) that reflects an update from the priors given that single data point. This posterior is then passed as starting points for the next round of sampling, which includes the second data point, and since posterior_0 already contains the information/influence of the prior, sampling for the second data point onwards shouldn’t add the prior’s contribution to the log-likelihood yet again.

Half of this seems correct. Once you update, your posterior becomes your prior moving forward.
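To make that concrete, the sequential update is just

$$
p(\theta \mid y_1) \propto p(y_1 \mid \theta)\, p(\theta),
\qquad
p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta, y_1)\, p(\theta \mid y_1),
$$

where the second factor in each expression plays the role of the prior for that round (for a timeseries model, the likelihood for $y_2$ generally also conditions on $y_1$).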

It’s not clear to me what you mean by “starting points” here. You need to be using the posterior as your new prior, exactly as you used the original prior in the first place. This notebook discusses, and pymc-experimental now provides, two methods for constructing “non-parametric” priors (this and this).
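The classic pattern from the updating-priors example is roughly the following (a sketch, not the notebook’s exact code: it builds independent univariate Interpolated priors from the previous posterior via a KDE, which is simple but throws away correlations between parameters):

```python
import numpy as np
import pymc as pm
from scipy import stats

def from_posterior(name, samples, n_points=200):
    # KDE of the previous posterior draws, turned into an Interpolated prior
    smin, smax = samples.min(), samples.max()
    width = smax - smin
    x = np.linspace(smin - 0.1 * width, smax + 0.1 * width, n_points)
    y = stats.gaussian_kde(samples)(x)
    return pm.Interpolated(name, x, y)

# hypothetical usage: `idata` holds the posterior from the previous batch,
# `new_batch` is the data that has arrived since then
# with pm.Model():
#     mu = from_posterior("mu", idata.posterior["mu"].values.ravel())
#     pm.Normal("obs", mu, 1.0, observed=new_batch)
#     idata = pm.sample_smc()
```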


Literal facepalm here on realizing I’d been thinking about this wrong. Thanks for steering me back in the right direction!

I wonder if the pymc devs have seen the new work on defining priors via quantiles, which seems perfect for using a posterior as a prior (well, not perfect, as the method described in that paper would turn the multivariate posterior into a series of univariate priors, thereby losing information).

PyMC’s find_constrained_prior() function is somewhat related.
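For anyone who hasn’t seen it, it looks roughly like this (numbers made up for illustration):

```python
import pymc as pm

# Find Gamma parameters that put ~95% of the prior mass between 0.1 and 0.4
params = pm.find_constrained_prior(
    pm.Gamma,
    lower=0.1,
    upper=0.4,
    mass=0.95,
    init_guess={"alpha": 2.0, "beta": 10.0},
)
print(params)  # dict of optimized parameter values, e.g. {"alpha": ..., "beta": ...}
```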