Examples for SMC with streaming data?

I have timeseries data that I’d like to model with pm.sample_smc() by first getting the posterior given the data at timepoint 1, then pass that posterior as inits for a next run that has data for timepoints 1-3, then repeat until the model has “seen” all the data. Does anyone have any existing code to facilitate this workflow?

1 Like

No but I’ve wanted to test that for a while. How fast is it?

Would be good to avoid recompiling the same functions in every call (if you set the data as mutable)

1 Like

Working on my own code now; is there a straightforward way to extract unconstrained representation of draws from a trace object?

nm, figured out the key was idata_kwargs={'include_transformed': True}

1 Like

Sneak peek at progress:

Photoplethsymography data; uncertainty sufficiently low/accurate that the green true-value line largely hides the estimate uncertainty ribbon :muscle:


That’s really cool!

1 Like

So the sampling approach described above involves sampling an increasing volume of data, but a true streaming SMC would never see the same data twice. A naive implementation of true SSMC would start with sampling the model given the data at the first timepoint, then use the resulting posterior as inits for a second sampling run given the data at the second timepoint, and so on. However, while the first sampling run should include the priors when incrementing the logp, I think the subsequent runs should be omitting the priors contribution to the logp (correct me if I’m wrong!), and I’m not quite sure how easy it is to achieve that in PyMC. I’ve tried defining my own distribution classes with a logp method that returns zero for use on the second-and-onwards sampling runs, but I suspect I’m doing something wrong as the performance in recovering data-generating parameters is terrible. Thoughts?

Can you say more about why? Also, in iterative updating/online/streaming settings, it can get pretty confusing to talk about “the priors” because the original priors before data arrived are quickly no longer “the priors” anymore.

Precisely this:

That is, as I understand things, sampling of the model given the first data point (talking timeseries data here) should be informed by the priors (priors in the “I haven’t seen any data yet, just using domain expertise” sense), which yields a posterior (posterior_0) that reflects an update from the priors given the single data point. This posterior is then passed as starting points for the next round of sampling that includes the second data point, and since posterior_0 already contains the information/influence of the prior, sampling for the second data point and onwards shouldn’t add the prior yet again in the log-likelihood.

Half of this seems correct. Once you update, your posterior becomes your prior moving forward.

It’s not clear to me what you mean by “starting points” here. You need to be using the posterior as your new prior, exactly as you used the prior in the first place. This notebook discusses and PyMC experimental now how two methods for constructing “non-parametric” priors (this and this).

1 Like

Literal facepalm here on realizing I’d been thinking about this wrong. Thanks for steering me back in the right direction!

I wonder if the pymc devs have seen the new work on defining priors via quantiles, which seems perfect for using a posterior as a prior (well, not perfect, as the method described in that paper would turn the multivariate posterior into a series of univariate priors, thereby losing information).

PyMC’s find_contrainted_prior() function is somewhat related.