Combining pre-fit model with new observations and forecasting in PyMC

I’m looking for guidance on the proper approach in PyMC to:

  1. Use a model fit on historical data

  2. Update it with new observations

  3. Generate forecasts for future dates

Election Forecasting Context

I’m building an election forecasting model (as described here) with these components:

  • Historical Data: Past elections with polling data, results, and other covariates

  • Current Election: Real polling data is arriving periodically before the election

  • Forecast Need: Predict party vote shares up to election day

My model captures polling dynamics including:

  • Time-varying party support

  • Pollster house effects

  • Gaussian process components for temporal dynamics

  • Other factors like incumbency

Current Challenge

I’ve successfully:

  • Fit the model using historical election data

  • Obtained posterior distributions for all parameters

Now I need to:

  1. Incorporate new polls for the current election as they become available

  2. Generate forecasts for future dates up to the election

  3. Properly quantify uncertainty that reflects:

      • High certainty at dates with real polls

      • Increasing uncertainty as we move away from observed polls

Specific Questions

  1. What’s the proper way in PyMC to incorporate new observations (polls) when making forecasts with a pre-fit model?

  2. Should I create a separate “forecast model” that uses the posterior from my training model as priors?

  3. How do I ensure that my real polls properly constrain the forecast uncertainty (so that uncertainty is minimal at real poll dates and grows as we move away from them)?

  4. Is there a standard approach in PyMC for this kind of “update with new data + forecast” problem?

Any examples or guidance would be greatly appreciated. I’m specifically interested in understanding the right architectural approach, rather than specific implementation details.

cc @AlexAndorra @ricardoV94 @awalters who have helped me in the past :slight_smile:


There is some discussion here about how to use posteriors from one model as priors in the next:

I don’t know what the best practice is, but that topic has some detailed discussion and references to the literature. It also seems there are already some implementations of what was discussed there in the pymc_experimental lib:
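For reference, here’s a minimal sketch of the posterior-as-prior idea using prior_from_idata from pymc_experimental (the package has since been renamed pymc-extras, so treat the import path and signature as assumptions and check the current docs). The variable names are placeholders, not from the actual model:

```python
import pymc as pm
from pymc_experimental.utils.prior import prior_from_idata

# `idata` is the InferenceData from the historical fit (assumed to exist).
# prior_from_idata approximates the joint posterior of the named variables
# with a multivariate normal and registers it in the new model as priors.
with pm.Model() as forecast_model:
    priors = prior_from_idata(idata, var_names=["party_baseline", "house_effect"])
    # ... build the likelihood for the new polls from priors["party_baseline"],
    # priors["house_effect"], etc.
```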


Hi @bernardocaldas, and well done on the new model!
Unless you’re using state space models for the time-series part, the workflow shouldn’t be any different from a classic PyMC model, so you should be able to use the set_data function.

There are numerous examples of it on the website – basically, any example that shows how to do out-of-sample predictions should help you.
In particular, if you’re still using my electoral forecasting models as a basis, I’ve written about it on my blog. The recent HSGP tutorials I co-wrote with Bill show it too.
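To make that concrete, here is a minimal toy sketch of the set_data workflow (a deliberately simple linear model, not the election model; all data and names are placeholders):

```python
import numpy as np
import pymc as pm

# Toy stand-ins for polls: x = days since campaign start, y = vote share
rng = np.random.default_rng(1)
x_train = np.linspace(0, 10, 30)
y_train = 0.45 + 0.005 * x_train + rng.normal(0, 0.01, size=30)

with pm.Model() as model:
    # pm.Data (pm.MutableData on older PyMC versions) creates containers
    # that set_data can swap out later
    x = pm.Data("x", x_train)
    y = pm.Data("y", y_train)
    intercept = pm.Normal("intercept", 0.5, 0.1)
    slope = pm.Normal("slope", 0.0, 0.01)
    sigma = pm.HalfNormal("sigma", 0.02)
    # shape=x.shape lets the observed variable resize with the data
    pm.Normal("polls", intercept + slope * x, sigma, observed=y, shape=x.shape)
    idata = pm.sample(random_seed=1)

# Out-of-sample forecast: swap in future dates, then draw from the
# posterior predictive conditioned on the posterior stored in `idata`
x_future = np.linspace(10, 14, 8)
with model:
    pm.set_data({"x": x_future, "y": np.zeros_like(x_future)})  # y is ignored here
    forecasts = pm.sample_posterior_predictive(idata, predictions=True)
```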

Hope this helps, and PyMCheers :vulcan_salute:

Hey @AlexAndorra !

I guess the main question is how to add additional observations. Does replacing the observed_polls data with the recently added polls, then doing posterior sampling, do anything to condition the posterior on the new polls?

“High certainty at dates with real polls”

This is not how polls work. We have very high uncertainty even on poll day, due to several factors: simple sampling variance (we only measure a tiny fraction of the population, so the result is highly uncertain), differential non-response (who responds to a poll depends on what’s been going on), and uncertainty in our attempts to adjust non-representative polls to the general population.
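For a sense of scale (my own back-of-the-envelope numbers, not from the thread): a simple random sample of n = 1,000 respondents with true support p = 0.5 has a standard error of sqrt(0.5 × 0.5 / 1000) ≈ 0.016, i.e. roughly a ±3-percentage-point margin of error at 95% confidence, and that’s before differential non-response and weighting adjustments add their own error on top.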

“Increasing uncertainty as we move away from observed polls”

This is usually a side effect of a time-series model. It sounds like you’re using a GP, where this will happen naturally. You’ll have to be careful to calibrate the covariance kernel to one that makes sense for this.
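A hedged sketch of that behavior with a marginal GP (toy, zero-centered data standing in for deviations of party support from its baseline; the kernel and priors are placeholders you would need to calibrate):

```python
import numpy as np
import pymc as pm

X_polls = np.linspace(0, 10, 25)[:, None]   # poll dates (toy)
y_polls = 0.03 * np.sin(X_polls).ravel()    # support deviation (toy)

with pm.Model() as gp_model:
    # Lengthscale and amplitude govern how fast predictive uncertainty
    # grows as you move away from observed polls -- these need calibration
    ls = pm.Gamma("ls", alpha=2, beta=1)
    eta = pm.HalfNormal("eta", sigma=0.1)
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ls)
    gp = pm.gp.Marginal(cov_func=cov)
    sigma = pm.HalfNormal("sigma", sigma=0.02)  # polling noise
    gp.marginal_likelihood("obs", X=X_polls, y=y_polls, sigma=sigma)
    idata = pm.sample(random_seed=1)

# Predict on a grid running past the last poll up to "election day":
# the predictive sd shrinks near observed polls and widens in the gap
X_grid = np.linspace(0, 14, 60)[:, None]
with gp_model:
    f_pred = gp.conditional("f_pred", X_grid)
    preds = pm.sample_posterior_predictive(idata, var_names=["f_pred"])
```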

“Pollster house effects”

How do you fit that effect and adjust for the differences among polls?

The bigger effect is differential non-response.

“What’s the proper way in PyMC to incorporate new observations (polls) when making forecasts with a pre-fit model?”

You could use something like Sequential Monte Carlo (SMC), or you could just refit the whole model. For local drifts, you can use importance sampling as in LOO.

“so that uncertainty is minimal at real poll dates and grows as we move away from them”

Don’t do that artificially. And make sure to account for the very high uncertainty of the data from any given poll.

If you can write a single model that can accommodate any number of polls, it shouldn’t be a problem updating it and refitting it as new data come in.
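In code, that pattern is just a function that rebuilds and refits the model on whatever polls exist so far (a sketch reusing the hypothetical toy setup from upthread; x_new_polls and y_new_polls are placeholders for newly arrived data):

```python
import numpy as np
import pymc as pm

def fit_polls(x_all, y_all):
    """Rebuild and refit the toy model on all polls observed so far."""
    with pm.Model():
        intercept = pm.Normal("intercept", 0.5, 0.1)
        slope = pm.Normal("slope", 0.0, 0.01)
        sigma = pm.HalfNormal("sigma", 0.02)
        pm.Normal("polls", intercept + slope * x_all, sigma, observed=y_all)
        return pm.sample(random_seed=1)

# Each time new polls arrive, append them and refit from scratch
x_all = np.concatenate([x_train, x_new_polls])
y_all = np.concatenate([y_train, y_new_polls])
idata = fit_polls(x_all, y_all)
```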


Ah, if you want to update the posterior parameters, then you need to re-sample the model, and the easiest way to do that programmatically is to just do it on the full data, including the new polls.

If you just swap in the new polls in place of the old polls you used for sampling and then run sample_posterior_predictive, you will get predictions, of course, but they will be conditioned on the parameters learned from the old polls. The sketch below contrasts the two workflows.
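Spelled out against the toy set_data model from upthread (x_new / y_new are hypothetical newly arrived polls), the two workflows look like this:

```python
import numpy as np
import pymc as pm

# (a) Update the posterior: put ALL polls (old + new) into the model
# and re-sample, so the parameters are conditioned on everything
with model:
    pm.set_data({"x": np.concatenate([x_train, x_new]),
                 "y": np.concatenate([y_train, y_new])})
    idata_updated = pm.sample(random_seed=2)

# (b) Predictions only: swap in new inputs and run the posterior
# predictive; parameter draws still come from the OLD fit in `idata`
with model:
    pm.set_data({"x": x_new, "y": np.zeros_like(x_new)})
    preds = pm.sample_posterior_predictive(idata, predictions=True)
```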

As Bob was saying, the GP will do that automatically.