I’m looking for guidance on the proper approach in PyMC to:

1. Use a model fit on historical data
2. Update it with new observations
3. Generate forecasts for future dates
## Election Forecasting Context

I’m building an election forecasting model (as described here) with these components:

- **Historical data**: past elections with polling data, results, and other covariates
- **Current election**: real polling data arriving periodically before the election
- **Forecast need**: predict party vote shares up to election day

My model captures polling dynamics including:

- Time-varying party support
- Pollster house effects
- Gaussian process components for temporal dynamics
- Other factors like incumbency
## Current Challenge

I’ve successfully:

1. Fit the model using historical election data
2. Obtained posterior distributions for all parameters

Now I need to:

1. Incorporate new polls for the current election as they become available
2. Generate forecasts for future dates up to the election
3. Properly quantify uncertainty that reflects:
   - High certainty at dates with real polls
   - Increasing uncertainty as we move away from observed polls
## Specific Questions

1. What’s the proper way in PyMC to incorporate new observations (polls) when making forecasts with a pre-fit model?
2. Should I create a separate “forecast model” that uses the posterior from my training model as priors?
3. How do I ensure that my real polls properly constrain the forecast uncertainty (so that uncertainty is minimal at real poll dates and grows as we move away from them)?
4. Is there a standard approach in PyMC for this kind of “update with new data + forecast” problem?

Any examples or guidance would be greatly appreciated. I’m specifically interested in understanding the right architectural approach rather than specific implementation details.
There is some discussion here about how to use posteriors from one model as priors in the next:

I don’t know what the best practice is, but that topic has some detailed discussion and references to the literature. It also seems there are already some implementations of what was discussed there in the pymc_experimental library:
Hi @bernardocaldas , and well done on the new model!
Unless you’re using state space models for the time series part, the workflow shouldn’t be any different than a classic PyMC model, so you should be able to use the set_data function.
I guess the main question is how to add additional observations. Does replacing the observed_polls data with the recently added polls + doing posterior sampling do anything to condition the posterior to the new polls?
This is not how polls work. We have very high uncertainty even on poll day due to several factors: simple sampling variance (we only measure a tiny fraction of the population, so the result has high uncertainty), differential non-response (who responds to a poll depends on what’s been going on), and the uncertainty in our attempts to adjust non-representative polls to the general population.
This is usually a side-effect of a time-series model. Sounds like you’re using a GP, where this will happen naturally. You’ll have to be careful to calibrate the covariance kernel to one that makes sense for this.
How do you fit that effect and adjust for the differences among polls?
The bigger effect is differential non-response.
You could use something like Sequential Monte Carlo (SMC), or you could just refit the whole model. For local drifts, you can use importance sampling like in LOO.
Don’t do that artificially. And make sure to account for the very high uncertainty of the data from a given poll.
If you can write a single model that can accommodate any number of polls, it shouldn’t be a problem updating it and refitting it as new data come in.
Ah, if you want to update the posterior parameters, then you need to re-sample the model, and the easiest approach programmatically will be to just do it on the full data, including the new polls.
If you just swap in the new polls instead of the old polls you used to sample, then do sample_posterior_predictive, this will give you predictions of course, but they will be conditioned on parameters learned during sampling with old polls.
As Bob was saying, the GP will do that automatically.