Hey everyone, new PyMC3 user here. I work mostly with machine learning, and I've been trying to learn as much as I can about probabilistic programming in my free time. Anyway, I put together a model for the Georgia runoff election in the United States, based on the partial pooling baseball example on the PyMC3 website, and I'm curious whether it seems reasonable to the pros on here.
My idea is that the true vote share for the state is unobservable, but each poll gives us a noisy glimpse of it. However, each poll should have its own distribution to account for house effects, sampling bias, and so on. This is similar to the Efron and Morris baseball example, where each player has their own distribution that informs the distribution for the population of professional baseball players.
The data
samplesize = [605, 1250, 500, 800, 1377, 1450, 583, 1064, 300, 1500]
num_votes = [296, 631, 247, 404, 703, 717, 312, 499, 143, 734]
And the model
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as warnock_model:
    # Hyperprior on the statewide vote share; alpha=beta=1 is a flat
    # placeholder, which could be tightened with prior information.
    phi = pm.Beta('phi', alpha=1.0, beta=1.0)

    # Concentration: how tightly the per-poll thetas cluster around phi.
    kappa_log = pm.HalfNormal('kappa_log', sigma=1)
    kappa = pm.Deterministic('kappa', tt.exp(kappa_log))

    # Per-poll vote shares, partially pooled toward the state-level phi.
    thetas = pm.Beta(
        'thetas',
        alpha=phi * kappa,
        beta=(1.0 - phi) * kappa,
        shape=len(num_votes),
    )

    # Observed vote counts for the candidate in each poll.
    y = pm.Binomial(
        'y',
        n=samplesize,
        p=thetas,
        observed=num_votes,
    )
This model seems reasonable to me. The forest plot of the per-poll thetas, with each pollster's actual result overlaid in orange, looks good.
And the posterior for phi, interpreted as the estimated statewide vote share, looks realistic to me as well.
So my questions now are: first, what kinds of posterior predictive checks should I do?
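The only check I've come up with so far is simulating poll counts from the fitted model and seeing whether each observed count lands in the bulk of its simulated distribution, something along these lines:

import numpy as np

with warnock_model:
    ppc = pm.sample_posterior_predictive(trace)

# ppc['y'] has shape (n_draws, n_polls): one simulated vote count per
# posterior draw per poll.
for i, obs in enumerate(num_votes):
    lo, hi = np.percentile(ppc['y'][:, i], [3, 97])
    print(f"poll {i}: observed {obs}, simulated 94% interval [{lo:.0f}, {hi:.0f}]")

Is there anything beyond this in-sample check that would actually stress the model?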
Second, if I wanted to weight the polls by time, what is the best way to do that? My thinking is to artificially decrease each poll's effective sample size by some function of how far it is from the election date, which should increase the uncertainty for polls taken long before the election (sketch below). However, I'm not sure whether that is a commonly accepted practice.
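Here's roughly what I mean; the days_out values and the 14-day half-life are made-up placeholders, not real field dates:

import numpy as np

# Hypothetical days between each poll's field date and the election.
days_out = np.array([30, 25, 21, 18, 14, 10, 8, 5, 3, 1])

# Exponentially decay each poll's effective sample size; the half-life
# is an arbitrary tuning knob.
half_life = 14.0
weights = 0.5 ** (days_out / half_life)

# Shrink n and the observed count by the same factor, so each poll's
# implied vote share is unchanged while its certainty decreases.
effective_n = np.round(np.array(samplesize) * weights).astype(int)
effective_votes = np.round(np.array(num_votes) * weights).astype(int)

# These would replace samplesize / num_votes in the Binomial likelihood.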