I am attempting to model the 2018 Mississippi special Senate election in PyMC3 with a beta regression model.

My dataset consists of the percentage of the vote won by the Democrat and the percentage turnout (two separate models) in each county in each round of the election, along with various demographic factors for each county. My goal was to use the demographic and first-round election data, along with partial results from the second round, to generate a forecast of the final second-round result (similar to the NYT needle). My frequentist model performed poorly on election night, so I am building a retrospective model in PyMC3 to see if it would have done better.

To decide between simple regression and weighted regression (weights = county populations), I used Kruschke's Bayesian model comparison on the first-round data. However, the posterior probability of the unweighted model came out as 100% when predicting the margin and 0% when predicting the turnout; these results are intuitively implausible, so I'm concerned I made a mistake.
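(For reference, in the Kruschke-style setup the posterior model probability is just the fraction of posterior draws in which the categorical model index lands on each model. A minimal standalone sketch, with a hypothetical array standing in for the sampled `trace['m']`:)

```python
import numpy as np

# Hypothetical posterior draws of the model index (0 = unweighted, 1 = weighted);
# in a real run these would be trace['m'] from the sampler, not this toy array.
m_draws = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# Posterior model probabilities = proportion of draws at each index value
p_unweighted = np.mean(m_draws == 0)
p_weighted = np.mean(m_draws == 1)
```

A result of exactly 100%/0% usually means the sampled index never switched models, which is a mixing problem as much as a modeling conclusion.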
- Is this an appropriate use of Bayesian model comparison?
- Did I do it correctly? My code for the turnout model is below; the margin model has an identical structure but different data.
import numpy as np
import pymc3 as pm
import theano.tensor as T

# avoid data snooping by dropping first round data
xTurnout1 = turnoutPredictors.drop(['TotalPop', '11/6/18', '11/6 Dem %', 'GOP %McDaniel'], axis=1)
yTurnout1 = turnoutPredictors['11/6/18']
n_predictors_Turnout1 = len(xTurnout1.columns)
predictor_names = list(xTurnout1.columns)

with pm.Model() as prelimTurnoutModel:
    # model comparison: categorical index over the two models, 50/50 prior
    m = pm.Categorical('m', p=np.asarray([.5, .5]))
    # the hyperprior distribution for the mean of the t-distribution
    muB = pm.Normal('muB', 0, 1)
    # the hyperpriors on the scale and degrees of freedom of the t-distribution
    # (half-Cauchy replacing gammas per Gelman; want heavy tails here b/c uncertain)
    tauB = pm.HalfCauchy('tauB', 1)
    tdfB = pm.HalfCauchy('tdfB', 1)
    # define the priors
    # the mean y value; even though this is an uninformative prior, tau can be
    # high because we know the mean turnout will be between 0 and 1
    beta0 = pm.Normal('beta0', mu=0, tau=10)
    # the regression coefficients
    beta1 = pm.StudentT('beta1', mu=muB, lam=tauB, nu=tdfB, shape=n_predictors_Turnout1)
    mu = beta0 + pm.math.dot(beta1, xTurnout1.values.T)
    # keep the Beta mean strictly inside (0, 1); affects <1% of sample values
    mu_clipped = T.clip(mu, 1e-7, 1 - 1e-7)
    # a scale parameter ("sample size") for the beta distribution
    kappa_log = pm.Exponential('kappa_log', lam=1.5)
    kappa = pm.Deterministic('kappa', T.exp(kappa_log))
    # nullpopweights = 1 for every county; popweights = county population / mean county population
    omega = pm.math.switch(T.eq(m, 0), T.as_tensor(nullpopweights), T.as_tensor(popweights))
    kappa_w = omega * kappa
    alphaY = mu_clipped * kappa_w
    betaY = (1 - mu_clipped) * kappa_w
    yl = pm.Beta('yl', alpha=alphaY, beta=betaY, observed=yTurnout1)
    trace = pm.sample(2000, tune=2000, cores=4, nuts_kwargs={'target_accept': 0.95})
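(As a sanity check on the mean/"sample size" parameterisation above, where `alphaY = mu * kappa` and `betaY = (1 - mu) * kappa`: the implied Beta distribution has mean `mu` and variance `mu*(1-mu)/(kappa+1)`, so larger `kappa` concentrates the likelihood. A small standalone sketch with illustrative numbers, not values from the model:)

```python
from scipy import stats

mu, kappa = 0.6, 50.0                  # illustrative mean turnout and concentration
alpha, beta = mu * kappa, (1 - mu) * kappa

dist = stats.beta(alpha, beta)
mean = dist.mean()                     # equals mu: alpha/(alpha+beta) = mu*kappa/kappa
var = dist.var()                       # equals mu*(1-mu)/(kappa+1); shrinks as kappa grows
```

This is why multiplying `kappa` by the population weight `omega` makes populous counties' observations count more: it tightens the Beta likelihood around `mu` for those counties.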
Apologies if this is an inappropriate question, or if I've included too much or too little information (please let me know).