Help with MvStudentT/MvNormal political-election model

In light of the upcoming Israeli elections (now almost a month away), I created a model to analyze the polls. The code is available at https://github.com/byblian/pyhoshen and you can see example results of the ‘polls-only’ model at https://twitter.com/pyHoshen

There is also a correlation distributions graph I made that may be of interest in general (see the tweets linked above).

My main problem at this time is that the code takes ~3+ hours to run on Google Colab. I am not sure if that is due to the difficulty of the model given the data (16 parties, ~10 pollsters, ~30 polls over the last 30 days), to problems in the model (this is my first pymc3 model), or to the performance of Google Colab (I don’t know how it will perform on other platforms). I am assuming all of the above.

My background is in software, not statistics. I have spent quite a lot of time exploring various parameterizations, but nothing ever ran particularly fast; at best it reached 1-5 iterations/s. My main concern so far has been ensuring the model converges appropriately (within Google Colab's 12-hour limit).

So I would appreciate help and advice regarding the polls-only model, which converges but slowly (3 hours) :slight_smile:

The ‘mult-mean-variance’ model, which attempts to model house effects as a per-pollster, per-party coefficient multiplied into the mean plus an additional per-pollster variance, has now taken 10 hours to run, has almost converged, and is still problematic. I don’t know yet how well the ‘variance’ model will perform.
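For illustration, a rough, hypothetical sketch of what I mean by that (the names num_pollsters, house_effects, pollster_sd, etc. are made up here for illustration and are not the actual pyhoshen code):

import pymc3 as pm

num_pollsters, num_parties = 10, 16

with pm.Model():
    # per-pollster, per-party multiplicative coefficient on the mean, centered at 1
    house_effects = pm.Normal('house_effects', mu=1.0, sd=0.05,
                              shape=(num_pollsters, num_parties))
    # additional per-pollster measurement noise
    pollster_sd = pm.HalfCauchy('pollster_sd', beta=0.05, shape=num_pollsters)
    # a poll by pollster j on day d would then be observed roughly as
    #   percentages ~ Normal(house_effects[j] * support[d], sd=pollster_sd[j])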

The basic definitions of the model are in models.py, which I tried to document extensively.
The class breakdown might be a little complicated, but conceptually it is really simple: a random walk going back from the forecast day, with the polls modeled as an MvStudentT and the innovations modeled as an MvNormal, all sharing the same basic Cholesky matrix (16x16).

I do not have a public notebook example yet, but an example usage is as follows:

import datetime
import pymc3 as pm
from pyhoshen import israel
forecast_day = datetime.datetime.fromordinal(datetime.date.today().toordinal())
with pm.Model() as model:
  election = israel.IsraeliElectionForecastModel(
      'https://drive.google.com/uc?id=1WYGgC3LeTkwKz0Oc2IYX5OdnyiroSA9P',
      model_type='polls-only', base_elections=[],
      forecast_day=forecast_day, eta=25)
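  # (sampling would follow here, inside the model context - e.g.
  # samples = pm.sample(1000, tune=1000), as in the full model listing further below)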

The results can then be plotted as follows:

import theano
theano.config.compute_test_value = 'off'
bo = election.compute_trace_bader_ofer(samples['support'], threshold=0.0325)
election.plot_mandates(samples, bo, hebrew=False)
election.plot_party_support_evolution_graphs(samples, bo, hebrew=False)

The data consists of:

  1. a JSON document with the configuration (this is the file provided in the above code):
     https://drive.google.com/open?id=1WYGgC3LeTkwKz0Oc2IYX5OdnyiroSA9P
  2. a Google spreadsheet with the polls data:
     https://docs.google.com/spreadsheets/d/1lqqrIp_sXir_Sz_H_y3mCvOaXGc_lmSsPaD-Y0C9eBk

It is really difficult to read the model as it is autogenerated from code - I think it would help a lot if you could write down the pm.Model block explicitly.
1-5 iterations/s is really not too bad for a complex model; as long as there are no divergences or effective-sample-size warnings, I would consider it OK.
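For example, assuming the trace is stored in a variable called samples, a quick check could be:

import pymc3 as pm

# count divergent transitions and print the convergence summary (r_hat, effective n)
print('divergences:', samples.get_sampler_stats('diverging').sum())
print(pm.summary(samples))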

A converging model ran at 1-2 seconds per iteration, and only after it had converged and started mixing; during the exploration phase, it could take upwards of 150 seconds per iteration.

The following is the model with the class breakdown removed (I am running this now to verify it generates the same results):

import datetime
import pymc3 as pm
import numpy as np
import theano.tensor as T
import itertools

forecast_day = datetime.datetime.fromordinal(datetime.date.today().toordinal())

from pyhoshen import configuration
config = configuration.Configuration('https://drive.google.com/uc?id=1WYGgC3LeTkwKz0Oc2IYX5OdnyiroSA9P')

cycle = max(config['cycles'])
cycle_config = config['cycles'][cycle]

parties = cycle_config['parties']
party_unions = { p: parties[p]['union_of'] for p in parties if 'union_of' in parties[p] }
for composite, components in party_unions.items():
    for c in components:
        if c != composite and c in parties:
            del parties[c]

for p in [ id for id, party in parties.items() if 'dissolved' in party ]:
    del parties[p]

config.read_polls(cycle_config, {'polls0': cycle_config['polls'][0]})

# only model support going back this amount of days
# earlier polls are ignored
total_days = 35

from pyhoshen import polls
election_polls = polls.ElectionPolls(
    config.dataframes['polls']['polls0'],
    parties.keys(), forecast_day, max_days = total_days)
  
test_results = [ np.nan_to_num(f) for f in election_polls.get_last_days_average(10)]

num_parties = len(parties.keys())

with pm.Model() as model:
  
  eta = 25
  cholesky_pmatrix = pm.LKJCholeskyCov('cholesky_pmatrix',
      n=num_parties, eta=eta,   
      sd_dist=pm.HalfCauchy.dist(0.1, shape=[num_parties]))
  cholesky_matrix = pm.Deterministic('cholesky_matrix',
      pm.expand_packed_triangular(num_parties, cholesky_pmatrix))
  
  votes = pm.Uniform('votes', 0, 0.5, shape=num_parties)
  
  innovations = pm.MvNormal('innovations',
      mu=np.zeros([total_days, num_parties]),
      chol=cholesky_matrix,
      shape=[total_days, num_parties],
      testval=np.zeros([total_days, num_parties])) 
  
  walk = pm.Deterministic('walk', T.cumsum(innovations, axis=0))
  
  support = pm.Deterministic('support', votes + walk)
  
  # group polls by number of days
  group_key = lambda poll: poll.num_poll_days
  
  grouped_polls = [ (num_days, [p for p in polls]) for num_days, polls in
      itertools.groupby(sorted(election_polls, key=group_key), group_key) ]
  
  # compute average of modeled support for multi-day polls
  def expected_poll_outcome(p):
      if p.num_poll_days > 1:
          return T.mean([ support[d] for d in range(p.end_day, p.start_day + 1)], axis=0)
      else:
          return support[p.start_day]
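
  # One MvStudentT likelihood per poll-duration group: the Cholesky factor is
  # scaled by 1/sqrt(num_days) because a multi-day poll is compared against the
  # average of the modeled support over its days (see the variance discussion below).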

  likelihoods = [ pm.MvStudentT('polls_%d_days' % num_days,
    nu=[ p.num_polled - 1 for p in polls ],
    mu=[ expected_poll_outcome(p) for p in polls ],
    chol=cholesky_matrix / np.sqrt(num_days),
    testval=test_results,
    shape=[len(polls), num_parties],
    observed=[ p.percentages for p in polls ]) for num_days, polls in grouped_polls ]

with model:
  samples = pm.sample(1000, tune=1000,
                        nuts_kwargs=dict(target_accept=.8, max_treedepth=25, integrator='three-stage'))

From a quick look at the model, I would suggest:

  1. Try to replace the for loop in the likelihood; that usually helps.
  2. eta=25 and the large dimension of the LKJ make it quite close to an identity matrix - maybe you can just replace it with an identity (i.e., use a Normal prior instead of MvNormal); a rough sketch of this follows below.
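For suggestion 2, a minimal sketch of that alternative (assuming the variable names and shapes from the model above - total_days, num_parties - and dropping the cross-party correlations from the random walk entirely):

import pymc3 as pm
import theano.tensor as T

total_days, num_parties = 35, 16  # as in the model above

with pm.Model():
    # independent per-party innovation scales instead of an LKJ covariance
    party_sd = pm.HalfCauchy('party_sd', beta=0.1, shape=num_parties)
    innovations = pm.Normal('innovations', mu=0., sd=party_sd,
                            shape=(total_days, num_parties))
    walk = pm.Deterministic('walk', T.cumsum(innovations, axis=0))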
  1. Unfortunately, I cannot replace it, because I need to pass an appropriately scaled Cholesky matrix: I compare multi-day polls to the average of the modeled support over the relevant days. Following up on this stackexchange post:

Assuming $x_1$, $x_2$, and $x_3$ (the support values on the poll's three days) are independent of one another, $Var(\overline{x}) = Var\left(\frac{x_1+x_2+x_3}{3}\right) = \frac{1}{9}\left[Var(x_1)+Var(x_2)+Var(x_3)\right] = \frac{1}{9}\left[s_{x_1}+s_{x_2}+s_{x_3}\right]$

Since I assume a constant covariance matrix across all days, my understanding is that this translates to $\frac{1}{3}\left[Var(x)\right]$, and the corresponding Cholesky factor to $\frac{1}{\sqrt{n}}L$. It would really be nice if I'm mistaken somewhere and could just use the same Cholesky. But based on this understanding, I group the polls by number of days - there are 1-day polls, 2-day polls, and 3-day polls (perhaps 4-day polls) - and create an MvStudentT likelihood for each relevant series of polls. There are only a handful, but again, it would be nice if I'm mistaken and could just use the same Cholesky, or alternatively, if the Cholesky matrix could be passed in a varying way, or at least if a scale array/matrix could be passed in and computed appropriately for all polls together, leading to one MvStudentT likelihood that encapsulates everything (this might also make it easier to consider models that vary the sd of the Cholesky based on the pollster, for example).
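Written out for a general $n$-day poll (under the same constant-covariance assumption, with the daily covariance $\Sigma = LL^\top$), the step I am relying on is:

$$Var(\overline{x}) = \frac{1}{n^2}\sum_{i=1}^{n} Var(x_i) = \frac{1}{n}\Sigma = \left(\frac{1}{\sqrt{n}}L\right)\left(\frac{1}{\sqrt{n}}L\right)^\top$$

which is why each group of $n$-day polls gets the Cholesky factor scaled by $\frac{1}{\sqrt{n}}$.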

  2. The article I started with used eta=50 (and a “shock” eta=100 for election-day results) with ~4 parties. I have since reduced it to 25, but have no idea what a good number is. The intent is to suggest that there are a few connections between parties. In practice (as you can see in the tweet I posted above, where I think the eta was still 50), there are differences: Jewish Home and New Yamin have a very red entry, and Taal and the Joint List also have a very red entry (indicating -0.3 or below; in practice these two entries are much lower), while United Torah Judaism and Shas have a very green entry. The first two pairs represent parties that split apart at that time (New Yamin split from Jewish Home, and Taal split from the Joint List), so the supporters of the original party were divided between the two resulting parties. The last pair represents two ultra-orthodox parties that appeal to a similar demographic. Some of the other entries are false findings because the parties involved have close to 0% support. So in practice, the poll and election dynamics do represent connections between the parties, some of which can be recovered during the campaign. It is important for me to recover these connections so that the final distribution of votes between the parties reflects them and does not assume the parties' support is completely uncorrelated.
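To get a feel for what different eta values imply here: as far as I understand, under an LKJ(eta) prior in d dimensions each pairwise correlation is marginally Beta(eta - 1 + d/2, eta - 1 + d/2) rescaled to (-1, 1), so the implied prior width can be checked directly:

from scipy import stats

# width of the marginal LKJ prior on a single pairwise correlation,
# for 16 parties and a few eta values
d = 16
for eta in [1, 25, 50]:
    a = eta - 1 + d / 2
    lo, hi = 2 * stats.beta(a, a).ppf([0.025, 0.975]) - 1
    print("eta=%3d: 95%% of prior correlations in [%+.2f, %+.2f]" % (eta, lo, hi))

With eta=25 and 16 parties this already keeps roughly 95% of the prior mass within about ±0.25 for each correlation, which seems consistent with the "close to identity" remark above.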