I am currently trying to implement the following model:
I have a number of political experts who, in the past, have made predictions about the outcomes of elections.
I assume each expert has an accuracy/edge that makes their predictions better than a random guess.
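To make that assumption concrete, here is a small standalone sketch of the generative process I have in mind (the names and the accuracy value `a = 0.6` are just illustrative): with probability `a` the expert reports the true outcome, otherwise they flip a fair coin, so the implied hit rate is `a + (1 - a) / 2`.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.6                                 # hypothetical expert accuracy/edge
n = 100_000

outcome = rng.integers(0, 2, n)         # true binary outcomes
knows = rng.random(n) < a               # expert "knows" the outcome with prob a
guess = rng.integers(0, 2, n)           # otherwise a fair coin flip
pred = np.where(knows, outcome, guess)

# implied hit rate: a + (1 - a) / 2 = 0.8 for a = 0.6
print((pred == outcome).mean())         # ~ 0.8
```

Note this matches the prediction probability used in the model below: for outcome 1, `a * (1 - 0.5) + 0.5 = (1 + a) / 2`, which is the same as `a + (1 - a) / 2`.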
I have built a model that allows me both to estimate an expert's accuracy based on past events, and to make inferences about the outcome of a future event for which only the experts' predictions are known to the model.
While the model reconstructs the experts' accuracies rather well, the implied probabilities for future events are not useful, and the posterior looks far from what I would expect given the model.
Here is a minimal working example:
```python
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns

# Experts and Events
n_train_events = 10
n_test_events = 2
n_events = n_train_events + n_test_events
event_outcomes = np.random.binomial(1, 0.5, n_events)

n_experts = 10
n_observations = n_events * n_experts
# use a Beta(2, 2) prior for the true expert accuracies
expert_accuracy = np.random.beta(2, 2, n_experts)

# Dummy Dataset
expert_event_df = pd.DataFrame(
    columns=['expert_id', 'event_id', 'expert_pred', 'true_outcome'],
    index=np.arange(n_observations))

i = 0
for event_idx in range(n_events):
    event_outcome = event_outcomes[event_idx]
    correct_pred = np.random.binomial(1, expert_accuracy)
    random_guess = np.random.binomial(1, 0.5 * np.ones(n_experts))
    event_pred = correct_pred * event_outcome + \
        (1 - correct_pred) * random_guess

    idx_slice = expert_event_df.index.isin(
        expert_event_df.index[i:i + n_experts])
    expert_event_df.loc[idx_slice, 'expert_id'] = np.arange(n_experts)
    expert_event_df.loc[idx_slice, 'event_id'] = event_idx
    expert_event_df.loc[idx_slice, 'expert_pred'] = event_pred
    expert_event_df.loc[idx_slice, 'true_outcome'] = event_outcome
    i += n_experts

# PyMC3 Model
expert_lookup = expert_event_df['expert_id'].astype(int).values
event_lookup = expert_event_df['event_id'].astype(int).values
observed_pred = expert_event_df['expert_pred'].values

# Test event outcomes are masked so that they are sampled
obs_outcomes = event_outcomes.copy()
obs_outcomes[-n_test_events:] = -1
masked_data = np.ma.masked_array(obs_outcomes, obs_outcomes == -1,
                                 fill_value=-1)

with pm.Model() as model:
    pm_expert_accuracy = pm.Beta('pm_expert_accuracy', alpha=1.0, beta=1.0,
                                 shape=n_experts)
    pm_event_prob = pm.Beta('pm_event_prob', alpha=1.0, beta=1.0,
                            shape=n_events)
    pm_event_outcome = pm.Bernoulli('pm_event_outcome', p=pm_event_prob,
                                    shape=n_events, observed=masked_data)

    broadcast_expert_accuracy = pm_expert_accuracy[expert_lookup]
    broadcast_true_outcome = pm_event_outcome[event_lookup]
    pm_expert_pred = pm.Deterministic(
        'pm_expert_pred',
        broadcast_expert_accuracy * (broadcast_true_outcome - 0.5) + 0.5)
    expert_predictions = pm.Bernoulli('expert_predictions', p=pm_expert_pred,
                                      shape=n_observations,
                                      observed=observed_pred)

# Sampling
with model:
    trace = pm.sample(10000)
burned_trace = trace[1000:]
```
The problem is with the posterior samples for `pm_event_outcome_missing`, i.e. for the events whose outcomes I have masked. Even if I modify the dataset so that the experts' predictions for the two test events are completely random, the posterior distribution is almost always concentrated near 0 or 1.
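For reference, some concentration is expected from the likelihood itself: when the accuracies are known, the exact posterior for a masked outcome under a flat prior is a product of per-expert likelihood ratios, and even moderate accuracies push it close to 0 or 1 once the experts agree. A quick sketch with hypothetical numbers (ten experts, all with accuracy 0.5, all predicting 1):

```python
import numpy as np

a = np.full(10, 0.5)            # hypothetical, moderately accurate experts
preds = np.ones(10)             # suppose all ten predict outcome 1

# per-expert probability of predicting 1 under each hypothetical outcome,
# from the model's p = a * (outcome - 0.5) + 0.5
p_if_1 = 0.5 + a * 0.5          # P(pred = 1 | outcome = 1)
p_if_0 = 0.5 - a * 0.5          # P(pred = 1 | outcome = 0)

lik_1 = np.prod(np.where(preds == 1, p_if_1, 1 - p_if_1))
lik_0 = np.prod(np.where(preds == 1, p_if_0, 1 - p_if_0))

post = lik_1 / (lik_1 + lik_0)  # flat prior on the outcome
print(post)                     # ≈ 0.99998, nearly a point mass
```

But this does not explain the behavior for random predictions, where the split votes should roughly cancel and the exact posterior should stay near 0.5.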
Is there anything obvious I am missing? Any advice on how to build a model that better reflects uncertainty?
Thank you very much and kind regards.