Hello,
I am currently trying to implement the following model:
I have a number of political experts who in the past have made predictions about the outcomes of elections.
I assume each expert has an accuracy/edge which makes their predictions better than a random guess.
I have built a model which allows me both to estimate an expert's accuracy from past events and to infer the outcome of a future event for which only the experts' predictions are known to the model.
While the model reconstructs the experts' accuracies rather well, the implied probabilities for future events are not useful, and the posterior looks way off from what I would expect given the model.
This is a working minimum example:
# Imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
# Experts and Events
n_train_events = 10
n_test_events = 2
n_events = n_train_events + n_test_events
event_outcomes = np.random.binomial(1, 0.5, n_events)
n_experts = 10
n_observations = n_events * n_experts
# draw each expert's accuracy/edge from a Beta(2, 2)
expert_accuracy = np.random.beta(2, 2, n_experts)
# Dummy Dataset
expert_event_df = pd.DataFrame(columns=['expert_id', 'event_id', 'expert_pred', 'true_outcome'],
                               index=np.arange(n_observations))
i = 0
for event_idx in range(n_events):
    event_outcome = event_outcomes[event_idx]
    # each expert knows the true outcome with probability equal to their
    # accuracy; otherwise they guess at random
    correct_pred = np.random.binomial(1, expert_accuracy)
    random_guess = np.random.binomial(1, 0.5 * np.ones(n_experts))
    event_pred = correct_pred * event_outcome + \
        (1 - correct_pred) * random_guess
    idx_slice = expert_event_df.index.isin(expert_event_df.index[i:i+n_experts])
    expert_event_df.loc[idx_slice, 'expert_id'] = np.arange(n_experts)
    expert_event_df.loc[idx_slice, 'event_id'] = event_idx
    expert_event_df.loc[idx_slice, 'expert_pred'] = event_pred
    expert_event_df.loc[idx_slice, 'true_outcome'] = event_outcome
    i += n_experts
# PyMC3 Model
expert_lookup = expert_event_df['expert_id'].astype(int).values
event_lookup = expert_event_df['event_id'].astype(int).values
observed_pred = expert_event_df['expert_pred'].astype(int).values
# Test Event Outcomes are masked so that they are sampled
obs_outcomes = event_outcomes.copy()
obs_outcomes[-n_test_events:] = -1
masked_data = np.ma.masked_array(obs_outcomes, obs_outcomes == -1, fill_value=-1)
with pm.Model() as model:
    pm_expert_accuracy = pm.Beta('pm_expert_accuracy',
                                 alpha=1.0, beta=1.0,
                                 shape=n_experts)
    pm_event_prob = pm.Beta('pm_event_prob',
                            alpha=1.0, beta=1.0,
                            shape=n_events)
    pm_event_outcome = pm.Bernoulli('pm_event_outcome',
                                    p=pm_event_prob,
                                    shape=n_events,
                                    observed=masked_data)
    broadcast_expert_accuracy = pm_expert_accuracy[expert_lookup]
    broadcast_true_outcome = pm_event_outcome[event_lookup]
    # an expert with accuracy a predicts the true outcome with
    # probability a * (1 - 0.5) + 0.5 = 0.5 + a / 2
    pm_expert_pred = pm.Deterministic('pm_expert_pred',
                                      broadcast_expert_accuracy * (broadcast_true_outcome - 0.5) + 0.5)
    expert_predictions = pm.Bernoulli('expert_predictions',
                                      p=pm_expert_pred,
                                      shape=n_observations,
                                      observed=observed_pred)
# Sampling
with model:
    trace = pm.sample(10000)
burned_trace = trace[1000:]
The problem is now with the posterior samples for pm_event_outcome_missing, i.e. for the events whose outcomes I have masked. Even if I modify the dataset so that the experts' predictions for the two test events are completely random, the posterior distribution is almost singular around 0 or 1.
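For reference, with the accuracies held fixed, the exact posterior for a single masked event has a closed form under this likelihood (each prediction matches the outcome with probability 0.5 + a/2), which gives a baseline for what the sampler should roughly produce. A quick numpy sketch (the helper name `event_posterior` is made up for illustration):

```python
import numpy as np

def event_posterior(accuracies, predictions, prior=0.5):
    """Exact P(outcome = 1 | predictions) for one event, assuming each
    expert's prediction matches the outcome with probability 0.5 + a/2."""
    accuracies = np.asarray(accuracies, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    p_correct = 0.5 + accuracies / 2.0
    # likelihood of each observed prediction under outcome = 1 and outcome = 0
    lik1 = np.where(predictions == 1, p_correct, 1 - p_correct)
    lik0 = np.where(predictions == 0, p_correct, 1 - p_correct)
    num = prior * lik1.prod()
    return num / (num + (1 - prior) * lik0.prod())

# ten experts with zero edge: their predictions carry no information,
# so the posterior stays exactly at the prior of 0.5
print(event_posterior(np.zeros(10), np.random.binomial(1, 0.5, 10)))  # 0.5
```

With Beta(2, 2) accuracies (mean 0.5) and ten agreeing experts this still moves well away from 0.5, but it should not collapse to a point mass the way my sampled posterior does.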
Is there any obvious thing I am missing? Any advice on how to achieve a model which is better at reflecting uncertainty?
Thank you very much and kind regards.