Hello all,
I am just getting started with PyMC3, so apologies if this is a basic question!
I am working with some COVID-19 data: for a sample of countries, I have the number of cases and the number of deaths. I also have two predictors that are thought to influence the number of deaths. I've structured my model as a hierarchical beta-binomial, where for each country the predictors determine the expected value of a beta distribution, which is then fed to a binomial likelihood. A fully reproducible example is below:
import arviz as az
import pandas as pd
import pymc3 as pm
import numpy as np
np.random.seed(32)
# Retrieve data
data = pd.read_csv('https://osf.io/pf9te/download', index_col='country')
# Set up model
# Set up model
with pm.Model() as bebi:
    # Set priors on linear coefficients
    β0 = pm.Normal('intercept', 0, 3)
    β1 = pm.Normal('male attr', 0, 3)
    β2 = pm.Normal('female attr', 0, 3)
    # Prior on the standard deviation of each country's beta distribution
    # (sigma in pm.Beta is the standard deviation, not the variance)
    σ = pm.Exponential('Beta SD', 50)
    # Compute the mean of each country's beta distribution from the predictors
    μ = pm.math.invlogit(β0*data['constant'] + β1*data['male_attr'] + β2*data['female_attr'])
    # Use the above to create a beta prior
    ps = pm.Beta('predicted % death', mu=μ, sigma=σ, shape=data.shape[0])
    # Likelihood is binomial
    y = pm.Binomial('estimated deaths', n=data['cases'].values, p=ps, observed=data['deaths'])
    # Inference button
    trace = pm.sample(10_000, tune=1000, cores=8)
    ppc = pm.sample_posterior_predictive(trace, samples=2000)
This seems to make sense - the model returns a sensible-looking posterior predictive distribution - but I am still unsure whether I have formulated it correctly, so any advice is appreciated!
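In case it helps, here is roughly how I have been eyeballing the fit - a minimal sketch, assuming an arviz version whose from_pymc3 converter accepts a model argument:

# Bundle the trace and ppc into an InferenceData object for arviz
idata = az.from_pymc3(trace=trace, posterior_predictive=ppc, model=bebi)
# Overlay posterior predictive draws on the observed death counts
az.plot_ppc(idata)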
My second question, assuming the above is sensible, is about the interpretation of an aspect of the posterior. For the purposes of a paper, I am interested in the uncertainty around the total number of deaths suggested by the model. I can sum the observed deaths, then sum each draw from the PPC across countries, and compute the difference from the observed total for each draw, like so:
# Get number of observed deaths
n_deaths = data['deaths'].sum(0)
# Coerce ppc into a dataframe
pp = pd.DataFrame(ppc['estimated deaths'].T, index=data.index)
# Compute difference-in-deaths per sample
death_diff = pp.sum(0) - n_deaths
# Plot
az.plot_density(death_diff.values)
I think I am confusing myself about what a posterior predictive check is/means, but does this distribution-of-differences represent anything interesting in terms of the model's predictors? The mean of the distribution is close to zero, which must mean that most of the time, the total number of deaths predicted by the model is very close to the observed total. Does the density of the distribution reflect anything about the quality of the predictors - i.e. if the predictors were useful, would we expect to see a distribution-of-differences more tightly bunched around zero?
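To make the "bunched up" idea concrete, here is a small sketch of how I could summarise the spread of that distribution - plain numpy percentiles, with the 94% width being an arbitrary choice on my part:

# Central 94% interval on the total-death discrepancy
lo, hi = np.percentile(death_diff.values, [3, 97])
print(f'94% of posterior predictive totals fall within [{lo:.0f}, {hi:.0f}] of the observed total')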
I hope that makes sense - I suspect this is a basic misunderstanding of the Bayesian machinery on my part, so any help is appreciated! Thank you