I am just getting started with PyMC3, so apologies if this is a basic question!
I am working with some COVID-19 data: for a sample of countries, I have the number of cases and the number of deaths. I also have two predictors that are thought to influence the number of deaths. I've structured my model as a hierarchical beta-binomial, where for each country the predictors determine the mean of a beta distribution, which in turn supplies the probability for a binomial likelihood. A fully reproducible example is below:
```python
import arviz as az
import pandas as pd
import pymc3 as pm
import numpy as np

np.random.seed(32)

# Retrieve data
data = pd.read_csv('https://osf.io/pf9te/download', index_col='country')

# Set up model
with pm.Model() as bebi:

    # Set priors on linear coefficients
    β0 = pm.Normal('intercept', 0, 3)
    β1 = pm.Normal('male attr', 0, 3)
    β2 = pm.Normal('female attr', 0, 3)

    # Prior on the standard deviation of each country's beta distribution
    σ = pm.Exponential('Beta Variance', 50)

    # Compute the mean of each country's beta distribution
    μ = pm.math.invlogit(β0*data['constant']
                         + β1*data['male_attr']
                         + β2*data['female_attr'])

    # Use the above to create a beta prior, one p per country
    ps = pm.Beta('predicted % death', mu=μ, sigma=σ, shape=data.shape[0])

    # Likelihood is binomial
    y = pm.Binomial('estimated deaths',
                    n=data['cases'].values,
                    p=ps,
                    observed=data['deaths'])

    # Inference button
    trace = pm.sample(10_000, tune=1000, cores=8)
    ppc = pm.sample_posterior_predictive(trace, samples=2000)
```
This seems to make sense - the model returns a sensible-looking posterior predictive check (ppc) - but I am still unsure whether I have formulated it correctly, so any advice is appreciated!
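For intuition, the generative process the model assumes can be sketched in plain NumPy. This is a toy simulation, not the PyMC3 model itself: the `mu`/`sigma` to `alpha`/`beta` conversion below is the standard moment parameterisation of the beta distribution (it requires `sigma**2 < mu * (1 - mu)`), and the design matrix and coefficient values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(32)

def simulate_deaths(X, beta, sigma, n_cases, rng):
    """Simulate from the assumed beta-binomial process.

    X: (countries, predictors) design matrix, including a constant column.
    beta: linear coefficients; sigma: sd of the per-country beta distribution.
    """
    mu = 1 / (1 + np.exp(-(X @ beta)))           # inverse logit of the linear predictor
    kappa = mu * (1 - mu) / sigma**2 - 1         # moment matching: (mean, sd) -> concentration
    p = rng.beta(mu * kappa, (1 - mu) * kappa)   # per-country death rate
    return rng.binomial(n_cases, p)              # observed deaths

# Hypothetical data: 5 countries, constant + two standardised predictors
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
deaths = simulate_deaths(X, np.array([-3.0, 0.5, -0.5]), 0.01, np.full(5, 10_000), rng)
```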
My second question, assuming the above is sensible, is about the interpretation of an aspect of the posterior. For the purposes of a paper, I am interested in the uncertainty around the total number of deaths implied by the model. I can sum the total number of deaths observed, then sum across each draw from the ppc and compute the difference from the observed total each time, like so:
```python
# Get the observed number of deaths
n_deaths = data['deaths'].sum()

# Coerce the ppc into a (countries × samples) dataframe
pp = pd.DataFrame(ppc['estimated deaths'].T, index=data.index)

# Compute the difference-in-deaths for each posterior predictive sample
death_diff = pp.sum(0) - n_deaths

# Plot
az.plot_density(death_diff.values)
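The uncertainty in the total can also be summarised directly from the draws. A minimal sketch, assuming the ppc array has shape `(samples, countries)`; the `ppc_draws` array here is a simulated stand-in for `ppc['estimated deaths']`, and the 94% interval width mirrors ArviZ's default:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for ppc['estimated deaths']: 2000 draws over 30 countries
ppc_draws = rng.binomial(10_000, 0.05, size=(2000, 30))

total_per_draw = ppc_draws.sum(axis=1)           # total deaths in each posterior draw
lo, hi = np.percentile(total_per_draw, [3, 97])  # central 94% interval
print(f"posterior predictive total deaths: {total_per_draw.mean():.0f} [{lo:.0f}, {hi:.0f}]")
```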
I think I am confusing myself about what a posterior predictive check is/means, but does this distribution-of-differences tell us anything interesting about the model's predictors? The mean of the distribution is close to zero, which must mean that most of the time the number of deaths predicted by the model is very close to the observed number. Does the spread of the distribution reflect anything about the quality of the predictors? That is, if the predictors were useful, would we expect to see a distribution-of-differences more tightly bunched around zero?
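One way to probe that intuition is a toy simulation, entirely separate from the model above: compare the spread of the total-difference distribution when the posterior over the death rates is tightly concentrated near the truth (a stand-in for "informative predictors") versus barely constrained. All numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cases = np.full(30, 10_000)
true_p = rng.beta(5, 95, size=30)       # "true" per-country death rates
observed = rng.binomial(n_cases, true_p)

def diff_spread(p_draws):
    """Std of (predicted total - observed total) across posterior draws."""
    totals = rng.binomial(n_cases, p_draws).sum(axis=1)
    return (totals - observed.sum()).std()

# Tight posterior: draws of p concentrated near the truth
tight = np.clip(true_p + rng.normal(0, 0.002, size=(2000, 30)), 0, 1)
# Loose posterior: draws of p only weakly constrained
loose = np.clip(true_p + rng.normal(0, 0.02, size=(2000, 30)), 0, 1)

print(diff_spread(tight), diff_spread(loose))
```

In this setup the tighter posterior produces a markedly narrower difference distribution, though even a perfect posterior leaves an irreducible floor of binomial sampling noise.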
I hope that makes sense - I think this probably stems from a basic misunderstanding of the Bayesian machinery on my part, so any help is appreciated! Thank you