Bayesian Modeling of Disease Case Counts: Validity of Priors and Predictors

I’d probably start modeling something like this as a GLM with a NegativeBinomial likelihood, so maybe something along these lines?

import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt

# Simulate some data
N = 50
rng = np.random.default_rng(42)

# Create a small synthetic dataset with case counts (outcome)
# and some categorical predictors (state, district, block, date).
df = pd.DataFrame({
    "case_count": rng.integers(5, 50, size=N),
    "state": rng.choice(["A", "B", "C"], size=N),
    "district": rng.choice(["D", "E", "F"], size=N),
    "block": rng.choice(["G", "H", "I"], size=N),
    "date": pd.period_range(start="2020-01", periods=N, freq="M"),
})

state_indices, states = pd.factorize(df["state"])
district_indices, districts = pd.factorize(df["district"])
block_indices, blocks = pd.factorize(df["block"])
time_period_indices, time_periods = pd.factorize(df["date"])

# We set up named coordinates. These labels allow PyMC to keep track
# of which random effects belong to which group and can help with
# model diagnostics and trace plotting.
coords = {
    "state": states,
    "district": districts,
    "block": blocks,
    "time_period": time_periods,
    "observation": df.index,
    "variance_source": ["state", "district", "block", "time_period"]
}

with pm.Model(coords=coords) as model:
    # total_sd: the total standard deviation across all the random effects.
    # (Usually I'd also include the variance from the likelihood, but I'm not
    # sure what the best way of doing that is when the final variance depends
    # on the mean, as with a NegativeBinomial.)
    total_sd = pm.HalfNormal("total_sd")
    
    # The variance_ratio parameter (drawn from a Dirichlet) determines how
    # the total variance is split across the different sources. Each predictor
    # will get some fraction of the total variance.
    # The importance parameter can be used to include prior info.
    importance = np.ones(len(coords["variance_source"]))
    variance_ratio = pm.Dirichlet("variance_ratio", a=importance, dims="variance_source")
    
    # Because variance is the square of the standard deviation, we take the
    # square root of each ratio to get how much of total_sd goes to each source.
    stds = pm.Deterministic("stds", pt.sqrt(variance_ratio) * total_sd, dims="variance_source")

    # intercept: a global average across all observations, modeling the
    # baseline case count on the log scale.
    intercept = pm.Normal("intercept", sigma=5)

    # Random effects are given "zero-sum" constraints so they don't absorb
    # the model’s overall mean and remain identifiable. Each dimension
    # will have as many parameters as factor levels (A/B/C for states, etc.),
    # and those parameters sum to zero.
    # I mostly like the sum-to-zero constraints when the set of factor levels is fixed.
    # If you want to make predictions for states that are not in the dataset,
    # a Normal prior will make that easier (see the commented alternative just below).
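    # A sketch of that alternative (not part of the model as written; it would
    # replace the ZeroSumNormal line below):
    # state_unscaled = pm.Normal("state_unscaled", dims="state")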
    state_unscaled = pm.ZeroSumNormal("state_unscaled", dims="state")
    state_effect   = pm.Deterministic("state_effect", state_unscaled * stds[0], dims="state")

    district_unscaled = pm.ZeroSumNormal("district_unscaled", dims="district")
    district_effect   = pm.Deterministic("district_effect", district_unscaled * stds[1], dims="district")

    block_unscaled = pm.ZeroSumNormal("block_unscaled", dims="block")
    block_effect   = pm.Deterministic("block_effect", block_unscaled * stds[2], dims="block")

    time_period_unscaled = pm.ZeroSumNormal("time_period_unscaled", dims="time_period")
    time_period_effect   = pm.Deterministic("time_period_effect", time_period_unscaled * stds[3], dims="time_period")

    mu = pt.exp(
        intercept
        + state_effect[state_indices]
        + district_effect[district_indices]
        + block_effect[block_indices]
        + time_period_effect[time_period_indices]
    )

    # alpha: controls the overdispersion in the Negative Binomial. In PyMC's
    # parameterization the variance is mu + mu**2 / alpha, so smaller alpha
    # means more spread relative to the mean (the distribution approaches a
    # Poisson as alpha grows). That makes it suitable for count data that
    # exhibit more spread than a Poisson would allow.
    alpha = pm.HalfNormal("alpha", sigma=2)

    # Place a Negative Binomial likelihood on the observed counts,
    # linking the expected value mu and overdispersion alpha to the data.
    pm.NegativeBinomial("case_count", mu=mu, alpha=alpha,
                        observed=df["case_count"], dims="observation")

with model:
    trace = pm.sample(nuts_sampler="nutpie")
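
To see how the variance actually gets split across the sources, you can look at the posterior of variance_ratio and stds with ArviZ (just one quick way to inspect the decomposition; assumes arviz is installed):

import arviz as az

az.summary(trace, var_names=["total_sd", "variance_ratio", "stds", "alpha"])
az.plot_forest(trace, var_names=["variance_ratio"], combined=True)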

From here, you could for instance add interactions (maybe the time dependency is different in the different states or blocks), roughly as sketched below.
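
For a state-by-time interaction, one option would be a two-dimensional ZeroSumNormal with its own standard deviation. This is only a sketch (the names interaction_sd and state_time_effect are made up), and the new term would have to be included in the linear predictor:

    # extra lines inside the model block above:
    interaction_sd = pm.HalfNormal("interaction_sd", sigma=1)
    state_time_unscaled = pm.ZeroSumNormal(
        "state_time_unscaled", n_zerosum_axes=2, dims=("state", "time_period")
    )
    state_time_effect = pm.Deterministic(
        "state_time_effect",
        state_time_unscaled * interaction_sd,
        dims=("state", "time_period"),
    )
    # then add state_time_effect[state_indices, time_period_indices]
    # to the sum inside pt.exp(...)
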
Or you could add some autocorrelation to the time effect, for instance using a GP (see the sketch below) or a random walk.
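
A minimal sketch of a latent GP over the monthly time index, which would replace the time_period_unscaled / time_period_effect lines above (the lengthscale prior and covariance function are just placeholders, and you'd probably also drop "time_period" from the variance_source split in that case):

time_idx = np.arange(len(time_periods), dtype="float64")[:, None]

    # inside the model block, instead of the ZeroSumNormal time effect:
    ls = pm.Gamma("ls", alpha=2, beta=0.5)            # lengthscale in months
    gp_sd = pm.HalfNormal("gp_sd", sigma=1)
    cov = gp_sd**2 * pm.gp.cov.ExpQuad(1, ls=ls)
    gp = pm.gp.Latent(cov_func=cov)
    # indexed with time_period_indices exactly like before
    time_period_effect = gp.prior("time_period_effect", X=time_idx)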

You might also want to take population into account. In a larger state / block etc. you would expect larger case counts; the usual way to handle that is a log-population offset (sketch below).
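
A sketch of that, assuming the data had a population value per row (the "population" column is made up). The offset goes inside the exponential, so the rest of the linear predictor models a log rate per person:

    # inside the model block:
    log_pop = pt.log(df["population"].to_numpy())

    mu = pt.exp(
        intercept
        + log_pop  # exposure offset
        + state_effect[state_indices]
        + district_effect[district_indices]
        + block_effect[block_indices]
        + time_period_effect[time_period_indices]
    )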
