Bayesian Modeling of Disease Case Counts: Validity of Priors and Predictors

Question:
We are working on a project analyzing disease case counts across different regions in an Indian state. The administrative hierarchy follows a state → district → block structure. Our goal is to identify key predictors influencing disease case counts (target variable) for a given month and year, accounting for the temporal variation in disease spread.

To model this, we assume:

  1. The intercept should never be negative since, even with zero-valued predictors, disease cases cannot be below zero.
  2. Coefficients can be either positive or negative, depending on whether the predictor has a positive or negative correlation with disease cases.

Based on these assumptions, we selected appropriate priors and distributions for our Bayesian hierarchical model.

Is our modeling approach and choice of priors/distributions correct given these assumptions? Also, are there any improvements we should consider in structuring our model?
@adam @ricardoV94

I’d probably start modeling something like this as a GLM with a NegativeBinomial likelihood, so maybe along the lines of this?

import numpy as np
import pandas as pd
import pymc as pm
import pytensor.tensor as pt

# Simulate some data
N = 50
rng = np.random.default_rng(42)

# Create a small synthetic dataset with case counts (outcome)
# and some categorical predictors (state, district, block, date).
df = pd.DataFrame({
    "case_count": rng.integers(5, 50, size=N),
    "state": rng.choice(["A", "B", "C"], size=N),
    "district": rng.choice(["D", "E", "F"], size=N),
    "block": rng.choice(["G", "H", "I"], size=N),
    "date": pd.period_range(start="2020-01", periods=N, freq="M"),
})

state_indices, states = pd.factorize(df["state"])
district_indices, districts = pd.factorize(df["district"])
block_indices, blocks = pd.factorize(df["block"])
time_period_indices, time_periods = pd.factorize(df["date"])

# We set up named coordinates. These labels allow PyMC to keep track
# of which random effects belong to which group and can help with
# model diagnostics and trace plotting.
coords = {
    "state": states,
    "district": districts,
    "block": blocks,
    "time_period": time_periods,
    "observation": df.index,
    "variance_source": ["state", "district", "block", "time_period"]
}

with pm.Model(coords=coords) as model:
    # total_sd: the total standard deviation across all sources of variation.
    # (Usually I'd also include the variance from the likelihood, but I'm not
    # sure what the best way of doing that is when the final variance depends
    # on the mean, as with a NegativeBinomial.)
    total_sd = pm.HalfNormal("total_sd")
    
    # The variance_ratio parameter (drawn from a Dirichlet) determines how
    # the total variance is split across the different sources. Each source
    # gets some fraction of the total variance.
    # The importance parameter can be used to include prior info.
    importance = np.ones(len(coords["variance_source"]))
    variance_ratio = pm.Dirichlet("variance_ratio", a=importance, dims="variance_source")
    
    # Because variance is the square of the standard deviation, we take the
    # square root of each ratio to get how much of total_sd goes to each source.
    stds = pm.Deterministic("stds", pt.sqrt(variance_ratio) * total_sd, dims="variance_source")

    # intercept: a global average across all observations, modeling the overall
    # mean case count on the log scale.
    intercept = pm.Normal("intercept", sigma=5)

    # Random effects are given "zero-sum" constraints so they don't absorb
    # the model’s overall mean and remain identifiable. Each dimension
    # will have as many parameters as factor levels (A/B/C for states, etc.),
    # and those parameters sum to zero.
    # I like the sum-to-zero constraints mostly if the set of predictors is fixed.
    # If you want to make predictions for states that are not in the dataset,
    # a Normal will make that easier.
    state_unscaled = pm.ZeroSumNormal("state_unscaled", dims="state")
    state_effect   = pm.Deterministic("state_effect", state_unscaled * stds[0], dims="state")

    district_unscaled = pm.ZeroSumNormal("district_unscaled", dims="district")
    district_effect   = pm.Deterministic("district_effect", district_unscaled * stds[1], dims="district")

    block_unscaled = pm.ZeroSumNormal("block_unscaled", dims="block")
    block_effect   = pm.Deterministic("block_effect", block_unscaled * stds[2], dims="block")

    time_period_unscaled = pm.ZeroSumNormal("time_period_unscaled", dims="time_period")
    time_period_effect   = pm.Deterministic("time_period_effect", time_period_unscaled * stds[3], dims="time_period")

    mu = pt.exp(
        intercept
        + state_effect[state_indices]
        + district_effect[district_indices]
        + block_effect[block_indices]
        + time_period_effect[time_period_indices]
    )

    # alpha: controls the overdispersion in the Negative Binomial. Smaller alpha
    # means higher variance relative to the mean, making it suitable for count
    # data that exhibit more spread than a Poisson would allow.
    alpha = pm.HalfNormal("alpha", sigma=2)

    # Place a Negative Binomial likelihood on the observed counts,
    # linking the expected value mu and overdispersion alpha to the data.
    pm.NegativeBinomial("case_count", mu=mu, alpha=alpha,
                        observed=df["case_count"], dims="observation")

with model:
    trace = pm.sample(nuts_sampler="nutpie")

From here, maybe you could add interactions (for instance, maybe the time dependency is different in different states or blocks), as sketched below.
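For instance, a state-by-time interaction could look roughly like this (a minimal sketch with hypothetical names, added inside the model block above):

    # Hypothetical state x time interaction: each state gets its own
    # deviation around the shared time effect.
    interaction_sd = pm.HalfNormal("interaction_sd", sigma=1)
    state_time_unscaled = pm.ZeroSumNormal("state_time_unscaled", dims=("state", "time_period"))
    state_time_effect = pm.Deterministic(
        "state_time_effect", state_time_unscaled * interaction_sd,
        dims=("state", "time_period"),
    )
    # Then add state_time_effect[state_indices, time_period_indices]
    # to the linear predictor inside pt.exp(...).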
Or you could add some autocorrelation for the time effect, for instance using a GP or a random walk; see the sketch after this.
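A rough sketch of the random-walk version (this would replace the ZeroSumNormal time effect above; the names are hypothetical):

    # Hypothetical random-walk prior on the time effect, so adjacent months
    # share information instead of being independent.
    time_walk_sd = pm.HalfNormal("time_walk_sd", sigma=0.5)
    time_innovations = pm.Normal("time_innovations", sigma=1, dims="time_period")
    time_period_effect = pm.Deterministic(
        "time_period_effect", pt.cumsum(time_innovations) * time_walk_sd,
        dims="time_period",
    )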

You might also want to take population into account: in a larger state or block you would expect larger case counts.
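A common way to do that is an exposure offset on the log scale, roughly like this (the population column here is hypothetical):

    # Hypothetical exposure offset: with a log link, adding log(population)
    # models rates per person rather than raw counts (coefficient fixed at 1).
    log_pop = pm.Data("log_pop", np.log(df["population"].to_numpy()), dims="observation")
    mu = pt.exp(
        intercept
        + log_pop
        + state_effect[state_indices]
        + district_effect[district_indices]
        + block_effect[block_indices]
        + time_period_effect[time_period_indices]
    )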

import numpy as np
import pandas as pd
import pymc as pm

# Add a small constant to avoid issues with zeros in target variable
df_train['block_disease_count_per_100k'] += 1e-6

# Define district and block-level features
district_features = [
    'Population_y', 'district_forest_count_per_100k', 'district_farmland_count_per_100k',
    'district_water_bodies_per_100k', 'district_parks_per_100k',
    'district_hospitals_per_100k', 'district_sanitation_per_100k'
]

block_features = [
    'Population_x', 'block_forest_count_per_100k', 'block_farmland_count_per_100k',
    'block_block_water_bodies_per_100k', 'block_block_parks_per_100k',
    'block_block_hospital_count_per_100k', 'block_block_sanitation_per_100k'
]

# Create composite identifier: block_month_year
df_train['block_month_year'] = (
    df_train['Block'].astype(str) + "_" +
    df_train['Month'].astype(str) + "_" +
    df_train['Year'].astype(str)
)

# Factorize (encode) indices for hierarchical levels
district_idx_train, districts = pd.factorize(df_train['district'], sort=True)
block_idx_train, blocks = pd.factorize(df_train['Block'], sort=True)
block_month_year_idx_train, block_month_years = pd.factorize(df_train['block_month_year'], sort=True)

# Map each block to its district, using the same (sorted) district codes as above.
# Re-factorizing here could assign codes that don't match `districts`.
block_to_district_mapping = districts.get_indexer(df_train.groupby('Block')['district'].first())

# Map each block_month_year level to its block, so parameters with the
# block_month_year dimension can be centered on the matching block-level parameters.
block_month_year_to_block = blocks.get_indexer(df_train.groupby('block_month_year')['Block'].first())

# PyMC coordinates
coords = {
    'district': districts,
    'block': blocks,
    'block_month_year': block_month_years,
    'district_feature': district_features,
    'block_feature': block_features,
}
# Build the PyMC hierarchical model (target: block_disease_count_per_100k)
with pm.Model(coords=coords) as hierarchical_model:
    # Data
    X_district = pm.Data('X_district', df_train[district_features])  # District-level features
    X_block = pm.Data('X_block', df_train[block_features])  # Block-level features
    y = pm.Data('y', df_train['block_disease_count_per_100k'])  # Target variable
    district_idx_pt = pm.Data('district_idx', district_idx_train)  # District indices
    block_idx_pt = pm.Data('block_idx', block_idx_train)  # Block indices
    block_month_year_idx_pt = pm.Data('block_month_year_idx', block_month_year_idx_train)  # Time indices

    # District-level intercepts: HalfNormal keeps them positive, per the
    # assumption that the baseline rate cannot be negative
    sigma_district = pm.LogNormal('sigma_district', mu=0, sigma=0.01)
    a_district = pm.HalfNormal('a_district', sigma=sigma_district, dims='district',
                               initval=np.ones(len(districts)) * 0.01)

    # District-level slopes for each district feature
    mu_b_district = pm.Normal('mu_b_district', mu=0, sigma=0.01, dims='district_feature')
    sigma_b_district = pm.LogNormal('sigma_b_district', mu=0, sigma=0.01)
    b_district = pm.Normal('b_district', mu=mu_b_district, sigma=sigma_b_district,
                           dims=('district', 'district_feature'),
                           initval=np.zeros((len(districts), len(district_features))))

    # Block-level intercepts (nested within districts): the district intercept
    # sets the scale of the corresponding block intercepts
    a_block = pm.HalfNormal(
        'a_block',
        sigma=a_district[block_to_district_mapping],
        dims='block',
        initval=np.ones(len(blocks)) * 0.01,
    )


    # Block-level slopes for block features, centered on the district slopes.
    # Note: this only lines up because district_features and block_features
    # have the same length and ordering.
    sigma_b_block = pm.LogNormal('sigma_b_block', mu=0, sigma=0.01)
    b_block = pm.Normal(
        'b_block',
        mu=b_district[block_to_district_mapping],
        sigma=sigma_b_block,
        dims=('block', 'block_feature'),
        initval=np.zeros((len(blocks), len(block_features))),
    )


    # Block-month-year level intercepts. Indexing a_block with a per-level
    # block mapping (not the per-observation block index) keeps the shape
    # aligned with the block_month_year dimension.
    a_block_month_year = pm.HalfNormal(
        'a_block_month_year',
        sigma=a_block[block_month_year_to_block],
        dims='block_month_year',
        initval=np.ones(len(block_month_years)) * 0.01,
    )

    # Block-month-year level slopes, centered on the block-level slopes
    sigma_b_block_month_year = pm.LogNormal('sigma_b_block_month_year', mu=0, sigma=0.01)
    b_block_month_year = pm.Normal(
        'b_block_month_year',
        mu=b_block[block_month_year_to_block],
        sigma=sigma_b_block_month_year,
        dims=('block_month_year', 'block_feature'),
        initval=np.zeros((len(block_month_years), len(block_features))),
    )

    # Expected value: district- and block-level feature contributions plus the
    # block-month-year intercept, clipped away from zero for the Gamma rate
    district_contrib = (X_district * b_district[district_idx_pt]).sum(axis=1)
    block_contrib = (X_block * b_block_month_year[block_month_year_idx_pt]).sum(axis=1)
    mu = pm.math.maximum(
        a_block_month_year[block_month_year_idx_pt] + district_contrib + block_contrib,
        1e-6,
    )


    # Likelihood: Gamma with alpha=1 (i.e. an Exponential) with mean mu,
    # suitable for a strictly positive continuous target
    Y_obs = pm.Gamma('Y_obs', alpha=1.0, beta=1 / mu, observed=y)

    # Sample using NUTS
    trace = pm.sample(
        1000,
        cores=2,
        chains=2,
        random_seed=42,
        return_inferencedata=True,
    )

This is the code I wrote. What changes should I make here?