Hierarchical model for counts with categorical data - help structuring

Hi all,

New here and working my way through examples from various references. I have seen multiple ways to structure hierarchical models with categorical variables. As a toy example, say I have a system that collects overdispersed count data (say, units consumed) which can be broken down into multiple phases, is associated with a geographic hierarchy (e.g. local, regional, etc.), and where multiple processes are put into effect to influence the count outcome.
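For context, here is a quick way to simulate made-up data of that shape. The column and category names are just placeholders I invented for illustration; the gamma-mixed Poisson draw is one standard way to get overdispersed (negative-binomial-like) counts:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical categorical structure: phases, geographic levels, processes.
df = pd.DataFrame({
    'phase': rng.choice(['setup', 'run', 'teardown'], size=n),
    'region': rng.choice(['local', 'regional', 'national'], size=n),
    'process': rng.choice(['A', 'B'], size=n),
})

# Overdispersed counts: a gamma-mixed Poisson is a negative binomial,
# so the sample variance exceeds the sample mean.
lam = rng.gamma(shape=2.0, scale=5.0, size=n)
df['count'] = rng.poisson(lam)

assert df['count'].var() > df['count'].mean()  # overdispersion check
```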

I’ve put together the following model structure, thinking I could model each process separately. I have seen examples of this structure online and in reference materials.

count = data['count'].values
phase_idx, phase = data['phase'].factorize(sort=True)
region_idx, region = data['region'].factorize(sort=True)
process_idx, process = data['process'].factorize(sort=True)

COORDS = {
    'obs_id': np.arange(len(phase_idx)),
    'phase': phase,
    'region': region,
    'process': process
}
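For anyone unfamiliar with the pattern: `factorize(sort=True)` returns an integer index per row plus the sorted unique labels, which is what makes the fancy indexing into μ and α work. A minimal standalone illustration:

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
idx, labels = s.factorize(sort=True)

# idx maps each row to its position in the sorted labels:
# 'b' -> 1, 'a' -> 0, 'b' -> 1, 'c' -> 2
assert list(idx) == [1, 0, 1, 2]
assert list(labels) == ['a', 'b', 'c']
```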

with pm.Model(coords=COORDS) as count_model:
    data = pm.Data('counts', count, dims='obs_id')

    ## Hyperpriors
    mu_lam = pm.Exponential('mu_lam', lam=0.3)
    a_lam = pm.Exponential('a_lam', lam=2)

    ## Priors
    μ = pm.Exponential('μ', lam=mu_lam, dims=('phase', 'region'))
    α = pm.Exponential('α', lam=a_lam, dims=('phase', 'region'))

    ## Likelihood
    obs = pm.NegativeBinomial('obs', mu=μ[phase_idx, region_idx],
                              alpha=α[phase_idx, region_idx],
                              observed=data, dims='obs_id')

    idata_trace = pm.sample(draws=10_000, tune=5_000, chains=2, target_accept=0.95)

Is this a valid way to structure the model? It samples fine with made-up data and provides the output of interest: the uncertainty around μ for each combination of categorical variables.

How about all three variables? Unfortunately, adding the third variable keeps killing my kernel.

## Priors now use dims=('phase', 'region', 'process')
## Likelihood
obs = pm.NegativeBinomial('obs', mu=μ[phase_idx, region_idx, process_idx],
                          alpha=α[phase_idx, region_idx, process_idx],
                          observed=data, dims='obs_id')

Thanks in advance!


I’d like to think a bit more deeply about this, but at first glance this looks good to me.

Just a question to clarify: when does the kernel get killed? Is it during sampling, or before it even gets going? And how many parameters do you end up estimating when you add the third variable?

I’m asking because I’m wondering if the dead kernel might actually be your computer running out of memory! If your dataset is large and you have a lot of parameters, that can happen, especially because you’re taking 10,000 draws on each of two chains, which is a lot. Is there a reason you’re drawing so many? In general, 1,000 draws per chain should be more than sufficient.
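To make the memory point concrete, here is a rough back-of-the-envelope accounting. The category counts are hypothetical; plug in your own. With two parameter tables (μ and α) over all three dimensions, the trace size grows multiplicatively:

```python
# Hypothetical category counts for the three-way model.
n_phase, n_region, n_process = 5, 10, 8

params_per_table = n_phase * n_region * n_process   # 400 cells per table
n_params = 2 * params_per_table                     # one table each for μ and α

# Trace storage at the sampler settings from the post.
draws, chains = 10_000, 2
stored = n_params * draws * chains                  # float64 values retained
mem_mb = stored * 8 / 1e6                           # 8 bytes per float64

print(n_params, mem_mb)  # 800 parameters, 128.0 MB for the trace alone
```

That is just the retained trace; tuning, sampler internals, and ArviZ post-processing add on top of it, so the multiplier matters more than any single number here.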