Hierarchical model for counts with categorical data - help structuring

Hi all,

New here and working my way through examples from various references. I have seen multiple ways to structure hierarchical models with categorical variables. As a toy example, say I have a system that collects overdispersed count data (say, units consumed) that can be broken down into multiple phases, is associated with a geographic hierarchy (e.g. local, regional, etc.), and is subject to multiple processes put in place to influence the count outcome.

I’ve put together the following model structure, thinking I could model each process separately. I’ve seen examples of this structure online and in reference materials.

import numpy as np
import pymc as pm

# Encode each categorical column as integer index arrays plus the category labels
count = data['count'].values
phase_idx, phase = data['phase'].factorize(sort=True)
region_idx, region = data['region'].factorize(sort=True)
process_idx, process = data['process'].factorize(sort=True)

COORDS = {
    'obs_id': np.arange(len(phase_idx)),
    'phase': phase,
    'region': region,
    'process': process
}

with pm.Model(coords=COORDS) as count_model:

    # Observed counts registered as a named data variable in the model
    counts = pm.Data('counts', count, dims='obs_id')

    ## Hyperpriors
    mu_lam = pm.Exponential('mu_lam', lam=0.3)
    a_lam = pm.Exponential('a_lam', lam=2)

    ## Priors: one (μ, α) pair per (phase, region) cell
    μ = pm.Exponential('μ', lam=mu_lam, dims=('phase', 'region'))
    α = pm.Exponential('α', lam=a_lam, dims=('phase', 'region'))

    # Likelihood
    obs = pm.NegativeBinomial('obs', mu=μ[phase_idx, region_idx],
                              alpha=α[phase_idx, region_idx],
                              observed=counts, dims='obs_id')

    idata_trace = pm.sample(draws=10_000, tune=5_000, chains=2, target_accept=0.95)

Is this a valid way to structure the model? It samples fine with made-up data and gives me the output I'm after: the uncertainty around μ for each combination of categorical variables.
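
For reference, this is roughly how I'm pulling that uncertainty out of the trace (a minimal sketch assuming ArviZ; idata_trace is the trace from the model above):

import arviz as az

# Posterior mean, HDI, and diagnostics for each (phase, region) cell of μ
print(az.summary(idata_trace, var_names=['μ'], hdi_prob=0.94))

# Forest plot to compare the per-category rates visually
az.plot_forest(idata_trace, var_names=['μ'], combined=True)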

What about extending this to all three variables? The change to the likelihood is just the extra index (the priors also need to pick up the 'process' dimension; full sketch below the snippet), but unfortunately the three-variable version keeps killing my kernel.

# Likelihood
obs = pm.NegativeBinomial('obs', mu=μ[phase_idx, region_idx, process_idx],
                          alpha=α[phase_idx, region_idx, process_idx],
                          observed=counts, dims='obs_id')
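
Putting it together, the three-variable version looks roughly like this (count_model_3d is just a placeholder name; same data prep and COORDS as above):

with pm.Model(coords=COORDS) as count_model_3d:

    counts = pm.Data('counts', count, dims='obs_id')

    ## Hyperpriors
    mu_lam = pm.Exponential('mu_lam', lam=0.3)
    a_lam = pm.Exponential('a_lam', lam=2)

    ## Priors: one (μ, α) pair per (phase, region, process) cell
    μ = pm.Exponential('μ', lam=mu_lam, dims=('phase', 'region', 'process'))
    α = pm.Exponential('α', lam=a_lam, dims=('phase', 'region', 'process'))

    # Likelihood
    obs = pm.NegativeBinomial('obs', mu=μ[phase_idx, region_idx, process_idx],
                              alpha=α[phase_idx, region_idx, process_idx],
                              observed=counts, dims='obs_id')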

Thanks in advance!

I’d like to think a bit more deeply about this, but at first glance this looks good to me.

Just a question to clarify: when does the kernel get killed? Is it during sampling, or before it even gets going? And how many parameters do you end up estimating when you add the third variable?

I’m asking because I’m wondering if the dead kernel might actually be due to your computer running out of memory! If your dataset is large and you have a lot of parameters, that could happen, especially because you’re taking 10,000 draws on each of two chains, which is a lot. Is there a reason you’re drawing so many? In general, I would think that 1,000 should be more than sufficient.
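
As a rough back-of-the-envelope check (just arithmetic on the dims from your model above, nothing you need to run):

# One μ and one α per (phase, region, process) cell, plus the two hyperpriors
n_cells = len(phase) * len(region) * len(process)
n_params = 2 * n_cells + 2

# Posterior values kept in memory after sampling: chains × draws × parameters
print(n_params, 2 * 10_000 * n_params)

If that second number runs into the tens of millions, memory is a very plausible culprit, and something like pm.sample(draws=1_000, tune=1_000) will usually still give you plenty of effective samples.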

Hi Martin,

Thanks for the feedback. I’m pretty sure it is a RAM issue, plus a browser one. The memory limits kill the kernel, but I was also having stall-outs in Safari, which would then kill and refresh the Jupyter Notebook visuals. That seems to have resolved itself after switching to Firefox.

No specific reason for the 10K draws. On some mixed test data (e.g. some missing phases, different numbers of regions, etc.) I got some ESS warnings, so I thought I would boost the total number of draws.
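
I'll try checking the ESS directly next time rather than just inflating the draw count, along the lines of (assuming ArviZ):

import arviz as az

# Bulk and tail effective sample sizes for the rate and dispersion parameters
print(az.ess(idata_trace, var_names=['μ', 'α'], method='bulk'))
print(az.ess(idata_trace, var_names=['μ', 'α'], method='tail'))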
