Hierarchical model with multiple factors in the same level

Hi all!

I have a problem where I want to make a simple linear regression involving two fish body measurements (y \sim a+b*x). There are two fish species (first level) and three factors with two groups each: sex, season, and maturity stage. I want to estimate the parameters for each species (first level) and for each group within each species, i.e., for each sex, season, and maturity stage; that is, something like this:

species level: \alpha_s and \beta_s
sex: \alpha_{s, female}, \alpha_{s, male}, \beta_{s, female}, \beta_{s, male}
season: \alpha_{s, season1}, \alpha_{s, season2}, \beta_{s, season1}, \beta_{s, season2}
maturity: \alpha_{s, immature}, \alpha_{s, mature}, \beta_{s, immature}, \beta_{s, mature}

Where s is the species.

A minimal model for one factor is straightforward:

coords = {'Parameter': ['a', 'b'],
          'Species': ['Sp1', 'Sp2'],
          'Season': ['Season1', 'Season2'}
with pm.Model(coords = coords) as season_model:
# shared priors
pars = pm.Normal('B', mu = 0, sigma = 1,
                 dims = ('Parameter', 'Species'))
eps = pm.Exponential('e', lam = 2, dims = 'Species')

# group priors
a_season = pm.Normal('a_season', mu = pars[0], sigma = 1,
           dims = ('Species', 'Season'))
b_season = pm.Normal('b_season', mu = pars[1], sigma = 1,
           dims = ('Species', 'Season'))

# linear model
lm = a_season[sp_id, season_id] + beta_season[sp_id, season_id]*x

# likelihood
y_pred = pm.Normal('y_pred',
                   mu = lm,
                   sigma = eps[sp_id],
                   observed = y)

Which yields the desired parameters:

a[sp1]
a[sp2]
a[sp1, season1]
a[sp2, season2]
b[sp1]
...

Note that, for this problem, I’m not interested in getting parameters like a[sp1, season1, female, mature].

I’m assuming that extending the model to include the other factors requires either working with the indices or some form of matrix multiplication; however, I have not found a way other than (quite suboptimally) constructing three separate hierarchical models for the factor-level parameters (one for each factor) and another for the species.

Any suggestions will be appreciated.

1 Like

Hmm could you say a bit about why? It does sound like you’d want 16 unique intercepts and 16 slopes (assuming 2 species, seasons, sexes, maturity levels). So having each parameter get four indexes would be on track?

Regardless, one option to simplify the indexing is to build a coding scheme which takes every combination of the sub-species level properties and builds a single index. So if there are 8 combinations of sex, season, maturity, you code have a variable 0-7 that indicates which combination you are on. For example male, winter, immature might be 0 while female, summer, mature might be 7. Then coding the multilevel model isn’t too complex.

Here’s that idea expressed in code. Suppose your data looks something like:

# randomly assign fish species, season, sex and maturity, assuming each is binary

data = np.random.choice([0,1],size=(100,4))
df = pd.DataFrame(data,columns=['species','season','sex','maturity'])

# generate ids for each unique combination of season, sex and maturity

id_strings = [str(df[['season','sex','maturity']].values[i]) for i in range(100)]
df['categories'] = pd.Categorical(id_strings).codes

# generate the x measurement

x = np.random.normal(size=100)
df['x'] = x

df.head()

Gives you synthetic data like:

species season sex maturity categories x
0 0 0 0 1 1 -1.167921
1 0 0 0 0 0 -0.025337
2 1 0 0 1 1 1.635376
3 0 1 0 0 4 -0.351800
4 0 1 0 0 4 1.398753

Then you code the model like:

species_id = df['species'].values
category_id = df['categories'].values
coords = {'Species': ['Sp1', 'Sp2'],
          'Category': [0,1,2,3,4,5,6,7]}

with pm.Model(coords = coords) as season_model:
    # shared priors
    shared_A = pm.Normal('shared_A', mu = 0, sigma = 1)
    shared_B = pm.Normal('shared_B', mu = 0, sigma = 1)
    e = pm.Exponential('e',1)

    # species level
    a_species = pm.Normal('a_species', mu = shared_A, sigma = 1, dims = 'Species')
    b_species = pm.Normal('b_species', mu = shared_B, sigma = 1, dims = 'Species')
    
    # group priors
    a = pm.Normal('a', mu = a_species, sigma = 1, dims = ('Category','Species'))
    b = pm.Normal('b', mu = b_species, sigma = 1, dims = ('Category','Species'))
    
    # linear model
    lm = a[category_id, species_id] + b[category_id, species_id]*x
    
    # likelihood
    y_pred = pm.Normal('y_pred',
                       mu = lm,
                       sigma = e)

Does this look the type of model you are trying to build?

2 Likes

Hi, Daniel!

Thank you very much for the response, and I’m sorry for the late reply. I didn’t realize you answered (I just returned from a field trip and probably missed the notification email).

Yes, you are right. Ideally, I’d estimate the 16 parameters; however, the sample sizes get awfully reduced, with some empty combinations. Unfortunately, the people who collected the data did not have such a design in mind. From what I’m told, the sampling was also somewhat opportunistic, which didn’t help either.

I originally coded the sub-species parameters with dims = ('species', 'season', 'sex', 'maturity'), and the species-level parameters with dims = 'species', yielding 8 indices per species (16 indices total). Big was my surprise when I realized the missing and poorly represented groups (reminder to NEVER skip EDA trusting what others tell me), but I digress. Unless I’m missing something, wouldn’t both your coding scheme and my approach yield the same model?

I opted to build separate models:

One for only the species.
One for the sexes within species.
One for the seasons within species.
One for the maturity stages within species.

This “separating” approach does not feel right, though. The species-level parameters in the sub-species models are “wasted”. At some point, I thought of “melting” the data into a single sub-species factor with each separate category (males, females, summer, winter, mature, immature). This would solve the problem of generating the correct indices, but create a problem with the replication of the observations.

1 Like

yeah they should be.

Big was my surprise when I realized the missing and poorly represented groups (reminder to NEVER skip EDA trusting what others tell me), but I digress

I think this might be a place where Bayes naturally does something quite helpful. Bayesian models have no minimal sample size requirement to behave well. If a category is empty or has a small sample size, it’s parameter will mostly be determined by the prior + the partial pooling from the categories that do have data. It’s a bit like if you have a hierarchical model and you wanted to do out of sample prediction on a new cluster. That new cluster doesn’t have any data. But everything your hyper parameters learned from the data-rich clusters will be used to predict the new cluster.

So i wouldn’t fuss about building separate models to compensate for small samples in some categories. Just build the one big model that you think is right.

1 Like