# Hierarchical model with multiple factors in the same level

Hi all!

I have a problem where I want to fit a simple linear regression involving two fish body measurements (y \sim a + b*x). There are two fish species (first level) and three factors with two groups each: sex, season, and maturity stage. I want to estimate the parameters for each species (first level) and for each group within each species, i.e., for each sex, season, and maturity stage; that is, something like this:

species level: \alpha_s and \beta_s
sex: \alpha_{s, female}, \alpha_{s, male}, \beta_{s, female}, \beta_{s, male}
season: \alpha_{s, season1}, \alpha_{s, season2}, \beta_{s, season1}, \beta_{s, season2}
maturity: \alpha_{s, immature}, \alpha_{s, mature}, \beta_{s, immature}, \beta_{s, mature}

Where s is the species.

A minimal model for one factor is straightforward:

```python
coords = {'Parameter': ['a', 'b'],
          'Species': ['Sp1', 'Sp2'],
          'Season': ['Season1', 'Season2']}

with pm.Model(coords=coords) as season_model:
    # shared priors
    pars = pm.Normal('B', mu=0, sigma=1,
                     dims=('Parameter', 'Species'))
    eps = pm.Exponential('e', lam=2, dims='Species')

    # group priors (pars[0] reshaped so the species axis lines up
    # with the first dim of a_season/b_season when broadcasting)
    a_season = pm.Normal('a_season', mu=pars[0][:, None], sigma=1,
                         dims=('Species', 'Season'))
    b_season = pm.Normal('b_season', mu=pars[1][:, None], sigma=1,
                         dims=('Species', 'Season'))

    # linear model
    lm = a_season[sp_id, season_id] + b_season[sp_id, season_id] * x

    # likelihood
    y_pred = pm.Normal('y_pred',
                       mu=lm,
                       sigma=eps[sp_id],
                       observed=y)
```


Which yields the desired parameters:

```
a[sp1]
a[sp2]
a[sp1, season1]
a[sp1, season2]
a[sp2, season1]
a[sp2, season2]
b[sp1]
...
```


Note that, for this problem, I'm not interested in getting parameters like a[sp1, season1, female, mature].

I'm assuming that extending the model to include the other factors requires either working with the indices or some form of matrix multiplication; however, I have not found a way other than (quite suboptimally) constructing three separate hierarchical models for the factor-level parameters (one for each factor) plus another for the species.

Any suggestions will be appreciated.


Hmm, could you say a bit about why? It does sound like you'd want 16 unique intercepts and 16 slopes (assuming 2 species, seasons, sexes, and maturity levels). So having each parameter get four indices would be on track?

Regardless, one option to simplify the indexing is to build a coding scheme that takes every combination of the sub-species-level properties and maps it to a single index. So if there are 8 combinations of sex, season, and maturity, you could have a variable 0-7 that indicates which combination you are on. For example, male, winter, immature might be 0 while female, summer, mature might be 7. Then coding the multilevel model isn't too complex.

Here's that idea expressed in code. Suppose your data looks something like:

```python
import numpy as np
import pandas as pd

# randomly assign fish species, season, sex and maturity, assuming each is binary
data = np.random.choice([0, 1], size=(100, 4))
df = pd.DataFrame(data, columns=['species', 'season', 'sex', 'maturity'])

# generate ids for each unique combination of season, sex and maturity
id_strings = [str(df[['season', 'sex', 'maturity']].values[i]) for i in range(100)]
df['categories'] = pd.Categorical(id_strings).codes

# generate the x measurement
x = np.random.normal(size=100)
df['x'] = x
```



Gives you synthetic data like:

```
   species  season  sex  maturity  categories         x
0        0       0    0         1           1 -1.167921
1        0       0    0         0           0 -0.025337
2        1       0    0         1           1  1.635376
3        0       1    0         0           4 -0.351800
4        0       1    0         0           4  1.398753
```
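As an aside, if you want the code-to-combination mapping to be explicit rather than whatever order the string codes happen to sort into, here is a small sketch using `np.ravel_multi_index` (assuming the same binary season/sex/maturity columns as above):

```python
import numpy as np
import pandas as pd

# same kind of synthetic data as above: binary season, sex, maturity
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.integers(0, 2, size=(100, 3)),
                  columns=['season', 'sex', 'maturity'])

# map each (season, sex, maturity) triple to a single code 0-7;
# with dims=(2, 2, 2) the code is exactly season*4 + sex*2 + maturity
df['categories'] = np.ravel_multi_index(
    (df['season'].to_numpy(), df['sex'].to_numpy(), df['maturity'].to_numpy()),
    dims=(2, 2, 2))
```

The upside is that you can recover the factor levels from a code with `np.unravel_index`, which makes labeling the posterior easier.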

Then you code the model like:

```python
species_id = df['species'].values
category_id = df['categories'].values
coords = {'Species': ['Sp1', 'Sp2'],
          'Category': [0, 1, 2, 3, 4, 5, 6, 7]}

with pm.Model(coords=coords) as season_model:
    # shared priors
    shared_A = pm.Normal('shared_A', mu=0, sigma=1)
    shared_B = pm.Normal('shared_B', mu=0, sigma=1)
    e = pm.Exponential('e', 1)

    # species level
    a_species = pm.Normal('a_species', mu=shared_A, sigma=1, dims='Species')
    b_species = pm.Normal('b_species', mu=shared_B, sigma=1, dims='Species')

    # group priors
    a = pm.Normal('a', mu=a_species, sigma=1, dims=('Category', 'Species'))
    b = pm.Normal('b', mu=b_species, sigma=1, dims=('Category', 'Species'))

    # linear model
    lm = a[category_id, species_id] + b[category_id, species_id] * x

    # likelihood
    y_pred = pm.Normal('y_pred',
                       mu=lm,
                       sigma=e,
                       observed=y)
```


Does this look like the type of model you are trying to build?


Hi, Daniel!

Thank you very much for the response, and I'm sorry for the late reply. I didn't realize you answered (I just returned from a field trip and probably missed the notification email).

Yes, you are right. Ideally, Iâ€™d estimate the 16 parameters; however, the sample sizes get awfully reduced, with some empty combinations. Unfortunately, the people who collected the data did not have such a design in mind. From what Iâ€™m told, the sampling was also somewhat opportunistic, which didnâ€™t help either.

I originally coded the sub-species parameters with dims = ('species', 'season', 'sex', 'maturity') and the species-level parameters with dims = 'species', yielding 8 indices per species (16 indices total). I was quite surprised when I discovered the missing and poorly represented groups (a reminder to NEVER skip EDA and trust what others tell me), but I digress. Unless I'm missing something, wouldn't both your coding scheme and my approach yield the same model?

I opted to build separate models:

One for only the species.
One for the sexes within species.
One for the seasons within species.
One for the maturity stages within species.

This "separating" approach does not feel right, though. The species-level parameters in the sub-species models are "wasted". At some point, I thought of "melting" the data into a single sub-species factor with each separate category as a level (male, female, summer, winter, mature, immature). This would solve the problem of generating the correct indices, but it would create a problem with the replication of the observations.


Yeah, they should be.

> Big was my surprise when I realized the missing and poorly represented groups (reminder to NEVER skip EDA trusting what others tell me), but I digress

I think this might be a place where Bayes naturally does something quite helpful. Bayesian models have no minimum sample size requirement to behave well. If a category is empty or has a small sample size, its parameter will mostly be determined by the prior plus the partial pooling from the categories that do have data. It's a bit like having a hierarchical model and wanting to do out-of-sample prediction on a new cluster. That new cluster doesn't have any data, but everything your hyperparameters learned from the data-rich clusters will be used to predict it.

So I wouldn't fuss about building separate models to compensate for small samples in some categories. Just build the one big model that you think is right.
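To illustrate the shrinkage point with a toy normal-normal example (known variances, not your actual model): the posterior mean of a group parameter is a precision-weighted average of the hyper-mean and the group's sample mean, so an empty group falls back to the hyper-mean exactly.

```python
import numpy as np

# Normal-normal partial pooling with known variances:
# group mean theta_j ~ Normal(mu, tau); data y_ij ~ Normal(theta_j, sigma).
def posterior_group_mean(y_group, mu, tau, sigma):
    n = len(y_group)
    prior_prec = 1.0 / tau**2          # precision contributed by the prior
    data_prec = n / sigma**2           # precision contributed by the data
    data_mean = np.mean(y_group) if n > 0 else 0.0
    return (prior_prec * mu + data_prec * data_mean) / (prior_prec + data_prec)

mu, tau, sigma = 5.0, 1.0, 2.0
rich = posterior_group_mean(np.array([7.0, 8.0, 7.5, 8.5]), mu, tau, sigma)
empty = posterior_group_mean(np.array([]), mu, tau, sigma)
# rich = 6.375: pulled from the group mean 7.75 toward mu = 5
# empty = 5.0: with no data, the posterior mean is exactly the hyper-mean
```

In the full model, `mu` itself is learned from the data-rich categories, so even the empty cells inherit something sensible.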
