Question about aggregating by levels in hierarchical model

I am not sure if the title is very well worded, but my question relates to aggregating levels in multilevel models. I also think this question might be super straightforward but i’ve gotten myself a bit confused and now need some clarity.

Let’s say I am working with the penguins dataset commonly used by seaborn and I have the following model:

import seaborn as sns
import pandas as pd
import pymc as pm
import pymc.sampling.jax as pmjax
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

rng = np.random.default_rng(RANDOM_SEED)

# Load data
df = sns.load_dataset("penguins").dropna()

# Species
species_factorized, species_list = pd.factorize(df["species"])
df["species_factorized"] = species_factorized

# Island
island_factorized, island_list = pd.factorize(df["island"])
df["island_factorized"] = island_factorized

# Penguin sex
sex_factorized, sex_list = pd.factorize(df["sex"])
df["sex_factorized"] = sex_factorized

# Standardize data
features = ["bill_length_mm", "bill_depth_mm","flipper_length_mm"]
observed = "body_mass_g"
df[features+[observed]] = scaler.fit_transform(df[features + [observed]])

# Define coords
coords = {"species"  : species_list,
          "island"   : island_list,
          "sex"      : sex_list,
          "features" : features}

# Model
with pm.Model(coords=coords) as model:
    # Define data
    species_idx = pm.ConstantData("species_idx", species_factorized      , dims='obs_id')
    island_idx  = pm.ConstantData("island_idx" , island_factorized       , dims='obs_id')
    sex_idx     = pm.ConstantData("sex_idx"    , sex_factorized          , dims='obs_id')
    X           = pm.ConstantData("data"   , df[features].to_numpy() , dims=('obs_id','features'))

    # priors
    mu0 = pm.Normal("mu0", mu=0, sigma=10)
    sigma0 = pm.HalfNormal("sigma0",1)

    # params
    beta_species = pm.Normal("beta_species", mu=mu0, sigma=sigma0, dims="species")
    beta_island  = pm.Normal("beta_island" , mu=mu0, sigma=sigma0, dims="island")
    beta_sex     = pm.Normal("beta_sex"    , mu=mu0, sigma=sigma0, dims="sex")
    beta_data    = pm.Normal("beta_data"   , mu=mu0, sigma=sigma0, dims="features")

    # linear model
    like_mean = pm.Deterministic("like_mean",beta_species[species_idx] + beta_island[island_idx] + beta_sex[sex_idx] + (X*beta_data).sum(axis=-1),dims="obs_id")
    # Likelihood
    error = pm.Exponential("error",lam=1)
    likelihood = pm.Normal("likelihood",mu=like_mean,sigma=error,observed=df[observed].to_numpy(),dims="obs_id")

    # Sample
    idata = pmjax.sample_numpyro_nuts(2000,chains=4, idata_kwargs=dict(log_likelihood = True), random_seed=RANDOM_SEED)

idata.posterior = idata["posterior"].assign_coords({"island_name"  : ("obs_id", df.island.tolist()),
                                                    "species_name" : ("obs_id", df.species.tolist()),
                                                    "sex_name"     : ("obs_id",})

I want to use this model for understanding my data as well as future prediction. My question is, lets say that I want to get the distribution of body_mass_g, but I want to aggregate out a level. For example,



species island count
Gentoo Biscoe 119
Chinstrap Dream 68
Adelie Dream 55
Adelie Torgersen 47
Adelie Biscoe 44

So lets say I want to aggregate over species to know something about a specific location or species. For example, Adelie is on multiple islands or the island of Biscoe has multiple species or I may want to ask about male/female. How can I do this in the xarray idata output? Is it as simple as indexing by species/island/sex?


 .where(idata["posterior"]["island_name"] == "Biscoe",drop=True)

What if I essentially want to do away with one of the level? For example, “remove” sex from my model.


Now that I think about this, I think my question is a bit confusing. Let me try to clarify. I did not include this in the last question, but lets assume I have unseen data or I am using the same data used for training.

I defined my likelihood mean as

beta_species[species_idx] + beta_island[island_idx] + beta_sex[sex_idx] + (X*beta_data).sum(axis=-1)

So assuming my test data are the same structure, I should not have any issue. My question is, what if I don’t have one of the features, for example, island. I have the species, the sex, and all the measurements. Is there a way for me to still use my model and also retain the same uncertainty estimates I would normally have?

I could imagine that I could apply my model with my measured data and then for each island individually. I could include all my predictions in my posterior distribution of weight.

I can’t tell if this is problematic or something that would be fair.

My goal is to build a single model for my data and then apply it at varying levels of granularity, depending on the level required for each project.

I hope this helps clarify my question.