Question about aggregating by levels in hierarchical model

JAB · October 4, 2023, 8:47pm

I am not sure if the title is very well worded, but my question relates to aggregating levels in multilevel models. I also think this question might be super straightforward but i’ve gotten myself a bit confused and now need some clarity.

Let’s say I am working with the penguins dataset commonly used by seaborn and I have the following model:

import seaborn as sns
import pandas as pd
import pymc as pm
import pymc.sampling.jax as pmjax
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

RANDOM_SEED = 101
rng = np.random.default_rng(RANDOM_SEED)

# Load data
df = sns.load_dataset("penguins").dropna()

# Species
species_factorized, species_list = pd.factorize(df["species"])
df["species_factorized"] = species_factorized

# Island
island_factorized, island_list = pd.factorize(df["island"])
df["island_factorized"] = island_factorized

# Penguin sex
sex_factorized, sex_list = pd.factorize(df["sex"])
df["sex_factorized"] = sex_factorized

# Standardize data
features = ["bill_length_mm", "bill_depth_mm","flipper_length_mm"]
observed = "body_mass_g"
df[features+[observed]] = scaler.fit_transform(df[features + [observed]])

# Define coords
coords = {"species"  : species_list,
          "island"   : island_list,
          "sex"      : sex_list,
          "features" : features}

# Model
with pm.Model(coords=coords) as model:
    # Define data
    species_idx = pm.ConstantData("species_idx", species_factorized      , dims='obs_id')
    island_idx  = pm.ConstantData("island_idx" , island_factorized       , dims='obs_id')
    sex_idx     = pm.ConstantData("sex_idx"    , sex_factorized          , dims='obs_id')
    X           = pm.ConstantData("data"   , df[features].to_numpy() , dims=('obs_id','features'))

    # priors
    mu0 = pm.Normal("mu0", mu=0, sigma=10)
    sigma0 = pm.HalfNormal("sigma0",1)

    # params
    beta_species = pm.Normal("beta_species", mu=mu0, sigma=sigma0, dims="species")
    beta_island  = pm.Normal("beta_island" , mu=mu0, sigma=sigma0, dims="island")
    beta_sex     = pm.Normal("beta_sex"    , mu=mu0, sigma=sigma0, dims="sex")
    beta_data    = pm.Normal("beta_data"   , mu=mu0, sigma=sigma0, dims="features")

    # linear model
    like_mean = pm.Deterministic("like_mean",beta_species[species_idx] + beta_island[island_idx] + beta_sex[sex_idx] + (X*beta_data).sum(axis=-1),dims="obs_id")
    
    # Likelihood
    error = pm.Exponential("error",lam=1)
    likelihood = pm.Normal("likelihood",mu=like_mean,sigma=error,observed=df[observed].to_numpy(),dims="obs_id")

    # Sample
    idata = pmjax.sample_numpyro_nuts(2000,chains=4, idata_kwargs=dict(log_likelihood = True), random_seed=RANDOM_SEED)

idata.posterior = idata["posterior"].assign_coords({"island_name"  : ("obs_id", df.island.tolist()),
                                                    "species_name" : ("obs_id", df.species.tolist()),
                                                    "sex_name"     : ("obs_id", df.sex.tolist())})

I want to use this model for understanding my data as well as future prediction. My question is, lets say that I want to get the distribution of body_mass_g, but I want to aggregate out a level. For example,

pd.DataFrame(df[["species","island"]].value_counts())

returns:

species	island	count
Gentoo	Biscoe	119
Chinstrap	Dream	68
Adelie	Dream	55
Adelie	Torgersen	47
Adelie	Biscoe	44

So lets say I want to aggregate over species to know something about a specific location or species. For example, Adelie is on multiple islands or the island of Biscoe has multiple species or I may want to ask about male/female. How can I do this in the xarray idata output? Is it as simple as indexing by species/island/sex?

e.g.,

(idata["posterior"]
 .where(idata["posterior"]["island_name"] == "Biscoe",drop=True)
 .like_mean.mean(("chain","draw")))

What if I essentially want to do away with one of the level? For example, “remove” sex from my model.

Thanks!

JAB · October 5, 2023, 4:58pm

Now that I think about this, I think my question is a bit confusing. Let me try to clarify. I did not include this in the last question, but lets assume I have unseen data or I am using the same data used for training.

I defined my likelihood mean as

beta_species[species_idx] + beta_island[island_idx] + beta_sex[sex_idx] + (X*beta_data).sum(axis=-1)

So assuming my test data are the same structure, I should not have any issue. My question is, what if I don’t have one of the features, for example, island. I have the species, the sex, and all the measurements. Is there a way for me to still use my model and also retain the same uncertainty estimates I would normally have?

I could imagine that I could apply my model with my measured data and then for each island individually. I could include all my predictions in my posterior distribution of weight.

I can’t tell if this is problematic or something that would be fair.

My goal is to build a single model for my data and then apply it at varying levels of granularity, depending on the level required for each project.

I hope this helps clarify my question.

Topic		Replies	Views
Multi-Multilevel Modeling Questions	9	2667	February 15, 2023
Trying to use ZeroSUMNormal and not understanding results v5 modeling	8	582	May 2, 2023
Hierarchical model and pandas dataframe Questions	7	2356	July 29, 2021
How do I correctly add an additional level to a hierarchical model?	4	582	July 14, 2022
Basic population and group level estimation example: suggestions for improvements? version agnostic model-checking	6	604	April 8, 2023

Question about aggregating by levels in hierarchical model

Related topics