I am not sure if the title is very well worded, but my question relates to aggregating levels in multilevel models. I also think this question might be super straightforward but i’ve gotten myself a bit confused and now need some clarity.
Let’s say I am working with the penguins dataset commonly used by seaborn and I have the following model:
import seaborn as sns
import pandas as pd
import pymc as pm
import pymc.sampling.jax as pmjax
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
RANDOM_SEED = 101
rng = np.random.default_rng(RANDOM_SEED)
# Load data
df = sns.load_dataset("penguins").dropna()
# Species
species_factorized, species_list = pd.factorize(df["species"])
df["species_factorized"] = species_factorized
# Island
island_factorized, island_list = pd.factorize(df["island"])
df["island_factorized"] = island_factorized
# Penguin sex
sex_factorized, sex_list = pd.factorize(df["sex"])
df["sex_factorized"] = sex_factorized
# Standardize data
features = ["bill_length_mm", "bill_depth_mm","flipper_length_mm"]
observed = "body_mass_g"
df[features+[observed]] = scaler.fit_transform(df[features + [observed]])
# Define coords
coords = {"species" : species_list,
"island" : island_list,
"sex" : sex_list,
"features" : features}
# Model
with pm.Model(coords=coords) as model:
# Define data
species_idx = pm.ConstantData("species_idx", species_factorized , dims='obs_id')
island_idx = pm.ConstantData("island_idx" , island_factorized , dims='obs_id')
sex_idx = pm.ConstantData("sex_idx" , sex_factorized , dims='obs_id')
X = pm.ConstantData("data" , df[features].to_numpy() , dims=('obs_id','features'))
# priors
mu0 = pm.Normal("mu0", mu=0, sigma=10)
sigma0 = pm.HalfNormal("sigma0",1)
# params
beta_species = pm.Normal("beta_species", mu=mu0, sigma=sigma0, dims="species")
beta_island = pm.Normal("beta_island" , mu=mu0, sigma=sigma0, dims="island")
beta_sex = pm.Normal("beta_sex" , mu=mu0, sigma=sigma0, dims="sex")
beta_data = pm.Normal("beta_data" , mu=mu0, sigma=sigma0, dims="features")
# linear model
like_mean = pm.Deterministic("like_mean",beta_species[species_idx] + beta_island[island_idx] + beta_sex[sex_idx] + (X*beta_data).sum(axis=-1),dims="obs_id")
# Likelihood
error = pm.Exponential("error",lam=1)
likelihood = pm.Normal("likelihood",mu=like_mean,sigma=error,observed=df[observed].to_numpy(),dims="obs_id")
# Sample
idata = pmjax.sample_numpyro_nuts(2000,chains=4, idata_kwargs=dict(log_likelihood = True), random_seed=RANDOM_SEED)
idata.posterior = idata["posterior"].assign_coords({"island_name" : ("obs_id", df.island.tolist()),
"species_name" : ("obs_id", df.species.tolist()),
"sex_name" : ("obs_id", df.sex.tolist())})
I want to use this model for understanding my data as well as future prediction. My question is, lets say that I want to get the distribution of body_mass_g
, but I want to aggregate out a level. For example,
pd.DataFrame(df[["species","island"]].value_counts())
returns:
species | island | count |
---|---|---|
Gentoo | Biscoe | 119 |
Chinstrap | Dream | 68 |
Adelie | Dream | 55 |
Adelie | Torgersen | 47 |
Adelie | Biscoe | 44 |
So lets say I want to aggregate over species to know something about a specific location or species. For example, Adelie is on multiple islands or the island of Biscoe has multiple species or I may want to ask about male/female. How can I do this in the xarray idata
output? Is it as simple as indexing by species/island/sex?
e.g.,
(idata["posterior"]
.where(idata["posterior"]["island_name"] == "Biscoe",drop=True)
.like_mean.mean(("chain","draw")))
What if I essentially want to do away with one of the level? For example, “remove” sex from my model.
Thanks!