So I am using the penguin dataset to try and understand what should happen with the ZeroSumNormal. The following is my code (adapted from 3. Linear Models and Probabilistic Programming Languages — Bayesian Modeling and Computation in Python):
import pandas as pd
import datetime as dt
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
from scipy import stats
import pytensor
pytensor.config.cxx='' #"C:\Users\wyoung1\Anaconda3\Library\mingw-w64\bin\g++.exe"
from pytensor import shared
import pytensor.tensor as at
penguins = sns.load_dataset('penguins')
# Subset to the columns needed
missing_data = penguins.isnull()[
["bill_length_mm", "flipper_length_mm", "sex", "body_mass_g"]
].any(axis=1)
# Drop rows with any missing data
penguins = penguins.loc[~missing_data]
summary_stats = (penguins.loc[:, ["species", "body_mass_g"]]
.groupby("species")
.agg(["mean", "std", "count"]))
species_one_hot = (pd.get_dummies(penguins['species'],\
columns=['species'], prefix='', prefix_sep=''))
with pm.Model() as model_penguin_mass_all_species:
# Note the addition of the shape parameter
σ = pm.HalfStudentT("σ", 100, 2000, shape=3)
μ = pm.Normal("μ", 4000, 3000, shape=3)
mass = pm.Normal("mass",
mu=μ.dot(species_one_hot.T),
sigma=σ.dot(species_one_hot.T),
observed=penguins["body_mass_g"])
trace = pm.sample()
At this point, everything is fine and I get results that make sense with the data and corresponds to the book. Then I try using the ZeroSumNormal with a dataset of only two Penguin species based what I heard on episode #74 of learning Bayesian Statistics. My assumption from there is that with two species and the following code, I should have the mean of the population as well as the difference in mean for both species:
penguins2 = penguins[penguins['species'] != 'Gentoo'].copy()
species_one_hot3 = (pd.get_dummies(penguins2['species'],
columns=['species'],
prefix=''
prefix_sep='')
)
species_one_hot3['intercept']=1
with pm.Model(coords={"predictors": species_one_hot3.columns.values}) as model_penguin_mass_all_species4:
# Note the addition of the shape parameter
σ = pm.HalfStudentT("σ", 100, 2000, dims='predictors')
μ = pm.ZeroSumNormal("μ", 3000, dims='predictors')
mass = pm.Normal("mass",
mu=μ.dot(species_one_hot3.T),
sigma=σ.dot(species_one_hot3.T),
observed=penguins2["body_mass_g"])
trace3 = pm.sample()
az.summary(trace3).round(2)
When I look at the summary of this trace, I get an intercept that is a positive addition of both population means, 7436.65, as well as negative population mens where the values seem flipped. For example I get a mean of -3730.83 for Adline which confuses me as -3730.83 is the negative value of the chinstrap mean.
Does anyone understand this distribution well enough to tell me how I have misunderstood it? Is my error simply in not creating a correct distribution for sigma as well?