This question has two major parts.
Question 1
I am trying to build a model on some observed categorical counts. What I really care about is the latent rate parameter that leads to these observed counts. I have been working off of this example and have been able to make it work just fine for the general case with my data. However, my data has a hierarchical structure that I would like to include into the model. I think this skittles example is a good one to work off of (I can’t really share the specifics about my problem).
Imagine that I were to add hierarchical structure. For example, we could add levels for bag size, packing facility, and year. I know this solution was never fully vetted but I am adapting it for example sake:
colors = ["red", "orange", "yellow", "green", "purple"]
data = np.array([[3, 2, 4, 23, 4], # bag 1
[22, 1, 6, 2, 8],
.... ,
[12, 3, 16, 42, 10]) # bag 1, king size
bags = list(range(data.shape[0]))
We might now have coords that look something like this:
coords = {
"color" : colors,
"bag" : bags,
"size" : [1, 1, 2, 0, 1, 0, ...], # {0: "Fun size", 1: "Normal", 2: "King size"}
"facility" : [0, 1, 0, 2, 1, 1, ....], # {0: "Facility A", 1: "Facility B", 2: "Facility 3"}
"year" : [0, 0, ... , 1 , 1] # {0: 2023, 1: 2024}
}
k = len(colours)
n = len(bags)
My belief is that there is some hierarchical structure over the rates such that these rates are varied slightly between bag sizes, facilities, and even year–something like,
red skittles occur with p=normal(mu=15% ,sigma=..,dims=('bag','size','facility',year'))
This example is a bit contrived but it works. My question is how would I add this additional structure to the model. This is coming from my lack of deep understanding of this parameterization of the DM model, but I don’t know if the additional dims should be added to the Dirichlet prior, the concentration, or within the DirichletMultinomial… or if convergence becomes such an issue that this simply is not realistic.
with pm.Model(coords=coords) as skittles_model:
frac = pm.Dirichlet("frac", a=np.ones(k), dims="color")
conc = pm.Lognormal("conc", mu=1, sigma=1)
bag_of_skittles = pm.DirichletMultinomial(
"bag_of_skittles",
n=data.sum(axis=1),
a=frac * conc,
observed=data,
dims=("bag", "color")
)
idata = pm.sample()
the other issue is that I don’t know how I should index properly. e.g.,
frac = pm.Dirichlet("frac", a=np.ones(k), dims=("color", "size","facility","year"))
conc = pm.Lognormal("conc", mu=1, sigma=1,dims=("size","facility","year"))
...
a = frac[color_index, size_index, facility_index, year_index]*frac[ size_index, facility_index, year_index]
Question 2
Is it possible to instead generalize this question to multiple beta binomial models where the latent probabilities sum to 1?
For example, we could instead use those variables as predictors within the model or as part of the hierarchical structure.
# Possibility A - multilevel p param for red
p_red = pm.Beta("_p_red", alpha=1, beta=1, dims=("size","facility","year"))
# ---------- OR ----------- #
# Possibility B - p red as a function of extra variables
X = [size, facility, year]
fml_red = alpha + (X *beta).sum(axis=-1)
p_red = pm.Deterministic("p_red", pm.math.invlogit(fml_red))
# ---------- OR ----------- #
# Possibility C - each variable as penalty on p ~ beta
size_param = pm.Normal("size_param", dims=("size"))
facility _param = pm.Normal("facility_param", dims=("facility"))
year_param = pm.Normal("year_param", dims=("year"))
_p_red = pm.Beta("_p_red", alpha=1, beta=1)
p_red = _p_red + size_param[size_idx] + facility _param[facility_idx] + year_param[year_idx]
# Observed red count
red_hat= pm.Binomial("red_hat", n=data.sum(axis=-1), p=p_red, observed=data[:,0])
# ---------- REPEAT FOR ALL COLORS ----------- #
# SOME CONSTRAINT ON ALL p ~ sum(all p) = 1
p_red + p_orange, p_yellow, p_green, p_purple = 1
Alternatively, I could see a case where we use multiple regression as in this example.
If this is too abstract without actual data I am happy to put something together. I appreciate anyones thoughts on this. Thanks!