Hey guys,
I am at the pre-modelling stage, trying to wrap my head around what the structure of the model should look like.
An abstract description of the data is like this. Suppose something occurs at some rate (say, per day), lam. Suppose this something can be broken into sub-categories 1 through n, each with a different rate lam_1, …, lam_n, and the total rate is the sum of the rates of the sub-processes: lam = sum_i lam_i. Then suppose each category i can be broken down further, so that each lam_i is the sum of smaller rates: lam_i = sum_k lam_ik.
Concrete example: in hockey, the number of shots on net per game might be lam ~ N(32, 7),
and let's say there are 3 types of shots: high, medium, and low danger (danger = probability of a goal),
so lam = lam_1 + lam_2 + lam_3. Now suppose high danger breaks into 3 sub-categories, lam_1 = lam_11 + lam_12 + lam_13, and medium and low also break into sub-categories of their own.
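Written out, the structure for this example is (just formalizing the above):

$$
\lambda \sim \mathcal{N}(32, 7), \qquad \lambda = \lambda_1 + \lambda_2 + \lambda_3, \qquad \lambda_1 = \lambda_{11} + \lambda_{12} + \lambda_{13},
$$

and similarly for $\lambda_2$ and $\lambda_3$ with their own sub-categories.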
My question is: in this sort of setup, is it possible to build a hierarchical model where lam is estimated, then lam_1, lam_2, lam_3 are estimated in such a way that lam_1 + lam_2 + lam_3 = lam, then each lam_1j is estimated in such a way that sum_j lam_1j = lam_1, and so on?
So basically, in a typical hierarchical model you would have a global mean mu and then a mean mu_j for each of several sub-categories, where each mu_j is some deviation from the global mean. In this setup we instead have several sub-rates lam_j that are related to the global rate via lam_global = sum_j lam_j.
[Side note: Ah, I see that lam_global = sum_j lam_j is just one constraint on all the lam_j together, so on its own it does not pin them down. So I suppose we want to know the global average of each lam_j and use that to constrain the individual lam_j, or alternatively the global average proportions p_1, …, p_k such that lam_j = p_j * lam_global. This suggests a Dirichlet distribution over the proportions (p_1, …, p_k) at each level of sub-categories.]
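To make that side note concrete, here is a minimal sketch of what I am imagining, in PyMC. Everything here is a placeholder I made up for illustration (toy data, flat Dirichlet concentrations, a truncated normal so the rate stays positive). One nice property of this construction: thinning a Poisson count across categories with proportions p gives independent Poisson(p_j * lam) counts, so the sum constraints hold exactly by construction.

```python
import numpy as np
import pymc as pm

# toy data: per-game counts for 3 top categories x 3 sub-categories each
rng = np.random.default_rng(1)
obs = rng.poisson(lam=3.0, size=(100, 3, 3))

with pm.Model() as model:
    # global shots-per-game rate (truncated at 0 so the Poisson mean stays valid)
    lam = pm.TruncatedNormal("lam", mu=32, sigma=7, lower=0)

    # top-level split: fraction of the global rate going to each danger category
    p_top = pm.Dirichlet("p_top", a=np.ones(3))
    lam_top = pm.Deterministic("lam_top", lam * p_top)  # sums to lam by construction

    # second-level split: each danger category divided among its 3 sub-categories
    p_sub = pm.Dirichlet("p_sub", a=np.ones((3, 3)), shape=(3, 3))
    lam_sub = pm.Deterministic("lam_sub", lam_top[:, None] * p_sub)  # row i sums to lam_top[i]

    # likelihood on the finest-level per-game counts
    y = pm.Poisson("y", mu=lam_sub, observed=obs)

    idata = pm.sample()
```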
The reason I want to do this: if you ignored these nested categories and just went straight to the finest level of categories (say, 40 types of shots), you would get a bunch of very small rates with high variance that are highly correlated (for example, if you have a lot of high danger sub-category 1, then you should also have a lot of high danger sub-category 2). So estimating the rates for the bigger categories and then using those as a basis for the smaller sub-categories at each step is meant to tie the levels together and help "regularize" these highly correlated rates.
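That tying-together could also show up in the priors on the splits themselves: instead of fixing the Dirichlet concentrations, each row of sub-category proportions could be shrunk toward a shared base split, with a learned concentration controlling how hard the pooling pulls. Again just a hypothetical sketch (base, kappa, and the hyperparameters are made up):

```python
import numpy as np
import pymc as pm

with pm.Model():
    # shared "average split" across sub-categories
    base = pm.Dirichlet("base", a=np.ones(3))
    # pooling strength: larger kappa pulls every row of p_sub toward `base`
    kappa = pm.Gamma("kappa", alpha=2.0, beta=0.1)
    # one split per top-level category, all shrunk toward the shared base
    p_sub = pm.Dirichlet("p_sub", a=kappa * base, shape=(3, 3))
```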
Any thoughts on how one might go about this would be appreciated. (Forgive me if this explanation is not concrete enough to follow; if there is interest, I will look into providing more explicit data/examples.)