Question about Hierarchical Dirichlet-Multinomial

This question has two major parts.

Question 1

I am trying to build a model on some observed categorical counts. What I really care about is the latent rate parameter that leads to these observed counts. I have been working off of this example and have been able to make it work just fine for the general case with my data. However, my data has a hierarchical structure that I would like to include into the model. I think this skittles example is a good one to work off of (I can’t really share the specifics about my problem).

Imagine that I were to add hierarchical structure. For example, we could add levels for bag size, packing facility, and year. I know this solution was never fully vetted but I am adapting it for example's sake:

colors = ["red", "orange", "yellow", "green", "purple"]
data = np.array([[3, 2, 4, 23, 4],  # bag 1
                 [22, 1, 6, 2, 8],
                 ...,
                 [12, 3, 16, 42, 10]])  # bag n, king size
bags = list(range(data.shape[0]))

We might now have coords that look something like this:

coords = {
          "color"    : colors, 
          "bag"      : bags,
          "size"     : [1, 1, 2, 0, 1, 0, ...],  # {0: "Fun size", 1: "Normal", 2: "King size"}
          "facility" : [0, 1, 0, 2, 1, 1, ....], # {0: "Facility A", 1: "Facility B", 2: "Facility 3"}
          "year"     : [0, 0, ... , 1 , 1]       # {0: 2023, 1: 2024}
          }

k = len(colors)
n = len(bags)

My belief is that there is some hierarchical structure over the rates, such that the rates vary slightly between bag sizes, facilities, and even years. Something like:

red skittles occur with p = Normal(mu=0.15, sigma=..., dims=('bag', 'size', 'facility', 'year'))

This example is a bit contrived but it works. My question is: how would I add this additional structure to the model? This comes from my lack of a deep understanding of this parameterization of the DM model; I don't know whether the additional dims should be added to the Dirichlet prior, to the concentration, or within the DirichletMultinomial, or whether convergence becomes such an issue that this simply is not realistic.

with pm.Model(coords=coords) as skittles_model:

    frac = pm.Dirichlet("frac", a=np.ones(k), dims="color")
    conc = pm.Lognormal("conc", mu=1, sigma=1)
    
    bag_of_skittles = pm.DirichletMultinomial(
        "bag_of_skittles", 
        n=data.sum(axis=1), 
        a=frac * conc, 
        observed=data, 
        dims=("bag", "color")
    )
    idata = pm.sample()

The other issue is that I don't know how to index properly, e.g.,

frac = pm.Dirichlet("frac", a=np.ones(k), dims=("color", "size","facility","year"))
conc = pm.Lognormal("conc", mu=1, sigma=1,dims=("size","facility","year"))

...
a = frac[color_index, size_index, facility_index, year_index] * conc[size_index, facility_index, year_index]
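
For what it's worth, here is an untested sketch of how I imagine the indexing could work if the extra dims live on frac. (The level counts n_sizes/n_facilities/n_years, the coord names size_level/facility_level/year_level, and the per-bag integer index arrays size_idx/facility_idx/year_idx are all placeholders I am making up for illustration; the "color" axis goes last because the Dirichlet normalizes over its last axis.)

with pm.Model(coords=coords) as hier_skittles_model:

    # one probability vector over colors per (size, facility, year) combination
    frac = pm.Dirichlet(
        "frac",
        a=np.ones((n_sizes, n_facilities, n_years, k)),
        dims=("size_level", "facility_level", "year_level", "color"),
    )
    conc = pm.Lognormal("conc", mu=1, sigma=1)  # fully pooled concentration to start

    # pick out the probability vector for each bag -> shape (n, k)
    a = frac[size_idx, facility_idx, year_idx] * conc

    bag_of_skittles = pm.DirichletMultinomial(
        "bag_of_skittles",
        n=data.sum(axis=1),
        a=a,
        observed=data,
        dims=("bag", "color"),
    )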

Question 2

Is it possible to instead generalize this question to multiple beta binomial models where the latent probabilities sum to 1?

For example, we could instead use those variables as predictors within the model or as part of the hierarchical structure.

# Possibility A - multilevel p param for red
p_red          =  pm.Beta("_p_red", alpha=1, beta=1, dims=("size","facility","year"))

# ---------- OR ----------- #

# Possibility B - p red as a function of extra variables
X       = [size, facility, year]              # per-bag predictor values
fml_red = alpha + (X * beta).sum(axis=-1)     # alpha, beta priors defined elsewhere
p_red   = pm.Deterministic("p_red", pm.math.invlogit(fml_red))

# ---------- OR ----------- #

# Possibility C - each variable as penalty on p ~ beta
size_param     = pm.Normal("size_param", dims="size")
facility_param = pm.Normal("facility_param", dims="facility")
year_param     = pm.Normal("year_param", dims="year")
_p_red         = pm.Beta("_p_red", alpha=1, beta=1)
p_red          = _p_red + size_param[size_idx] + facility_param[facility_idx] + year_param[year_idx]

# Observed red count
red_hat = pm.Binomial("red_hat", n=data.sum(axis=-1), p=p_red, observed=data[:, 0])

# ---------- REPEAT FOR ALL COLORS  ----------- #

# SOME CONSTRAINT ON ALL p ~ sum(all p) = 1
# i.e., p_red + p_orange + p_yellow + p_green + p_purple == 1 (see the sketch below)
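
To make the sum-to-one constraint concrete, here is an untested sketch of the kind of thing I have in mind: give each color a logit and push the logits through a softmax, so the probabilities sum to one by construction (this is also what the Dirichlet gives for free; color_logit and p are placeholder names):

with pm.Model(coords=coords) as softmax_sketch:
    color_logit = pm.Normal("color_logit", mu=0, sigma=1, dims="color")
    # softmax guarantees the per-color probabilities are positive and sum to 1
    p = pm.Deterministic("p", pm.math.softmax(color_logit, axis=-1), dims="color")
    counts = pm.Multinomial("counts", n=data.sum(axis=-1), p=p, observed=data, dims=("bag", "color"))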

Alternatively, I could see a case where we use multiple regression as in this example.

If this is too abstract without actual data I am happy to put something together. I appreciate anyone's thoughts on this. Thanks!

If I'm understanding correctly: since DirichletMultinomial only accepts the 'n' and 'a' parameters, a hierarchical structure with more levels would come from how you combine your submodelling of {color, bag, size, facility, year} to get 'n' and 'a'. I don't spend much time with the Dirichlet-Multinomial (by which I mean none at all, other than a quick glance now and once about a year ago :sweat_smile:), but 'n' comes from the multinomial bit, which is fixed by the data(?), so really it comes down to how you combine these different groups of color, bag, etc. to get 'a'. So to answer your question: you encode that additional structure through how you combine the additional groups to determine a concentration, and thus the Dirichlet prior.

As for how to actually do this submodelling, you mention putting something like a normal distribution over {bag, size, facility, year}. Naively I'm not certain about this approach since (1) I feel like there needs to be conservation of probability across the colours, so you need a way to enforce/encode that in your model; (2) groups that determine the number of skittles in a bag would strongly influence the concentration, whereas by its nature the concentration should be relatively independent of the count; and (3) most of these are categorical variables, whereas a (multivariate) normal distribution is continuous.

I could be mistaken on a few or all of these points since I'm just visiting this problem (and could also be very tired on a Friday afternoon like this one :joy:), but a possible alternative I have in mind, mainly because of point (3), is to treat this submodelling as more of a regression-type problem with categorical variables; that could be a good route?

As for part 2 of your question, the short answer is that I don't know :sweat_smile: I think I see what you are getting at, since it is a way of doing that submodelling. The only bit that jumps out at me is that the multinomial is a generalisation of the binomial, and the multinomial makes sense here because the probability conservation you need is naturally encoded in it: probability is conserved across the colors.

Hope these random thoughts help somewhat!


Thanks for your thoughts on this @brandonhorsley. I think my issue comes down to introducing some kind of hierarchical structure so that I can estimate the true probability of each category, where the probability for each category is drawn from some prior distribution.

I put together a gist with a simulated example. I don't know that my simulated data are exactly right, but the concept works. Basically, I am able to recover the parameters with a pooled model, but I am unable to do so when I introduce new dims. It seems that the probabilities all converge to approximately equal values. It's promising that I sample successfully without divergences.

One aspect of this that I am a bit unclear about is the underlying assumptions about what frac and conc fundamentally represent. This passage from the cited DM example is helpful:

The Dirichlet-multinomial can be understood as draws from a Multinomial distribution where each sample has a slightly different probability vector, which is itself drawn from a common Dirichlet distribution. This contrasts with the Multinomial distribution, which assumes that all observations arise from a single fixed probability vector. This enables the Dirichlet-multinomial to accommodate more variable (a.k.a, over-dispersed) count data than the Multinomial.

This is perfect in that I want to assume that the probability varies slightly among observations. In my example of the skittles, I define a true probability vector for each color and then allow the number of total skittles to vary. However, what I really want is a situation where the probability of each color is conditional on facility and bag size. In my solution, I introduce the hierarchy in the concentration, but I am a bit unclear about what the concentration actually means. Additionally, I am not certain whether the added dims should be in frac, conc, or both.
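
To check my reading of that quote, this is how I understand the generative story in (untested) code; frac_true and conc_true are made-up values:

frac_true = np.array([0.2, 0.2, 0.2, 0.2, 0.2])  # average proportion of each color
conc_true = 50.0  # higher concentration -> each bag's probability vector hugs frac_true more tightly

# each bag gets its own probability vector, and the counts are multinomial draws from it
p_per_bag = pm.draw(pm.Dirichlet.dist(conc_true * frac_true), draws=100)
bag_counts = pm.draw(pm.Multinomial.dist(n=60, p=p_per_bag))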

As concentration goes up the DM approaches the Multinomial distribution.

You can see this by focusing on the Dirichlet alone, and seeing that a larger concentration reduces variability across draws (untested code):

p = [0.1, 0.3, 0.6]
for c in (1, 10, 100, 1000):
  draws = pm.draw(pm.Dirichlet.dist(p * c, size=100))
  print(draws.std(0))

Hi @ricardoV94, thanks for your input and for that code snippet. The code was close, but here is a slightly modified version that gives the intended results.

p = np.array([0.1, 0.3, 0.6]) # list was being duplicated by c not multiplied
for c in (1, 10, 100, 1000):
  draws = pm.draw(pm.Dirichlet.dist(p * c, shape=(100,3))) # or size=100
  print(draws.std(0))

This is helpful. So a few follow ups here:

If we go back to a beta-binomial model, I could easily add the extra levels in the beta distribution. e.g.,

# Define partial pooled model
with pm.Model(coords=coords) as model:
    
    # Define distribution on phi
    phi = pm.Uniform("phi", lower=0.0, upper=1.0)
    # Define distribution on kappa.
    kappa_log = pm.Exponential("kappa_log", lam=1.5)
    kappa = pm.Deterministic("kappa", pm.math.exp(kappa_log))
    
    # probability, p
    p = pm.Beta("p", alpha=phi * kappa, beta=(1.0 - phi) * kappa, dims=("facility","bag_size"))
    
    # Define likelihood as a binomial with probability p from above
    red_hat = pm.Binomial(
        "red_hat",
        n=data.total.values,
        p=p[data.facility_name_factorized, data.bag_size_name_factorized],
        observed=data.red.values,
    )

This model samples and recovers params close to the original p(red). My question is, following this logic, should the hierarchical levels be introduced on the probabilities frac, the concentration conc, or both? I am not clear what assumptions I would be making with each. To put it in words: I guess adding dims to the Dirichlet implies that the distribution of the probability of each category is conditional on those dims, i.e., the facility and bag size in my example? Then adding dims to the concentration param controls the spread of that distribution of probabilities for each category, also conditional on the levels (bag size and facility)?

So this would look something like this?

with pm.Model(coords=coords) as model2:
    # alpha params are ones; the "colors" axis goes last because the Dirichlet normalizes over its last axis
    frac = pm.Dirichlet("frac",
                        a=np.ones((len(coords["bag_size_name"]), len(coords["facility_name"]), len(colors))),
                        dims=("bag_size_name", "facility_name", "colors"))

    # concentration also has added dims
    conc = pm.Lognormal("conc", mu=1, sigma=3, dims=("facility_name", "bag_size_name", "colors"))

    # _a is frac*conc with ugly indexing; both factors index down to (observations, colors). Not sure the dims work out
    _a = (frac[data.bag_size_name_factorized, data.facility_name_factorized]
          * conc[data.facility_name_factorized, data.bag_size_name_factorized])
    counts = pm.DirichletMultinomial("counts", n=data.total.values.astype("int"), a=_a,
                                     observed=data[colors].values, dims=("observations", "colors"))

Not completely sure about this :point_up: . I would love some feedback on whether this makes sense.

I would also love to get some thoughts on how predictors could be added to the model to help inform the probabilities for each category. This feels somewhat along the lines of a multinomial regression problem, but with the added prior on probabilities from the Dirichlet?

Thanks.

It's really a model choice, but as an indication, in regular GLMs people almost always put hierarchies on mu and not on sigma (although both are valid). sigma tends to be harder to infer.

So in your case I would imagine pooling/predictors on the probabilities, not concentrations. But both are valid ofc!

In your BetaBinomial you pooled kappa completely as well.
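
If you did want group-level spread as well, kappa could get the same dims as p (untested sketch, reusing your coord names):

kappa_log = pm.Exponential("kappa_log", lam=1.5, dims=("facility", "bag_size"))
kappa = pm.Deterministic("kappa", pm.math.exp(kappa_log), dims=("facility", "bag_size"))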


For predictors on the probabilities you may ditch the Dirichlet prior (which is basically an intercept model) and build a more advanced multinomial regression model.

Ah I see, so the concentration param in the DM is analogous to kappa in the beta-binomial model? Not sure why this was not initially clear to me.

So in your case I would imagine pooling/predictors on the probabilities, not concentrations. But both are valid ofc!

This is the direction I was leaning after initially doing it on the concentration and then thinking more about the beta-binomial model. However, if I were to do both, I would basically be assuming that the probabilities for each color vary among the levels in the hierarchy, and that the spread within those levels is not fully pooled, right?

Regarding the more advanced regression model, would it be valid to use a combination of the Dirichlet prior with a linear model or other intercepts? E.g., using the Dirichlet as an intercept alongside other intercepts or a linear model:

# Linear model
X = pm.Data("X",...., dims=("observations","features"))
theta = pm.Normal("theta", mu=mu_prior, sigma=sigma_prior, dims=("features", "facility_name", "bag_size_name"))

# Intercept as penalties
intercept = pm.Normal("intercept", mu=intercept_mu_prior, sigma=intercept_sigma_prior,  dims=("colors", ...)

# Dirichlet prior
frac   = pm.Dirichlet("frac", a=np.ones(len(colors), len(coords["bag_size_name"], len(coords["facility_name"]), dims=("colors", "bag_size_name", "facility_name"))
conc   = pm.Lognormal("conc", mu=1, sigma=3, dims=("facility_name", "bag_size_name","colors"))

_a = frac*conc + intercept[indexed] + (theta[indexed]*X)

Or something more like having the linear model theta*X predict p in the multinomial directly, without the Dirichlet at all?

Probably bypass the Dirichlet; at that point it doesn't give you much. A softmax from logits to probabilities will be more intuitive and flexible.
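
Untested sketch of what I mean, reusing your coord names; X_data is whatever design matrix you build from facility, bag size, year, etc.:

with pm.Model(coords=coords) as multinomial_regression:
    X = pm.Data("X", X_data, dims=("observations", "features"))

    intercept = pm.Normal("intercept", 0, 1, dims="colors")
    beta = pm.Normal("beta", 0, 1, dims=("features", "colors"))

    # logits -> probabilities; softmax makes each row sum to 1
    # (you could also pin one color's logit to 0 as a reference category)
    logits = intercept + pm.math.dot(X, beta)
    p = pm.Deterministic("p", pm.math.softmax(logits, axis=-1), dims=("observations", "colors"))

    counts = pm.Multinomial(
        "counts",
        n=data.total.values.astype("int"),
        p=p,
        observed=data[colors].values,
        dims=("observations", "colors"),
    )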


Awesome, thanks @ricardoV94 so much for all your feedback!