Feeding counts or frequencies as observed data for a categorical variable

Total beginner here and not that familiar with the theory, so I might be completely off base. I am trying to model some data that is basically occurrence counts over integers. I’m starting from a graphical model in the literature of my academic field, which gives me a distribution over integers. The function is somewhat complex, with a utility and a soft-max, and it involves an idiosyncratic term with no simple form. Taking inspiration in particular from this example, I was able to express it in PyMC, but I can’t quite figure out how to feed it my data. What I have looks like this:

with model:
    ...
    n = pm.Data("n", value=np.arange(max_n))
    weird_factor = pm.Data("w", value=...)
    ...
    dist = ... # random vector with length max_n
    count = pm.Categorical("count", p=dist, observed=??)

So:

  1. Does this look reasonable? It seems to me that, due to the idiosyncratic term, I can’t do better than treating the integers as “data” and the final distribution as a categorical variable. Is there a plausible alternative here?
  2. How do I actually feed my counts or frequencies as observations to the model? Just putting the counts as observed data did not seem to work. Or is there a better way to fit an expected distribution to an observed one than using Categorical like this?

Hi @threepwood, and welcome to the forums.

It’s hard to say without more details. For example, it would help to share the graphical model you are trying to port to PyMC.

The argument dist to the categorical should be a simplex (i.e., non-negative values that sum to 1). The observed data should then be counts of outcomes.

Thank you for the answer!

Basically it’s a utility-based model of communication. You have worlds and messages, which are both essentially integers. You compute a utility for each message-world pair that looks like this:

U(m, n) = a \cdot D(m, n) + b \cdot F(n) + c \cdot G(m)

where D is some distance-like function (e.g. (m - n)^2) and F some simple function (e.g. simply n), but G is something without a closed form.

Then you get a conditional distribution over messages given worlds by taking the soft-max:

P(m | n) = \frac{\exp(U(m, n))}{\sum_{m'} \exp(U(m', n))}

and finally you get the distribution over messages by summing the above against a prior over worlds:

P(m) = \sum_n P(m | n) P(n)

My current implementation is to just calculate all of the above “by hand”, with a, b, c and the parameters of P(n) as parameters of the model (the latter being in practice a geometric law, P(n) \propto \exp(- \alpha n)).
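
In stripped-down form it looks something like the sketch below; max_n, the priors, and the fixed G_table standing in for G are placeholders, not the real model:

import numpy as np
import pymc as pm

max_n = 10                                   # placeholder size of the integer support
G_table = np.zeros(max_n)                    # placeholder for G(m), which has no closed form

with pm.Model() as model:
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1)
    c = pm.Normal("c", 0, 1)
    alpha = pm.HalfNormal("alpha", 1)

    m = np.arange(max_n)[:, None]            # messages (rows)
    n = np.arange(max_n)[None, :]            # worlds (columns)

    # U(m, n) = a * D(m, n) + b * F(n) + c * G(m)
    U = a * (m - n) ** 2 + b * n + c * G_table[:, None]

    # P(m | n): soft-max over messages within each world (column)
    exp_U = pm.math.exp(U)
    p_m_given_n = exp_U / exp_U.sum(axis=0, keepdims=True)

    # P(n) ∝ exp(-alpha * n), the geometric prior over worlds
    w = pm.math.exp(-alpha * np.arange(max_n))
    p_n = w / w.sum()

    # P(m) = sum_n P(m | n) P(n), a simplex of length max_n
    dist = pm.Deterministic("dist", pm.math.dot(p_m_given_n, p_n))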

Do you mean the variable I called dist, or is there an argument called dist? If the former, yes, it sums to 1 by construction. However, I could not figure out in what format to feed counts to the model: what it understands is an array of outcomes, e.g. observed=[0, 0, 0, 1, 2, 2] rather than observed=[3, 1, 2] (the latter gives me -inf log-likelihood, because the model treats the counts as if they were outcomes, and they are too large to be valid outcomes). I was able to make my code run with made-up data based on the observed frequencies, but this is not very satisfying, and inference gets very slow as the size of the made-up data increases.
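
For reference, the made-up data is essentially the counts expanded back into outcomes, e.g. with np.repeat (illustrative numbers):

import numpy as np

counts_obs = np.array([3, 1, 2])                              # counts per outcome
outcomes = np.repeat(np.arange(len(counts_obs)), counts_obs)
print(outcomes)                                               # [0 0 0 1 2 2]
# count = pm.Categorical("count", p=dist, observed=outcomes)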

What I ended up doing was to use a Multinomial with a single observation instead. I think it ought to be mathematically equivalent (is it?), but it runs much faster. So this:

dist = ...  # same simplex over the max_n outcomes as before
n_obs = np.sum(counts_obs)  # total number of observed outcomes
count = pm.Multinomial("count", n=n_obs, p=dist, observed=[counts_obs])

Yes, it’s mathematically equivalent up to a normalizing constant to have a sequence of independent categoricals vs. a multinomial.

\prod_{n=1}^N \text{categorical}(y_n \mid \theta) \propto \text{multinomial}(z(y) \mid N, \theta),

where

\quad z(y) = \sum_{n=1}^N \text{oneHot}(y_n)

and

\quad \text{oneHot}(k) = v \text{ with } v_k = 1 \text{ and } v_j = 0 \text{ if } j \neq k.

The Bernoulli and binomial have the same relation.
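
A quick numerical check with made-up numbers (using scipy only for the multinomial pmf): the two sides agree up to the multinomial coefficient N! / \prod_k z_k!, which doesn’t involve \theta.

import numpy as np
from math import factorial
from scipy.stats import multinomial

theta = np.array([0.5, 0.2, 0.3])
y = np.array([0, 0, 0, 1, 2, 2])                   # outcomes
z = np.bincount(y, minlength=len(theta))           # counts: [3 1 2]

prod_cat = np.prod(theta[y])                       # prod_n categorical(y_n | theta)
multi = multinomial.pmf(z, n=len(y), p=theta)      # multinomial(z | N, theta)

coeff = factorial(len(y)) / np.prod([factorial(k) for k in z])
assert np.isclose(multi, coeff * prod_cat)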

When doing that softmax calculation, you probably want to stay on the log scale and use

\quad \log p(m \mid n) = U(m, n) - \text{logSumExp}_{m'=1}^M U(m', n).
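
The reason is numerical stability: exponentiating large utilities overflows, whereas subtracting the logSumExp does not. A standalone illustration with made-up numbers (scipy’s logsumexp here; pm.math.logsumexp is the analogue inside a PyMC model):

import numpy as np
from scipy.special import logsumexp

U = np.array([[1000.0,  2.0],
              [1001.0,  1.0]])                 # utilities: rows = messages, columns = worlds

naive = np.exp(U) / np.exp(U).sum(axis=0)      # overflows: inf / inf -> nan
stable = np.exp(U - logsumexp(U, axis=0, keepdims=True))

print(naive[:, 0])                             # [nan nan]
print(stable[:, 0])                            # [0.269 0.731]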

Yes, you need the zeros, as it’s expecting an array of the same size as the number of outcomes, containing counts of outcomes. The zeros drop out because they induce factors \theta_k^0 = 1, so if you have a lot of sparsity you can speed up the calculation by exploiting it and only computing the terms involving non-zero counts, provided you have an efficient sparse vector representation.
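
Concretely, the variable part of the multinomial log-likelihood is \sum_k z_k \log \theta_k, and terms with z_k = 0 contribute nothing; a sketch with made-up numbers:

import numpy as np

theta = np.array([0.4, 0.3, 0.2, 0.1])
z = np.array([5, 0, 0, 2])                     # sparse count vector

full = np.sum(z * np.log(theta))               # zero counts contribute 0 * log(theta_k) = 0
nz = z > 0
sparse = np.sum(z[nz] * np.log(theta[nz]))     # only the non-zero counts

assert np.isclose(full, sparse)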

As far as I can tell that’s wrong: it expects an array of arbitrary size where the elements are outcomes (integers between 0 and n_categories - 1). What I would have liked to feed it is precisely what you describe. I don’t care about sparsity; there are no zeroes in my data.

I’m now very confused by looking at the PyMC 5.22 doc for Multinomial:

The doc defines the support as

x \in \{0, \ldots, n\} such that \sum_i x_i = n.

If it’s implemented like multinomial in any other system, that should read

x \in \{0, \ldots, n\}^K such that \sum_{k=1}^K x_k = n,

where K is the size of the simplex argument p (i.e., p has K non-negative elements that sum to one).

Here’s the doc for scipy multinomial, which matches what I would expect.
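
For instance (made-up numbers), scipy’s version takes a length-K count vector that sums to n:

import numpy as np
from scipy.stats import multinomial

p = np.array([0.5, 0.2, 0.3])                  # K = 3 probabilities summing to 1
x = np.array([3, 1, 2])                        # counts summing to n = 6

print(multinomial.pmf(x, n=6, p=p))            # 0.135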

Sorry for the confusion, I was talking about Categorical all along!
