# How to define Dirichlet Multinomial with different totals

Let’s say I want to find the true proportion of colors in (very small) bags of skittles, and possible colors are red, blue, and green. I have data that looks like this:

``````R B G
2 3 1
5 6 0
1 1 2
``````

If I want to characterize p(R) I can use a Dirichlet Multinomial (either with a Dirichlet + Multinomal or the DirichletMultinomial conjugate). What I don’t understand is that since it requires a single row total `n` and I have row totals of 6, 11 and 4, how would I encode this?

I’ve seen an example where n is defined as a DiscreteUniform, but it’s not clear to me if proceeding this way is problematic.

Welcome!

If you haven’t already, I would suggest checking out this notebook as a general reference.

It’s not clear what the “it” is here. The multinomial should be able to take a vector of total n’s. Something like this maybe?

``````total_counts = [len(i) for i in rows_of_data]
with pm.Model() as model_multinomial:
frac = pm.Dirichlet('frac', a=np.ones(k))
counts = pm.Multinomial('counts',
n=total_counts,
p=frac, shape=(n, k),
observed=observed_counts)
``````

Hi @cluhmann, thanks!

In your linked example it’s a little confusing since it defines a variable `n = 10 # forests observed` and then there’s separately the Multinomial term `n` which is assigned `total_count = 50`.

In my situation, `total_count` varies. Quoting the docs,

n: int or array

Number of trials (n > 0). If n is an array its shape must be (N,) with N = p.shape[0]

p: one- or two-dimensional array

Probability of each one of the different outcomes. Elements must be non-negative and sum to 1 along the last axis. They will be automatically rescaled otherwise.

So I think `n` should be the rowsum of each row of data. If it must be of shape (p.shape[0],) then I suppose it means I need to have 2-dimensional `frac`.

So then my code becomes

``````skittle_data = np.array([[2,3,1],
[5,6,0],
[1,1,2]])

total_counts = [sum(i) for i in skittle_data]
k = 3
n = 3

with pm.Model() as w_model:
frac = pm.Dirichlet("frac", a=np.ones(k), shape=(n,k))
counts = pm.Multinomial("counts",
n=total_counts,
p=frac, shape=(n,k),
observed=skittle_data)
``````

This runs but now `frac` is 2d when I wanted a 1d simplex:

`az.summary(trace_ws)`

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
frac[0, 0] 0.332 0.148 0.071 0.595 0.003 0.002 3212 2331 1
frac[0, 1] 0.444 0.154 0.15 0.72 0.003 0.002 3312 2754 1
frac[0, 2] 0.224 0.13 0.014 0.461 0.002 0.001 5144 2924 1
frac[1, 0] 0.429 0.13 0.195 0.674 0.002 0.001 5209 2796 1
frac[1, 1] 0.498 0.129 0.257 0.739 0.002 0.001 5310 3054 1
frac[1, 2] 0.073 0.068 0 0.199 0.001 0.001 2321 1405 1
frac[2, 0] 0.285 0.159 0.024 0.576 0.003 0.002 2969 2438 1
frac[2, 1] 0.287 0.158 0.018 0.568 0.003 0.002 2848 2398 1
frac[2, 2] 0.428 0.171 0.13 0.76 0.002 0.002 5290 3221 1

How would I get a single set of estimates `frac[0],frac[1],frac[2]` ?

Assuming you want all bags to be governed by the same Dirichlet, I think the above line shouldn’t need the shape argument:

``````frac = pm.Dirichlet("frac", a=np.ones(k))
``````

If I set that I get an error, `Bad initial energy`:

``````---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 191, in _start_loop
point, stats = self._compute_point()
File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 216, in _compute_point
point, stats = self._step_method.step(self._point)
File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/arraystep.py", line 276, in step
apoint, stats = self.astep(array)
File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 159, in astep
"""
``````

It seems sometimes that happens because of NA data, so I’m guessing because then something is underspecified. Updating here while I look for a possible solution.

1 Like

@patrickjmc Can you run a prior predictive check to see if the data generated looks right and the shapes of the priors is right?

If I don’t specify the shape of the prior and only leave the alpha term, I get

``````with w_model:
prior_checks = pm.sample_prior_predictive(samples=50, random_seed=1138)

prior_checks['frac'].shape
> (50,3)
``````

if I specify a shape of `(3,3)` then I get `(50,3,3)`.

I went back to the example that started me down this path (Isaac Slavitt: Dirichlet-multinomial model for Skittle proportions) and I tried to duplicate his work line by line.

When I get to the first modeling block I hit an error:

``````color_names = df.columns

with pm.Model() as model:
# make the Dirichlet an uninformative prior
alpha = np.ones(len(color_names))
# choose true proportions
p = pm.Dirichlet("p", a=alpha)
# choose a bag size
n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
# choose how many of each color to put in that bag adding up to n based on proportions p
k = pm.Multinomial("k", n=n, p=p, observed=df)

trace = pm.sample(2500)
fit = az.from_pymc3(trace=trace, coords={"p_dim_0": color_names})

color_names = df.columns

with pm.Model() as model:
# make the Dirichlet an uninformative prior
alpha = np.ones(len(color_names))
# choose true proportions
p = pm.Dirichlet("p", a=alpha)
# choose a bag size
n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
# choose how many of each color to put in that bag adding up to n based on proportions p
k = pm.Multinomial("k", n=n, p=p, observed=df)

trace = pm.sample(2500)
fit = az.from_pymc3(trace=trace, coords={"p_dim_0": color_names})
color_names = df.columns
​
with pm.Model() as model:
# make the Dirichlet an uninformative prior
alpha = np.ones(len(color_names))
# choose true proportions
p = pm.Dirichlet("p", a=alpha)
# choose a bag size
n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
# choose how many of each color to put in that bag adding up to n based on proportions p
k = pm.Multinomial("k", n=n, p=p, observed=df)

trace = pm.sample(2500)
fit = az.from_pymc3(trace=trace, coords={"p_dim_0": color_names})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_148063/2250084870.py in <module>
9     n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
10     # choose how many of each color to put in that bag adding up to n based on proportions p
---> 11     k = pm.Multinomial("k", n=n, p=p, observed=df)
12
13     trace = pm.sample(2500)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
119             dist = cls.dist(*args, **kwargs, shape=shape)
120         else:
--> 121             dist = cls.dist(*args, **kwargs)
122         return model.Var(name, dist, data, total_size, dims=dims)
123

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in dist(cls, *args, **kwargs)
128     def dist(cls, *args, **kwargs):
129         dist = object.__new__(cls)
--> 130         dist.__init__(*args, **kwargs)
131         return dist
132

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/multivariate.py in __init__(self, n, p, *args, **kwargs)
577             self.p = tt.as_tensor_variable(p)
578
--> 579         self.mean = self.n * self.p
580         mode = tt.cast(tt.round(self.mean), "int32")
581         diff = self.n - tt.sum(mode, axis=-1, keepdims=True)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/tensor/var.py in __mul__(self, other)
126         # and the return value in that case
127         try:
--> 128             return theano.tensor.mul(self, other)
129         except (NotImplementedError, TypeError):
130             return NotImplemented

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in __call__(self, *inputs, **kwargs)
251
252         if config.compute_test_value != "off":
--> 253             compute_test_value(node)
254
255         if self.default_output is not None:

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in compute_test_value(node)
128     thunk.outputs = [storage_map[v] for v in node.outputs]
129
--> 130     required = thunk()
131     assert not required  # We provided all inputs
132

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in rval()
604
605         def rval():
--> 606             thunk()
607             for o in node.outputs:
608                 compute_map[o][0] = True

1769                 print(self.error_storage, file=sys.stderr)
1770                 raise
-> 1771             raise exc_value.with_traceback(exc_trace)
1772
1773

ValueError: Input dimension mis-match. (input[0].shape[0] = 468, input[1].shape[0] = 5)
``````

Maybe related?

Intel / Conda
python 3.8.12
arviz==0.11.4
pymc3==3.11.2
theano-pymc==121.2

That’s odd. I can replicate the shape error and it doesn’t make any sense to me (at least not immediately). Can you try this and see if it works?

``````with pm.Model() as model:
alpha = np.ones(df.shape[1])
k = pm.DirichletMultinomial("k",
n=np.sum(df, axis=1),
a=alpha,
shape=df.shape,
observed=df)
``````

If you can try the beta V4 version, we have cleaned a bit the DirichletMultinomial logp and dropped the shape limitations that you mentioned in the docstrigs above

It should definitely work with multiple `n` totals. These are some combinations we are testing in our suite:

1 Like

I had a similar problem and used the code below. It worked really well on the problem I had. I found Byron Smith’s blog post here very informative.

I changed the data frame to show what you have above.

`````` observed_counts = df_skittle_bags.loc[:, ['R', 'B', 'G']]
k = observed_counts.shape[1]
n = observed_counts.shape[0]
total_counts = observed_counts.sum(axis=1)

with pm.Model() as model_dm_marginalized:

frac = pm.Dirichlet("frac", a=np.ones(k))

conc = pm.Lognormal("conc", mu=1, sigma=1)

counts = pm.DirichletMultinomial("counts", n=total_counts, a=frac * conc, shape=(n, k), observed=observed_counts)

trace = pm.sample(draws=1_000, cores=8, return_inferencedata=True)

``````