How to define Dirichlet Multinomial with different totals

Let’s say I want to find the true proportion of colors in (very small) bags of Skittles, where the possible colors are red, blue, and green. I have data that looks like this:

R B G
2 3 1
5 6 0
1 1 2
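
For concreteness, here is roughly how I have it in code (the variable names are just placeholders I’m using):

import numpy as np

# three bags, columns are R, B, G
skittle_data = np.array([[2, 3, 1],
                         [5, 6, 0],
                         [1, 1, 2]])

total_counts = skittle_data.sum(axis=1)  # row totals: 6, 11, 4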

If I want to characterize p(R), I can use a Dirichlet-Multinomial (either with a Dirichlet + Multinomial, or with the DirichletMultinomial conjugate distribution). What I don’t understand is that it requires a single row total n, while I have row totals of 6, 11 and 4. How would I encode this?

I’ve seen an example where n is defined as a DiscreteUniform, but it’s not clear to me if proceeding this way is problematic.

Welcome!

If you haven’t already, I would suggest checking out this notebook as a general reference.

It’s not clear what the “it” is here. The multinomial should be able to take a vector of total n’s. Something like this maybe?

# the row sums give each bag's total count
total_counts = [sum(row) for row in rows_of_data]
with pm.Model() as model_multinomial:
    frac = pm.Dirichlet('frac', a=np.ones(k))
    counts = pm.Multinomial('counts',
                            n=total_counts,
                            p=frac, shape=(n, k),
                            observed=observed_counts)

Hi @cluhmann, thanks!

In your linked example it’s a little confusing, since it defines a variable n = 10 (forests observed) and then, separately, the Multinomial’s n argument is assigned total_count = 50.

In my situation, total_count varies. Quoting the docs,

n: int or array
    Number of trials (n > 0). If n is an array its shape must be (N,) with N = p.shape[0].

p: one- or two-dimensional array
    Probability of each one of the different outcomes. Elements must be non-negative and sum to 1 along the last axis. They will be automatically rescaled otherwise.

So I think n should be the row sum of each row of data. If it must have shape (p.shape[0],), then I suppose that means I need a 2-dimensional frac.

So then my code becomes

skittle_data = np.array([[2, 3, 1],
                         [5, 6, 0],
                         [1, 1, 2]])

total_counts = [sum(i) for i in skittle_data]  # per-bag totals: 6, 11, 4
k = 3  # number of colors
n = 3  # number of bags

with pm.Model() as w_model:
    frac = pm.Dirichlet("frac", a=np.ones(k), shape=(n, k))
    counts = pm.Multinomial("counts",
                            n=total_counts,
                            p=frac, shape=(n, k),
                            observed=skittle_data)
    trace_ws = pm.sample(return_inferencedata=True)

This runs but now frac is 2d when I wanted a 1d simplex:

az.summary(trace_ws)

            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
frac[0, 0]  0.332  0.148   0.071    0.595      0.003    0.002      3212      2331      1
frac[0, 1]  0.444  0.154   0.150    0.720      0.003    0.002      3312      2754      1
frac[0, 2]  0.224  0.130   0.014    0.461      0.002    0.001      5144      2924      1
frac[1, 0]  0.429  0.130   0.195    0.674      0.002    0.001      5209      2796      1
frac[1, 1]  0.498  0.129   0.257    0.739      0.002    0.001      5310      3054      1
frac[1, 2]  0.073  0.068   0.000    0.199      0.001    0.001      2321      1405      1
frac[2, 0]  0.285  0.159   0.024    0.576      0.003    0.002      2969      2438      1
frac[2, 1]  0.287  0.158   0.018    0.568      0.003    0.002      2848      2398      1
frac[2, 2]  0.428  0.171   0.130    0.760      0.002    0.002      5290      3221      1

How would I get a single set of estimates frac[0], frac[1], frac[2]?

Assuming you want all bags to be governed by the same Dirichlet, I think the above line shouldn’t need the shape argument:

frac = pm.Dirichlet("frac", a=np.ones(k))

If I set it that way, I get a Bad initial energy error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 191, in _start_loop
    point, stats = self._compute_point()
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 216, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/arraystep.py", line 276, in step
    apoint, stats = self.astep(array)
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 159, in astep
    raise SamplingError("Bad initial energy")
pymc3.exceptions.SamplingError: Bad initial energy
"""

It seems that sometimes happens because of NA data, so I’m guessing something is underspecified here. I’ll update this post while I look for a possible solution.


@patrickjmc Can you run a prior predictive check to see if the generated data looks right and the shapes of the priors are right?

If I don’t specify the shape of the prior and only leave the alpha term, I get

with w_model:
    prior_checks = pm.sample_prior_predictive(samples=50, random_seed=1138)

prior_checks['frac'].shape
> (50, 3)

If I specify a shape of (3, 3), then I get (50, 3, 3).

I went back to the example that started me down this path (Isaac Slavitt: Dirichlet-multinomial model for Skittle proportions) and I tried to duplicate his work line by line.

When I get to the first modeling block I hit an error:

color_names = df.columns

with pm.Model() as model:
    # make the Dirichlet an uninformative prior
    alpha = np.ones(len(color_names))
    # choose true proportions
    p = pm.Dirichlet("p", a=alpha)
    # choose a bag size
    n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
    # choose how many of each color to put in that bag adding up to n based on proportions p
    k = pm.Multinomial("k", n=n, p=p, observed=df)
    
    trace = pm.sample(2500)
    fit = az.from_pymc3(trace=trace, coords={"p_dim_0": color_names})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_148063/2250084870.py in <module>
      9     n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
     10     # choose how many of each color to put in that bag adding up to n based on proportions p
---> 11     k = pm.Multinomial("k", n=n, p=p, observed=df)
     12 
     13     trace = pm.sample(2500)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
    119             dist = cls.dist(*args, **kwargs, shape=shape)
    120         else:
--> 121             dist = cls.dist(*args, **kwargs)
    122         return model.Var(name, dist, data, total_size, dims=dims)
    123 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in dist(cls, *args, **kwargs)
    128     def dist(cls, *args, **kwargs):
    129         dist = object.__new__(cls)
--> 130         dist.__init__(*args, **kwargs)
    131         return dist
    132 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/multivariate.py in __init__(self, n, p, *args, **kwargs)
    577             self.p = tt.as_tensor_variable(p)
    578 
--> 579         self.mean = self.n * self.p
    580         mode = tt.cast(tt.round(self.mean), "int32")
    581         diff = self.n - tt.sum(mode, axis=-1, keepdims=True)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/tensor/var.py in __mul__(self, other)
    126         # and the return value in that case
    127         try:
--> 128             return theano.tensor.mul(self, other)
    129         except (NotImplementedError, TypeError):
    130             return NotImplemented

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in __call__(self, *inputs, **kwargs)
    251 
    252         if config.compute_test_value != "off":
--> 253             compute_test_value(node)
    254 
    255         if self.default_output is not None:

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in compute_test_value(node)
    128     thunk.outputs = [storage_map[v] for v in node.outputs]
    129 
--> 130     required = thunk()
    131     assert not required  # We provided all inputs
    132 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in rval()
    604 
    605         def rval():
--> 606             thunk()
    607             for o in node.outputs:
    608                 compute_map[o][0] = True

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/link/c/basic.py in __call__(self)
   1769                 print(self.error_storage, file=sys.stderr)
   1770                 raise
-> 1771             raise exc_value.with_traceback(exc_trace)
   1772 
   1773 

ValueError: Input dimension mis-match. (input[0].shape[0] = 468, input[1].shape[0] = 5)

Maybe related?

Intel / Conda
python 3.8.12
arviz==0.11.4
pymc3==3.11.2
theano-pymc==1.1.2

That’s odd. I can replicate the shape error and it doesn’t make any sense to me (at least not immediately). Can you try this and see if it works?

with pm.Model() as model:
    alpha = np.ones(df.shape[1])  # one concentration entry per color
    k = pm.DirichletMultinomial("k",
                                n=np.sum(df, axis=1),  # one total per row
                                a=alpha,
                                shape=df.shape,
                                observed=df)
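
With alpha fixed there is nothing to sample in that model, but if the shapes line up, the log-probability should evaluate at the test point. A quick sanity check (assuming pymc3 3.11.x) would be something like:

# log-probability of each variable at the model's test point;
# a nan or -inf here would point at a shape or support problem
print(model.check_test_point())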

If you can, try the beta v4 version: we have cleaned up the DirichletMultinomial logp a bit and dropped the shape limitations that you mentioned from the docstrings above.

It should definitely work with multiple n totals. These are some of the combinations we are testing in our suite.
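
As a rough illustration (this is just a sketch, not the test-suite code itself), passing one total per row should work along these lines in v4, assuming the beta is installed under the pymc name:

import numpy as np
import pymc as pm  # the v4 beta installs as "pymc" rather than "pymc3"

skittle_counts = np.array([[2, 3, 1],
                           [5, 6, 0],
                           [1, 1, 2]])

with pm.Model() as model_v4:
    frac = pm.Dirichlet("frac", a=np.ones(3))   # shared color proportions
    conc = pm.LogNormal("conc", mu=1, sigma=1)  # overall concentration
    counts = pm.DirichletMultinomial(
        "counts",
        n=skittle_counts.sum(axis=1),           # one total per bag: 6, 11, 4
        a=frac * conc,
        observed=skittle_counts,
    )
    idata = pm.sample()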


I had a similar problem and used the code below, which worked really well. I found Byron Smith’s blog post here very informative.

I changed the data frame to match what you have above.

observed_counts = df_skittle_bags.loc[:, ['R', 'B', 'G']]
k = observed_counts.shape[1]  # number of colors
n = observed_counts.shape[0]  # number of bags
total_counts = observed_counts.sum(axis=1)

with pm.Model() as model_dm_marginalized:

    # shared color proportions across all bags
    frac = pm.Dirichlet("frac", a=np.ones(k))

    # concentration: how tightly each bag's proportions cluster around frac
    conc = pm.Lognormal("conc", mu=1, sigma=1)

    counts = pm.DirichletMultinomial(
        "counts", n=total_counts, a=frac * conc, shape=(n, k), observed=observed_counts
    )

    trace = pm.sample(draws=1_000, cores=8, return_inferencedata=True)
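
Afterwards, the shared proportions and the concentration can be summarized with something like:

import arviz as az

# posterior summary of the shared simplex and the concentration parameter
az.summary(trace, var_names=["frac", "conc"])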