How to define Dirichlet Multinomial with different totals

Let’s say I want to find the true proportion of colors in (very small) bags of Skittles, where the possible colors are red, blue, and green. I have data that looks like this:

R B G
2 3 1
5 6 0
1 1 2
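
For concreteness, here is roughly how I have it in code (the variable names are just placeholders I’m using):

import numpy as np

# three bags, columns are R, B, G
skittle_data = np.array([[2, 3, 1],
                         [5, 6, 0],
                         [1, 1, 2]])

total_counts = skittle_data.sum(axis=1)  # row totals: 6, 11, 4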

If I want to characterize p(R), I can use a Dirichlet-Multinomial (either with a Dirichlet + Multinomial, or with the DirichletMultinomial conjugate distribution). What I don’t understand is that it requires a single row total n, while I have row totals of 6, 11 and 4. How would I encode this?

I’ve seen an example where n is defined as a DiscreteUniform, but it’s not clear to me if proceeding this way is problematic.

Welcome!

If you haven’t already, I would suggest checking out this notebook as a general reference.

It’s not clear what the “it” is here. The multinomial should be able to take a vector of total n’s. Something like this maybe?

# the row sums give each bag's total count
total_counts = [sum(row) for row in rows_of_data]
with pm.Model() as model_multinomial:
    frac = pm.Dirichlet('frac', a=np.ones(k))
    counts = pm.Multinomial('counts',
                            n=total_counts,
                            p=frac, shape=(n, k),
                            observed=observed_counts)

Hi @cluhmann, thanks!

In your linked example it’s a little confusing, since it defines a variable n = 10 (forests observed) and then, separately, the Multinomial’s n argument is assigned total_count = 50.

In my situation, total_count varies. Quoting the docs,

n: int or array
    Number of trials (n > 0). If n is an array its shape must be (N,) with N = p.shape[0].

p: one- or two-dimensional array
    Probability of each one of the different outcomes. Elements must be non-negative and sum to 1 along the last axis. They will be automatically rescaled otherwise.

So I think n should be the row sum of each row of data. If it must have shape (p.shape[0],), then I suppose that means I need a 2-dimensional frac.

So then my code becomes

skittle_data = np.array([[2, 3, 1],
                         [5, 6, 0],
                         [1, 1, 2]])

total_counts = [sum(i) for i in skittle_data]  # per-bag totals: 6, 11, 4
k = 3  # number of colors
n = 3  # number of bags

with pm.Model() as w_model:
    frac = pm.Dirichlet("frac", a=np.ones(k), shape=(n, k))
    counts = pm.Multinomial("counts",
                            n=total_counts,
                            p=frac, shape=(n, k),
                            observed=skittle_data)
    trace_ws = pm.sample(return_inferencedata=True)

This runs but now frac is 2d when I wanted a 1d simplex:

az.summary(trace_ws)

            mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
frac[0, 0]  0.332  0.148   0.071    0.595      0.003    0.002      3212      2331      1
frac[0, 1]  0.444  0.154   0.150    0.720      0.003    0.002      3312      2754      1
frac[0, 2]  0.224  0.130   0.014    0.461      0.002    0.001      5144      2924      1
frac[1, 0]  0.429  0.130   0.195    0.674      0.002    0.001      5209      2796      1
frac[1, 1]  0.498  0.129   0.257    0.739      0.002    0.001      5310      3054      1
frac[1, 2]  0.073  0.068   0.000    0.199      0.001    0.001      2321      1405      1
frac[2, 0]  0.285  0.159   0.024    0.576      0.003    0.002      2969      2438      1
frac[2, 1]  0.287  0.158   0.018    0.568      0.003    0.002      2848      2398      1
frac[2, 2]  0.428  0.171   0.130    0.760      0.002    0.002      5290      3221      1

How would I get a single set of estimates frac[0], frac[1], frac[2]?

Assuming you want all bags to be governed by the same Dirichlet, I think the above line shouldn’t need the shape argument:

frac = pm.Dirichlet("frac", a=np.ones(k))

If I set it that way, I get a Bad initial energy error:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 191, in _start_loop
    point, stats = self._compute_point()
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/parallel_sampling.py", line 216, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/arraystep.py", line 276, in step
    apoint, stats = self.astep(array)
  File "/Users/pmccarthy/git/inflight_fulfillment/venv/lib/python3.8/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 159, in astep
    raise SamplingError("Bad initial energy")
pymc3.exceptions.SamplingError: Bad initial energy
"""

It seems that sometimes happens because of NA data, so I’m guessing something is underspecified here. I’ll update this post while I look for a possible solution.


@patrickjmc Can you run a prior predictive check to see if the generated data looks right and the shapes of the priors are right?

If I don’t specify the shape of the prior and only leave the alpha term, I get

with w_model:
    prior_checks = pm.sample_prior_predictive(samples=50, random_seed=1138)

prior_checks['frac'].shape
> (50, 3)

If I specify a shape of (3, 3), then I get (50, 3, 3).

I went back to the example that started me down this path (Isaac Slavitt: Dirichlet-multinomial model for Skittle proportions) and I tried to duplicate his work line by line.

When I get to the first modeling block I hit an error:

color_names = df.columns

with pm.Model() as model:
    # make the Dirichlet an uninformative prior
    alpha = np.ones(len(color_names))
    # choose true proportions
    p = pm.Dirichlet("p", a=alpha)
    # choose a bag size
    n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
    # choose how many of each color to put in that bag adding up to n based on proportions p
    k = pm.Multinomial("k", n=n, p=p, observed=df)
    
    trace = pm.sample(2500)
    fit = az.from_pymc3(trace=trace, coords={"p_dim_0": color_names})

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_148063/2250084870.py in <module>
      9     n = pm.DiscreteUniform("n", lower=40, upper=100, observed=df.sum(axis=1).values)
     10     # choose how many of each color to put in that bag adding up to n based on proportions p
---> 11     k = pm.Multinomial("k", n=n, p=p, observed=df)
     12 
     13     trace = pm.sample(2500)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
    119             dist = cls.dist(*args, **kwargs, shape=shape)
    120         else:
--> 121             dist = cls.dist(*args, **kwargs)
    122         return model.Var(name, dist, data, total_size, dims=dims)
    123 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/distribution.py in dist(cls, *args, **kwargs)
    128     def dist(cls, *args, **kwargs):
    129         dist = object.__new__(cls)
--> 130         dist.__init__(*args, **kwargs)
    131         return dist
    132 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/pymc3/distributions/multivariate.py in __init__(self, n, p, *args, **kwargs)
    577             self.p = tt.as_tensor_variable(p)
    578 
--> 579         self.mean = self.n * self.p
    580         mode = tt.cast(tt.round(self.mean), "int32")
    581         diff = self.n - tt.sum(mode, axis=-1, keepdims=True)

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/tensor/var.py in __mul__(self, other)
    126         # and the return value in that case
    127         try:
--> 128             return theano.tensor.mul(self, other)
    129         except (NotImplementedError, TypeError):
    130             return NotImplemented

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in __call__(self, *inputs, **kwargs)
    251 
    252         if config.compute_test_value != "off":
--> 253             compute_test_value(node)
    254 
    255         if self.default_output is not None:

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in compute_test_value(node)
    128     thunk.outputs = [storage_map[v] for v in node.outputs]
    129 
--> 130     required = thunk()
    131     assert not required  # We provided all inputs
    132 

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/graph/op.py in rval()
    604 
    605         def rval():
--> 606             thunk()
    607             for o in node.outputs:
    608                 compute_map[o][0] = True

~/miniconda3/envs/pymc/lib/python3.8/site-packages/theano/link/c/basic.py in __call__(self)
   1769                 print(self.error_storage, file=sys.stderr)
   1770                 raise
-> 1771             raise exc_value.with_traceback(exc_trace)
   1772 
   1773 

ValueError: Input dimension mis-match. (input[0].shape[0] = 468, input[1].shape[0] = 5)

Maybe related?

Intel / Conda
python 3.8.12
arviz==0.11.4
pymc3==3.11.2
theano-pymc==1.1.2

That’s odd. I can replicate the shape error and it doesn’t make any sense to me (at least not immediately). Can you try this and see if it works?

with pm.Model() as model:
    alpha = np.ones(df.shape[1])  # one concentration entry per color
    k = pm.DirichletMultinomial("k",
                                n=np.sum(df, axis=1),  # one total per row
                                a=alpha,
                                shape=df.shape,
                                observed=df)
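
With alpha fixed there is nothing to sample in that model, but if the shapes line up, the log-probability should evaluate at the test point. A quick sanity check (assuming pymc3 3.11.x) would be something like:

# log-probability of each variable at the model's test point;
# a nan or -inf here would point at a shape or support problem
print(model.check_test_point())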

If you can, try the beta v4 version: we have cleaned up the DirichletMultinomial logp a bit and dropped the shape limitations that you mentioned from the docstrings above.

It should definitely work with multiple n totals. These are some of the combinations we are testing in our suite.
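
As a rough illustration (this is just a sketch, not the test-suite code itself), passing one total per row should work along these lines in v4, assuming the beta is installed under the pymc name:

import numpy as np
import pymc as pm  # the v4 beta installs as "pymc" rather than "pymc3"

skittle_counts = np.array([[2, 3, 1],
                           [5, 6, 0],
                           [1, 1, 2]])

with pm.Model() as model_v4:
    frac = pm.Dirichlet("frac", a=np.ones(3))   # shared color proportions
    conc = pm.LogNormal("conc", mu=1, sigma=1)  # overall concentration
    counts = pm.DirichletMultinomial(
        "counts",
        n=skittle_counts.sum(axis=1),           # one total per bag: 6, 11, 4
        a=frac * conc,
        observed=skittle_counts,
    )
    idata = pm.sample()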


I had a similar problem and used the code below, which worked really well. I found Byron Smith’s blog post here very informative.

I changed the data frame to match what you have above.

observed_counts = df_skittle_bags.loc[:, ['R', 'B', 'G']]
k = observed_counts.shape[1]  # number of colors
n = observed_counts.shape[0]  # number of bags
total_counts = observed_counts.sum(axis=1)

with pm.Model() as model_dm_marginalized:

    # shared color proportions across all bags
    frac = pm.Dirichlet("frac", a=np.ones(k))

    # concentration: how tightly each bag's proportions cluster around frac
    conc = pm.Lognormal("conc", mu=1, sigma=1)

    counts = pm.DirichletMultinomial(
        "counts", n=total_counts, a=frac * conc, shape=(n, k), observed=observed_counts
    )

    trace = pm.sample(draws=1_000, cores=8, return_inferencedata=True)
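
Afterwards, the shared proportions and the concentration can be summarized with something like:

import arviz as az

# posterior summary of the shared simplex and the concentration parameter
az.summary(trace, var_names=["frac", "conc"])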