Multinomial sampling init val errors out

Hi,

I am trying to use PyMC to estimate HDIs for a couple of variables of interest in the below problem setup.

Let us assume that we have a pool of ids coming from 5 different sources, for which we have access to some aggregated stats. We want to understand which are the sources which are more likely to provide “exclusive” ids, i.e IDs not provided by other sources. More details as below:

  1. We get to observe the total number of unique ids in the pool (n_unique_id_observed). This includes ids that were contributed exclusively by a source and ids that were contributed by one or more sources.
  2. We assume that for each id in the pool, there is a constant probability that this id came exclusively from a given source (one of 5 sources, in this example, p_exclusive_id_assumption)
  3. We model the number of ids that are from a single source (aka exclusive ids) as a binomial RV (n_exclusive_id_assumption)
  4. We then assume that for each “exclusive” id, there is a constant probability that this id came from one of the 5 sources (p_exclusivity_by_source_assumption)
  5. We also get to observe at the source level, the number of ids that came exclusively from each source. We model this as a multinomial RV (n_exclusive_ids_by_source_assumption)

The code snippet below generates data as described above:

n_unique_id_observed = 2500
p_exclusive_id_assumption = 0.10
n_exclusive_id_assumption = np.random.binomial(n=n_unique_id_observed,
                                               p=p_exclusive_id_assumption,
                                               size = 1)
p_exclusivity_by_source_assumption = np.random.dirichlet(np.ones(shape=5))
n_exclusive_ids_by_source_assumption = np.random.multinomial(n_exclusive_id_assumption,
                                                               p_exclusivity_by_source_assumption)

Now coming to our real inference challenges:

  1. we want to estimate HDIs for p_exclusivity_by_source, i.e we are interested in know the range of possibilities which are feasible to answer the question “When we know that an id came from a single source, which source is it most likely to have come from?”

  2. we want to estimate HDIs for p_exclusive_id, i.e what are the range of possibilities for the fraction of the population which is “exclusive”

To do so, we try to estimate the parameters of a pymc model using the generated data above, as observed. We want to assert that p_exclusive_id_assumption and p_exclusivity_by_source_assumption lie within the HDI of p_exclusive_id and p_exclusivity_by_source:


with pm.Model() as model:
    p_exclusive_id = pm.Beta('p_exclusive_id',
                             alpha=1.0,
                             beta=1.0)
    n_exclusive_id = pm.Binomial('n_exclusive_id',
                             n=n_unique_id_observed,
                             p=p_exclusive_id)
    p_exclusivity_by_source = pm.Dirichlet('p_exclusivity_by_supplier',
                                             a = np.ones(5))
    n_exclusive_ids_by_source = pm.Multinomial('n_exclusive_ids_by_source',
                                             n=n_exclusive_id,
                                             p=p_exclusivity_by_source,
                                             observed = n_exclusive_ids_by_source_assumption)

Unfortunately, pm.sample with the above model returns an initval exception:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'p_exclusive_id_logodds__': array(0.), 'n_exclusive_id': array(1158), 'p_exclusivity_by_supplier_simplex__': array([ 1.60498992, -1.41315054, -0.13382439,  0.18576826])}

Initial evaluation results:
{'p_exclusive_id': -1.39, 'n_exclusive_id': -4.1, 'p_exclusivity_by_supplier': -5.66, 'n_exclusive_ids_by_source': -inf}

using the initval='prior' keyword arg as suggested on pm.Multinomial gives Sampling Error - #3 by ricardoV94 does not seem to help either

Can you share a reproducible example with data? Glancing at the code I don’t see anything wrong with it, so I suspect a data problem. Either some 0/1 probabilities that don’t match the observations or wrong n

Uploading code to reproduce (same as what I reproduce). Please note that at the moment I am using generated data as observed so I doubt it’s a data issue, but maybe I am doing something silly in the way I generate data?

example.py (3.1 KB)

hi @ricardoV94 does the attached python script help reproduce the error? Thanks so much for looking into this.

Hey @Gireesh_Ramji
To better answer this question here’s some things things that will help. One is more cersion numbers for the major packages, at least PyMC and Pytensor.

For debugging here’s two strategies. Have you tried fixing certain distributions to specific values and seeing if sampling workings then? Perhaps n_exclusive_id_assumption

Another is passing in specific test point values and seeing if you get a logp and gradient back in regions you expect. This is a screenshot from my book. The API is PyMC3, so I don’t think the code 1:1 will work, but the idea is the same. Check if the model is returning reasonable values in places you think it should.

1 Like

Thank you @RavinKumar - the versions are:
PyMC: 4.2.2
Aesara: 2.8.7

So I did find a fix and that is the observation that n_exclusive_id may also be constrained by providing its observed value as n_exclusive_id_by_source_assumption.sum()

So essential replacing

    n_exclusive_id = pm.Binomial('n_exclusive_id',
                             n=n_unique_id_observed,
                             p=p_exclusive_id)

with

    n_exclusive_id = pm.Binomial('n_exclusive_id',
                             n=n_unique_id_observed,
                             p=p_exclusive_id,
                             observed = n_exclusive_id_by_source_assumption.sum())

I guess I was expecting that providing this additional constraint to the model was not necessary as I thought the model would have converged to this value of n_exclusive_id_by_source_assumption.sum() for n_exclusive_id, on account of the specification that n_exclusive_id is the number of trials for n_exclusive_id_by_source.

However it appears that the initial value checks do not have any way to infer this implicit constraint and hence is unable to reconcile the initial (internally proposed) value for n_exclusive_id with the observed value provided for n_exclusive_id_by_source.

Been getting tripped up quite frequently with initial value checks - would really appreciate being pointed to a resource that will help me understand the overall motivation and desired impact of running the initial value checks.