Hi,
I am trying to use PyMC to estimate HDIs for a couple of variables of interest in the below problem setup.
Let us assume that we have a pool of ids coming from 5 different sources, for which we have access to some aggregated stats. We want to understand which are the sources which are more likely to provide “exclusive” ids, i.e IDs not provided by other sources. More details as below:
- We get to observe the total number of unique ids in the pool (
n_unique_id_observed
). This includes ids that were contributed exclusively by a source and ids that were contributed by one or more sources. - We assume that for each id in the pool, there is a constant probability that this id came exclusively from a given source (one of 5 sources, in this example,
p_exclusive_id_assumption
) - We model the number of ids that are from a single source (aka exclusive ids) as a binomial RV (
n_exclusive_id_assumption
) - We then assume that for each “exclusive” id, there is a constant probability that this id came from one of the 5 sources (
p_exclusivity_by_source_assumption
) - We also get to observe at the source level, the number of ids that came exclusively from each source. We model this as a multinomial RV (
n_exclusive_ids_by_source_assumption
)
The code snippet below generates data as described above:
n_unique_id_observed = 2500
p_exclusive_id_assumption = 0.10
n_exclusive_id_assumption = np.random.binomial(n=n_unique_id_observed,
p=p_exclusive_id_assumption,
size = 1)
p_exclusivity_by_source_assumption = np.random.dirichlet(np.ones(shape=5))
n_exclusive_ids_by_source_assumption = np.random.multinomial(n_exclusive_id_assumption,
p_exclusivity_by_source_assumption)
Now coming to our real inference challenges:
-
we want to estimate HDIs for
p_exclusivity_by_source
, i.e we are interested in know the range of possibilities which are feasible to answer the question “When we know that an id came from a single source, which source is it most likely to have come from?” -
we want to estimate HDIs for
p_exclusive_id
, i.e what are the range of possibilities for the fraction of the population which is “exclusive”
To do so, we try to estimate the parameters of a pymc model using the generated data above, as observed. We want to assert that p_exclusive_id_assumption
and p_exclusivity_by_source_assumption
lie within the HDI of p_exclusive_id
and p_exclusivity_by_source
:
with pm.Model() as model:
p_exclusive_id = pm.Beta('p_exclusive_id',
alpha=1.0,
beta=1.0)
n_exclusive_id = pm.Binomial('n_exclusive_id',
n=n_unique_id_observed,
p=p_exclusive_id)
p_exclusivity_by_source = pm.Dirichlet('p_exclusivity_by_supplier',
a = np.ones(5))
n_exclusive_ids_by_source = pm.Multinomial('n_exclusive_ids_by_source',
n=n_exclusive_id,
p=p_exclusivity_by_source,
observed = n_exclusive_ids_by_source_assumption)
Unfortunately, pm.sample with the above model returns an initval exception:
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'p_exclusive_id_logodds__': array(0.), 'n_exclusive_id': array(1158), 'p_exclusivity_by_supplier_simplex__': array([ 1.60498992, -1.41315054, -0.13382439, 0.18576826])}
Initial evaluation results:
{'p_exclusive_id': -1.39, 'n_exclusive_id': -4.1, 'p_exclusivity_by_supplier': -5.66, 'n_exclusive_ids_by_source': -inf}
using the initval='prior'
keyword arg as suggested on pm.Multinomial gives Sampling Error - #3 by ricardoV94 does not seem to help either