Implementing likelihood of a TWO-inflated Poisson/hurdle distribution

Hello PyMC community,

I’m looking for advice on modelling some data from a behavioural experiment, which yields a tricky (at least to me) count distribution.

The task involves people being given a set of photographs of faces, with the instruction to sort them into piles by identity. The recorded data is the number of piles each person comes up with, which will later be related to some relevant predictors. One small catch is that this data-generating process can never produce a zero count; there will always be at least one pile, indicating that someone thought all the photographs were of the same person. A second, larger catch is that there is in fact a “correct” number of piles, which is 2 - there are two identities, but the variability in appearance tricks people.

The resulting data looks like the below, with a peak at 2 - in fact, just over half the observations are 2.

I initially tried to model this with a customised mixture, trying to emulate a ZIP. I made a custom distribution that shifted a Poisson by one and mixed it with a DiracDelta set to 2. While that samples OK, it yields a curious-looking posterior predictive, where the number of 2’s is sometimes captured and sometimes not.

I think a more correct option could be a hurdle model, where the 2’s are generated by a separate process, and the rest of the counts (with support 1, 3, 4, 5…) are generated by a Poisson. However, I’m lost on how to get the likelihood right for this! I will obviously need a custom distribution or a pm.Potential that shifts a Poisson, but I don’t know how to ‘delete’ 2 from the support and rescale things.

Any advice appreciated, thank you!


Is the number of photos given to the people constant? If so, this might be better served by a multinomial distribution?

Z.


Hey @zweli thank you, great suggestion! It is indeed the case that the number of photos given to people is constant (20 different images).

I tried to implement the multinomial, but unfortunately I don’t think it would work, unless I am missing something. Drawing from a multinomial gives a vector of counts per row, but all I have per row is a single count for that person’s number of piles (the countplot above is perhaps a bit misleading in that sense).

I wonder if something like a Categorical is the way to go here. One thing I am also wrestling with is whether to worry about the fact that, out of the 20 possible outcomes per person, a chunk are not observed (e.g. nobody produces 14 piles or more). Future data might contain those values, but they would not be captured by a categorical distribution fit only to the observed data. But it feels very frequentist to worry about future data that did not happen :smiley:


If your prior covers all 20 outcomes with non-zero probability, your posterior will also put non-zero probability on those outcomes (although it might be vanishingly small).

At any rate, if that’s important to you, it’s something you can control through priors.


Mathematically, it’s simple. For $y \in \mathbb{N}$, $\lambda > 0$, $\phi \in [0, 1]$,

$$
\text{TwoInflatedPoisson}(y \mid \lambda, \phi) =
\begin{cases}
\phi & \text{if } y = 2 \\[4pt]
\dfrac{1 - \phi}{1 - \text{Poisson}(2 \mid \lambda)} \cdot \text{Poisson}(y \mid \lambda) & \text{otherwise}
\end{cases}
$$
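As a quick numerical sanity check (a plain-Python sketch, stdlib only, not PyMC code), the pmf above can be implemented directly and verified to sum to 1 - the $\phi$ mass at 2 plus the rescaled Poisson mass everywhere else:

```python
import math

def poisson_pmf(y, lam):
    """Poisson pmf: lam^y * exp(-lam) / y!"""
    return lam**y * math.exp(-lam) / math.factorial(y)

def two_inflated_poisson_pmf(y, lam, phi):
    """Two-inflated Poisson: point mass phi at y == 2; the
    remaining 1 - phi spread over a Poisson with 2 removed
    from its support and the leftover mass rescaled."""
    if y == 2:
        return phi
    p2 = poisson_pmf(2, lam)
    return (1 - phi) * poisson_pmf(y, lam) / (1 - p2)

lam, phi = 3.0, 0.55
# Sum over a long prefix of the support; the tail beyond 100 is negligible
total = sum(two_inflated_poisson_pmf(y, lam, phi) for y in range(100))
print(round(total, 6))  # 1.0
```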

Generatively, you can think of simulating z \sim \textrm{Bernoulli}(\phi) and if z = 1, you generate 2, otherwise, you generate from a Poisson until you get a value that’s not 2.
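That generative story can be simulated directly (a NumPy sketch with made-up parameter values, useful for prior-predictive checks) - draw a Bernoulli, emit 2 on success, otherwise redraw from the Poisson until the value isn’t 2:

```python
import numpy as np

def sample_two_inflated_poisson(lam, phi, size, rng):
    """Simulate the generative process described above:
    with probability phi emit 2; otherwise draw from a
    Poisson(lam), rejecting and redrawing any 2's."""
    out = np.empty(size, dtype=int)
    for i in range(size):
        if rng.random() < phi:
            out[i] = 2
        else:
            y = rng.poisson(lam)
            while y == 2:  # rejection step: a 2 can only come from the spike
                y = rng.poisson(lam)
            out[i] = y
    return out

rng = np.random.default_rng(0)
draws = sample_two_inflated_poisson(lam=3.0, phi=0.55, size=20_000, rng=rng)
print(np.mean(draws == 2))  # close to phi = 0.55
```

Because the rejection step guarantees the Poisson branch never produces a 2, the empirical fraction of 2’s recovers $\phi$ directly.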

Hopefully someone else will be able to help you code this in PyMC.


Thanks @jessegrabowski - working with a Categorical works nicely (here with a simple Dirichlet prior vs the full model with predictors):


This would have to be a softmax approach, so from a regression perspective I could place priors on coefficients for each of the 20 outcomes and fix one to zero to identify it. That way future data would go through just fine, you’re right. I guess the prior would just come back with nothing to update it in the empty categories. Thank you!
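The identification scheme described here can be sketched in plain NumPy (hypothetical toy data and names, not the actual model): pin one category’s linear predictor to zero and softmax the rest, so the rows are proper probability vectors over the 20 outcomes.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability before exponentiating
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_outcomes = 20  # piles can range from 1 to 20
rng = np.random.default_rng(1)

x = rng.normal(size=(5, 3))               # 5 people, 3 predictors (toy data)
beta = rng.normal(size=(3, n_outcomes))   # one coefficient vector per outcome
beta[:, 0] = 0.0                          # fix the reference category to zero
                                          # to identify the model

p = softmax(x @ beta)                     # per-person outcome probabilities
print(p.shape, bool(np.allclose(p.sum(axis=1), 1.0)))
```

Unobserved categories simply keep non-zero (if small) probability, driven by the prior on their coefficients, so future data with those values poses no problem.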


Thanks @bob-carpenter! I thought something along those lines - the ZIP/Hurdle models look conceptually straightforward, but I’ve not got a good sense how to code one. I’d be interested if there was code around for something like this.

That looks great. The added benefit is that you can speak in terms of the probability of getting the right/wrong answer, and the probability of giving x wrong answers. That should make it easier to explain than a hurdle model.

If the order of the outcomes is important, it would be straightforward to move to an ordinal model.