I got all my data and can confirm that the zero-inflated categories are mostly 3 and 4: 25% of zeros in the sample – category 1’s zero-inflation is near-negligible, with about 5% zeros.
I read lots of papers, and it seems to be an active area of research! From what I understood, there are two options - none of which I know how to implement in PyMC:
- Generalizing Diallo et al., for every observation i, we assume that given the total C_{1i} + C_{2i} + ... + C_{6i} + C_{7i} = N_i, the multinomial response C_i = (C_{1i}, C_{2i}, ..., C_{7i}) is generated from the model:
\begin{equation}
C_i \sim \begin{cases}
(C_{1i}, C_{2i}, 0, C_{4i}, ..., C_{7i}), & \text{with probability $\pi$}\\
(C_{1i}, C_{2i}, C_{3i}, 0, ..., C_{7i}), & \text{with probability $\phi$}\\
mult(N_i, p_i), & \text{with probability $1- \pi - \phi$}
\end{cases}
\end{equation}
\text{where}\ p_i = (p_{1i}, ..., p_{7i}) \ \text{and}\ \Sigma_j(p_{ji}) = 1
- Reverse the thinking and model the non-zero-inflated categories (2, 5 and 6) as baseline-inflated: because they were available all the time, some of their numbers are inflated compared to their true proportion, when all categories are available. Thus, following Cai et al., if p(k|\theta) denotes a multinomial PMF with 7 categories, the model’s likelihood is:
\begin{equation}
p(Y_i = k) = \begin{cases}
\pi_k +(1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{if $k = 2, 5, 6$}\\
(1 - \pi_2 - \pi_5 - \pi_6)\ p(k|\theta), & \text{otherwise}
\end{cases}
\end{equation}
where k is the kth category in the multinomial distribution, and \pi_k is the probability of inflation of category k.
I prefer option 1 (more intuitive, less convoluted), but which is easier to implement in PyMC? How would I even go about it?
A huge thanks in advance for your help
(of course, please tell me if there is any mistake above)