How to include data in DensityDist?

Suppose I have a generalized zero-inflated Poisson model:

\text{Pr}(k \mid \phi, \alpha, \lambda) = \begin{cases} \text{something}, & \text{if } k = 0 \\ \text{something}, & \text{if } k = 1, 2, \dots \end{cases}
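
For reference, the plain zero-inflated Poisson is one concrete instance of this shape (the generalized version also involves \alpha); with \phi as the extra probability mass at zero, it fills in the cases as

\text{Pr}(k \mid \phi, \lambda) = \begin{cases} \phi + (1 - \phi) e^{-\lambda}, & \text{if } k = 0 \\ (1 - \phi) \dfrac{\lambda^{k} e^{-\lambda}}{k!}, & \text{if } k = 1, 2, \dots \end{cases}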

I have to define the logarithm of the likelihood. Fortunately, I can use some theano functions:

def logp_(phi, alp, lam, k):
    return  # something using theano functions such as
            # theano.tensor.log or theano.tensor.exp
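
For instance, with the plain zero-inflated Poisson above standing in for the unspecified pmf (so alp goes unused in this simplification), the function could look like this:

import theano.tensor as tt

def logp_(phi, alp, lam, k):
    # Stand-in implementation: a plain zero-inflated Poisson, where
    # phi is the extra probability mass at k = 0. The generalized
    # form would also bring alp into play here.
    logp_zero = tt.log(phi + (1 - phi) * tt.exp(-lam))
    logp_pos = (tt.log(1 - phi) - lam
                + k * tt.log(lam) - tt.gammaln(k + 1))
    return tt.switch(tt.eq(k, 0), logp_zero, logp_pos)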

Now, my data are frequencies of movements of something: for k = 0 I have a frequency of 182, for k = 1 I have 41, and so on. Let's put that in a variable:

import numpy as np
counts = np.array([182,  41,  12,  2,  2,  0,  0,  1])

The documentation says that I should use DensityDist for this, along these lines:

import pymc3 as pm

with pm.Model() as model_p:
    # priors for phi, alpha and lamda
    # phi =
    # alpha =
    # lamda =

    gen_poisson = pm.DensityDist('gen_poisson', logp=logp_,
        observed=dict(phi=phi, alp=alpha, lam=lamda))

As you may see, the function logp_ needs four arguments, but only three have been passed. What about k? I don't know. How should I include my data, counts, in DensityDist? I don't think this makes sense:

gen_poisson = pm.DensityDist('gen_poisson', logp=logp_,
    observed=dict(phi=phi, alp=alpha, lam=lamda, k=counts))

Remember that each entry of counts is a frequency. The usage has to be similar to this:

with pm.Model() as z_poisson:
    # priors for psi and theta
    # psi = 
    # theta = 

    # counts should be normalized before this line
    y = pm.ZeroInflatedPoisson('y', psi, theta, observed=counts) 

After reading these questions: Model definition, variable size and custom likelihood and theano functions and densitydist, I still don't know how to include my data in DensityDist. What should I do?

That last snippet, with k=counts included in the observed dict, should be correct.

You only have to make sure that your logp_ uses tt.switch to handle k == 0 differently, and you should be good to go.
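
For example, here is a minimal end-to-end sketch with placeholder priors (the question leaves them unspecified, so these are purely illustrative choices):

import pymc3 as pm

with pm.Model() as model_p:
    # Hypothetical priors; substitute whatever fits your domain.
    phi = pm.Beta('phi', alpha=1., beta=1.)
    alpha = pm.HalfNormal('alpha', sigma=1.)
    lamda = pm.HalfNormal('lamda', sigma=10.)

    gen_poisson = pm.DensityDist('gen_poisson', logp=logp_,
        observed=dict(phi=phi, alp=alpha, lam=lamda, k=counts))

    trace = pm.sample()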

Alternatively, you can define your own Distribution instead of using DensityDist. To do so, you should write down something like this:

import theano.tensor as tt

class MyDistribution(pm.Discrete):
    def __init__(self, phi, alp, lam, *args, **kwargs):
        # Let pm.Discrete handle shape, dtype, testval, etc.
        super().__init__(*args, **kwargs)
        self.phi = tt.as_tensor_variable(phi)
        self.alp = tt.as_tensor_variable(alp)
        self.lam = tt.as_tensor_variable(lam)

    def random(self, point=None, size=None):
        raise NotImplementedError('You have to implement this yourself')

    def logp(self, value):
        phi = self.phi
        alp = self.alp
        lam = self.lam
        # value holds the observed counts; reuse the same
        # implementation as logp_
        return logp_(phi, alp, lam, value)

And this will work just like any other distribution in pymc3, whether you pass observed or not.
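
Usage would then mirror the built-in distributions (again with purely illustrative priors):

with pm.Model() as model_c:
    phi = pm.Beta('phi', alpha=1., beta=1.)
    alpha = pm.HalfNormal('alpha', sigma=1.)
    lamda = pm.HalfNormal('lamda', sigma=10.)

    y = MyDistribution('y', phi, alpha, lamda, observed=counts)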

Isn’t that weird?

I mean, k takes values 0, 1, 2, 3, ... and the output is the probability that a given number of events happens. For example, if I plot counts normalized as probabilities, the result is:

[bar plot of the empirical probabilities for k = 0, 1, 2, ...]
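
(That plot can be reproduced from counts directly; a minimal sketch:)

import matplotlib.pyplot as plt

probs = counts / counts.sum()  # normalize frequencies to probabilities
plt.bar(np.arange(len(counts)), probs)
plt.xlabel('k')
plt.ylabel('empirical probability')
plt.show()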

With the model I showed previously, for k=2 the result might be 0.03 or 0.01. For the first few values of k, the model should output something like this:

\text{Pr}(k=0 \mid \text{something for } \phi, \alpha, \lambda) = 0.75
\text{Pr}(k=1 \mid \text{something for } \phi, \alpha, \lambda) = 0.20
\text{Pr}(k=2 \mid \text{something for } \phi, \alpha, \lambda) = 0.03

With this code:

gen_poisson = pm.DensityDist('gen_poisson', logp=logp_,
    observed=dict(phi=phi, alp=alpha, lam=lamda, k=counts))

I am saying that k takes values of 182 or 41, which is not right. I have to say that I am confused by this.

If you call counts.shape[0] "n", you're saying that you observe n independent draws from your distribution, the first two of which were 182 and 41. These observations constrain the potential values of \phi, \alpha and \lambda to those which make your observations reasonably likely.
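
If, instead, what you actually observed was 182 zeros, 41 ones, and so on, then (assuming that reading of your data, which the thread doesn't settle) you would expand the frequencies into individual draws before passing them in:

# Expand frequencies into one entry per observed event:
# 182 zeros, 41 ones, 12 twos, ... (240 values in total)
k_observed = np.repeat(np.arange(len(counts)), counts)

gen_poisson = pm.DensityDist('gen_poisson', logp=logp_,
    observed=dict(phi=phi, alp=alpha, lam=lamda, k=k_observed))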