Discrete DensityDist

Hi all,

First of all, thanks a lot for the library. It's an incredible tool and I have found it very useful so far.

I'm not a big expert in the field, though, so please forgive me if the question is trivial. I tried browsing Google, Discourse, and the codebase without finding an answer.

I want to model some positive non-zero integers (counts > 0) as samples from a custom distribution that depends on 3 parameters, and to infer the posterior distributions of those parameters.
To do so I wrote the following:

    with pymc.Model():
        mu = pymc.Normal("mu", mu=-1, sigma=1)
        r = pymc.Exponential("r", lam=1)
        k = pymc.Gamma("k", alpha=np.log(samples.max()), beta=1)
        pymc.DensityDist("obs", mu, r, k, observed=samples, logp=customlogp)
        trace = pymc.sample(...)

where customlogp(value, mu, r, k) is an appropriate logp function and samples is the array of ints. Everything works fine and I get estimates of my parameters. Happy.

Now I make my model more complex, hypothesizing that the counts I observe have been reduced by some process. A possible strategy could be to add another layer to the model, so that my observed samples are drawn from a BetaBinomial whose number of trials n is the previous random variable (if you think this does not make sense or there is a better way to do it, please let me know :)).

I write the following:

    with pymc.Model():
        mu = pymc.Normal("mu", mu=-1, sigma=1)
        r = pymc.Exponential("r", lam=1)
        k = pymc.Gamma("k", alpha=np.log(samples.max()), beta=1)
        counts = pymc.DensityDist("ab", mu, r, k, logp=gompertz)
        alpha = pymc.Gamma("alpha", alpha=1000, beta=1)
        beta = pymc.Gamma("beta", alpha=1000, beta=1)
        pymc.BetaBinomial("obs", alpha=alpha, beta=beta, n=counts, observed=samples)
        trace = pymc.sample(...)

which breaks, because apparently “ab” gets float values, both positive and negative:

    SamplingError: Initial evaluation of model at starting point failed!
    Starting values:
    {'mu': array(-0.16700715), 'r_log__': array(0.6423041), 'k_log__': array(0.9295108), 'ab': array(-0.89165907), 'alpha_log__': array(6.0785945), 'beta_log__': array(6.55535832)}

This makes sense: I didn't specify anywhere that counts should take positive integer values.

I tried to fix this by specifying the random kwarg for DensityDist, but the sampler apparently ignores it completely and still produces float values, both positive and negative.

What is the correct way to specify a discrete DensityDist with finite support?

Thanks a lot for your help!

DensityDist accepts a dtype argument, which you can set to "int64". Other than that, PyMC doesn't usually use any information about the distribution's support. It should be enough to provide an initval that has a finite logp (the default is zero plus jitter) and to make sure your logp evaluates to -inf when given an invalid value.

Something like this (writing from memory, so may need to be tweaked):

    counts = pymc.DensityDist("ab", mu, r, k, logp=gompertz, dtype="int64", initval=1)
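
For the logp itself, a switch guard is usually enough to rule invalid values out. A minimal sketch (again from memory), with a placeholder expression standing in for your actual logp:

    import numpy as np
    import aesara.tensor as at

    def gompertz(value, mu, r, k):
        # placeholder for your actual (normalized) logp expression
        logp = (mu - 1 - r * value) * at.log(value) + r * (1 + k) * value
        # anything outside the support (integers >= 1) evaluates to -inf
        return at.switch(at.ge(value, 1), logp, -np.inf)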

Thanks a lot for your answer, @ricardoV94!
Indeed, your fix works, although only with an initval bigger than my maximum samples value. I think the reason is a methodological error of mine, a quite silly one: the model I wrote implies that the value of counts is the same for all the observations in samples, which is not true. There should be a hidden count for each sample, reduced by the action of the BetaBinomial. So there are two stochastic processes at work: one generating counts through my custom distribution, whose three parameters I want to determine, and another that, for every draw of the first, generates another value through the BetaBinomial.

I can restate the model as a DensityDist which is the convolution of the BetaBinomial and the custom probability distribution I wrote for counts. However, the new logp function is terribly slow; a single call takes seconds. Is there a smarter way to approach this kind of problem with PyMC? Thanks!

Can you share the actual logp/model?

Hi @ricardoV94,

the original logp was:

    import numpy as np
    import aesara.tensor as at

    def logp(value, mu, r, k):
        support = np.arange(1, 10000)
        # normalization constant, summed over the truncated support grid
        norm = at.exp((mu - 1 - r * support) * np.log(support) + r * (1 + k) * support).sum()
        log_norm = at.log(norm)
        return (mu - 1 - r * value) * at.log(value) + r * (1 + k) * value - log_norm
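
As an aside, I know that exponentiating and then summing over the support can overflow, so a log-sum-exp version of the normalization might be safer. A minimal sketch, assuming at.logsumexp is available in this Aesara version:

    def logp_stable(value, mu, r, k):
        support = np.arange(1, 10000)
        # same unnormalized logp, evaluated on the whole support grid
        unnorm = (mu - 1 - r * support) * np.log(support) + r * (1 + k) * support
        # log of the normalization constant, without exponentiating first
        log_norm = at.logsumexp(unnorm)
        return (mu - 1 - r * value) * at.log(value) + r * (1 + k) * value - log_norm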

To convolve it with the other distribution, I wrote this:

    def convolution(values, mu, r, k, a, b):
        # collapse repeated observations to shrink the BetaBinomial logp input
        unique, counts = at.extra_ops.unique(values, return_counts=True)
        support = at.arange(1, 10000)
        unnormlogp = (mu - 1 - r * support) * at.log(support) + r * (1 + k) * support
        norm = at.exp(unnormlogp).sum()
        logp = unnormlogp - at.log(norm)
        # BetaBinomial logp of each unique observation for every candidate count n
        logpbb = pymc.BetaBinomial.logp(unique[:, None], support, a, b)
        logptot = logpbb + logp
        # marginalize over the hidden count, then weight by observation multiplicity
        return (at.log(at.exp(logptot).sum(axis=1)) * counts).sum()

I had to use at.extra_ops.unique to cut the pymc.BetaBinomial.logp input size by an order of magnitude and speed up the calculation (still quite slow). For reasons I don't understand, this approach produces chains that always output the same value.
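
In case pymc.BetaBinomial.logp is not exposed in a given PyMC version, I believe the same term can be built with the public pymc.logp helper on a .dist() variable; a hedged sketch:

    # hypothetical equivalent of the logpbb line above, via pymc.logp
    bb = pymc.BetaBinomial.dist(alpha=a, beta=b, n=support)
    logpbb = pymc.logp(bb, unique[:, None])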

The alternative I thought is to use the following model:

    with pymc.Model():
        mu = pymc.Normal("mu", mu=-1, sigma=1)
        r = pymc.Exponential("r", lam=1)
        k = pymc.Gamma("k", alpha=np.log(samples.max()), beta=1)
        abundance = pymc.DensityDist("ab", mu, r, k, logp=logp, initval=2*samples, dtype="int64", shape=samples.size)
        alpha = pymc.Gamma("alpha", alpha=1000, beta=1)
        beta = pymc.Gamma("beta", alpha=1000, beta=1)
        pymc.BetaBinomial("obs", alpha=alpha, beta=beta, n=abundance, observed=samples)
        trace = pymc.sample(...)

but absolutely nothing converges; most of the chains drift.

Thanks for the help.

Numerical integration is always tricky. Before anything else, I would check whether the logp is returning correct results.

You can request it via model.compile_logp(vars=[abundance], sum=False) and evaluate it by passing a point dictionary like the one returned by model.initial_point.
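
A minimal sketch of that check, assuming the model block is opened as with pymc.Model() as model: and the DensityDist is bound to abundance (in recent PyMC versions initial_point is a method):

    # compile the elementwise logp of the DensityDist only
    logp_fn = model.compile_logp(vars=[abundance], sum=False)
    # default starting values for all free variables
    point = model.initial_point()
    print(logp_fn(point))  # look for nan, -inf, or obviously wrong values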

After that, I would start thinking about whether the model is reasonable. Are there redundant parameters, or hard constraints that a sampler will struggle with? How does it behave when you set some or all of the other parameters to the constants used in the simulated data?
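
For that last check, a sketch of what I mean, with hypothetical constants standing in for whatever values were used in the simulation:

    with pymc.Model() as model:
        # fix mu and k to the (hypothetical) constants used to simulate the data
        mu = -1.0
        k = 2.0
        r = pymc.Exponential("r", lam=1)
        abundance = pymc.DensityDist("ab", mu, r, k, logp=logp, initval=2 * samples, dtype="int64", shape=samples.size)
        # alpha and beta fixed too, again placeholder values
        pymc.BetaBinomial("obs", alpha=1000.0, beta=1000.0, n=abundance, observed=samples)
        trace = pymc.sample()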