Varying number of probabilities for Geometric distribution or dealing with ragged matrices

Hi,

I’m trying to model several “geometric-like” random variables, d_i, where each trial’s probability p_j is different and sampled from a Beta distribution. The probability mass function of such a “geometric” random variable is:

P(X=n) = p_n \prod_{j=1}^{n-1} (1 - p_j)

where n is the number of trials until a first success. In my case, this number is observed.

I have several such random variables d_i, each with many p_j sampled from the same beta distribution. Thus, I’m trying to write a model to fit the parameters of the beta distribution, a and b:

p_j \sim \text{Beta}(a, b)
P(d_i = n \mid p_1, \ldots, p_n) = p_n \prod_{j=1}^{n-1} (1 - p_j)

This process can be simulated with the following code:

from numpy.random import default_rng

rng = default_rng()

n_rv = 5
probabilities = []
a = 0.4
b = 1.6

for i in range(n_rv):
    ps = []
    while True:
        # each trial gets its own success probability, freshly drawn from Beta(a, b)
        p = rng.beta(a, b)
        ps.append(p)
        is_success = rng.binomial(1, p)
        if is_success:
            break
    probabilities.append(ps)

print(f"number of trials for each RV: {[len(rv) for rv in probabilities]}")

My main problem is that the number of trials until the first success differs for each RV, and I don’t know how to model that in PyMC. The following works, but it models only a single RV:

import pymc as pm

def vargeometric_logp(_, probabilities):
    # log P(X = n) = log(p_n) + sum of log(1 - p_j) over the earlier trials;
    # the observed value itself is unused because `probabilities` already has length n
    return pm.math.log(probabilities[-1]) + (pm.math.log(1 - probabilities[:-1])).sum()

n_trials = [len(probabilities[0])]

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)

    rates = pm.Beta(
        "rates",
        alpha=alpha,
        beta=beta,
        size=n_trials[0],
    )

    likelihood = pm.DensityDist(
        "n_trials", rates, logp=vargeometric_logp, observed=n_trials
    )

    posterior = pm.sample(return_inferencedata=True)

The only way I found to model multiple RVs is this extremely inefficient code, where I create a list of RVs:

n_trials = [len(p) for p in probabilities]

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)

    rates = [
        pm.Beta(
            f"rates_{index}",
            alpha=alpha,
            beta=beta,
            size=n
        )
        for index, n in enumerate(n_trials)
    ]

    likelihoods = [
        pm.DensityDist(
            f"n_trials_{index}", probs, logp=vargeometric_logp, observed=[n]
        )
        for index, (n, probs) in enumerate(zip(n_trials, rates))
    ]

    posterior = pm.sample(return_inferencedata=True)

How can I improve on this? Since each row of probabilities has a different length, I’m having a hard time modeling this. As the title implies, I’ve considered ragged matrices, but I don’t know how to use them within PyMC; they don’t seem to be supported.

Thanks in advance!

If I understand the setup, rates should have the same total length as the flattened probabilities? In that case, you could use your observed data in the likelihood function as a set of indices that tell you how to combine the rates. Here’s a proof of concept:

import aesara
import aesara.tensor as at
import pymc as pm

rates = at.dvector()
n_trials = at.ivector()

# the cursor marks where each RV's block of rates starts in the flat vector
result, _ = aesara.scan(
    lambda n, cursor: pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - rates[cursor:cursor + n - 1]).sum(),
    sequences=[n_trials, at.concatenate([at.zeros(1, dtype='int'), n_trials[:-1]]).cumsum()],
)

logp = result.sum()
f_logp = aesara.function([rates, n_trials], logp)

Comparing this function with iterative application of your row-by-row function:

import numpy as np

flat_data = np.concatenate([np.array(p, dtype='float64') for p in probabilities])
n_trials = np.array([len(p) for p in probabilities])
print(f_logp(flat_data, n_trials))
>>> array(-7.0629034)

# vargeometric_logp builds a symbolic expression, so evaluate it before printing
print(sum([vargeometric_logp(None, np.array(probabilities[i])) for i in range(5)]).eval())
>>> -7.062903398916366

So you would sample Beta parameters with shape (sum(n_trials),) just once, no matter how many trials you have, and then pass them all, along with the (integer!) n_trials values, into the scan function to get the correct log likelihood.
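
To make the indexing concrete, here’s the bookkeeping in plain NumPy, reusing the probabilities list from your simulation (the variable names are just my own): each RV owns one contiguous block of the flat vector, and the cursor marks where its block starts.

import numpy as np

# flatten the ragged list of per-trial probabilities and record where
# each RV's block starts (its "cursor") in the flat vector
flat_data = np.concatenate([np.array(p, dtype='float64') for p in probabilities])
n_trials = np.array([len(p) for p in probabilities])
cursors = np.concatenate([[0], n_trials[:-1]]).cumsum()

# block i of the flat vector holds exactly the probabilities of RV i
for i, (n, cursor) in enumerate(zip(n_trials, cursors)):
    assert np.allclose(flat_data[cursor:cursor + n], probabilities[i])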

I tried setting it up but I’m not all that familiar with how DensityDist works, and I had trouble getting PyMC to accept that n_trials was an integer. Maybe I’m just missing something? You might have to write a custom distribution that inherits from DiscreteDistribution. That is above my pay grade (zero dollars).

Thanks, this is promising. I’m trying to set everything up and I’m having some issues, but I just wanted to say that the integer issue you’re having might be solved by using an lvector instead of an ivector, like so:

n_trials = at.lvector()
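
For what it’s worth, I think the reason lvector helps is that NumPy integer arrays default to int64 on most 64-bit platforms, which matches lvector but not the int32 ivector. A tiny check (just a sketch):

import numpy as np
import aesara
import aesara.tensor as at

n = np.array([4, 6, 2, 1, 1])  # default integer dtype is int64 on 64-bit platforms
print(n.dtype)

lv = at.lvector()  # int64, accepts n as-is
f = aesara.function([lv], lv.sum())
print(f(n))
# an at.ivector() (int32) input would reject n unless it is cast with n.astype('int32')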

Thanks again. I’ll keep trying to make your suggestion work. I also edited my original post; it had some issues, namely in the arguments to DensityDist.

So I think this is what I want, but it’s throwing an exception I can’t seem to fix.

import numpy as np
import aesara
import aesara.tensor as at
import pymc as pm

_rates = at.dvector()
_n_trials = at.lvector()

def vargeometric(n, cursor, _rates):
    return pm.math.log(_rates[cursor + n - 1]) + pm.math.log(1 - _rates[cursor:cursor+n - 1]).sum()

result, _ = aesara.scan(
    vargeometric,
    sequences=[_n_trials, at.concatenate([at.zeros(1, dtype='int'), _n_trials[:-1]]).cumsum()],
    non_sequences=_rates
)

f_logp = aesara.function([_n_trials, _rates], result)

# PyMC model
n_trials = np.array([len(p) for p in probabilities])

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)

    rates = pm.Beta(
        "rates",
        alpha=alpha,
        beta=beta,
        size=(n_trials.sum(), ),
    )

    likelihood = pm.DensityDist(
        "n_trials", rates, logp=f_logp, observed=n_trials
    )

    posterior = pm.sample(return_inferencedata=True)

Exception is:

TypeError: Bad input argument with name "n_trials" to aesara function with name "/tmp/ipykernel_3361823/1887618496.py:17" at index 0 (0-based).  
Backtrace when that variable is created:

[...]
    _n_trials = at.lvector()
Expected an array-like object, but found a Variable: maybe you are trying to call a function on a (possibly shared) variable instead of a numeric array?

Any ideas what I’m doing wrong?

I think my example was unhelpful; I only meant to show how one can implement the scan in Aesara using dummy variables. When you actually implement it, you don’t want to compile the Aesara function first, otherwise you might as well just write it in NumPy and skip the headache. I guess the difference between pm.Potential and pm.DensityDist is that one takes a symbolic tensor and the other takes a function? I’m not sure; it was quite late when I posted yesterday, and using pm.Potential didn’t occur to me. Taking that approach, here’s what I came up with:

import numpy as np
import aesara
import aesara.tensor as at
import pymc as pm
from scipy import stats
import arviz as az

def gen_geometric_rvs(rates, n_rv):
    """Draw n_rv geometric-like outcomes, consuming rates from the front of `rates`."""
    n_trials = np.full(n_rv, np.nan)
    idx = 0

    for i in range(n_rv):
        for n, p in enumerate(rates[idx:], start=1):
            is_success = stats.bernoulli(p=p).rvs()
            idx += 1
            if is_success:
                n_trials[i] = n
                break

    return n_trials.astype('int')

def vargeometric(n, cursor, rates):
    return pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - rates[cursor:cursor+n - 1]).sum()

true_alpha = 0.5
true_beta = 1.4

# I draw many more rates than needed to make sure gen_geometric_rvs doesn't have any nans
many_rates = stats.beta(a=true_alpha, b=true_beta).rvs(100)

n_trials = gen_geometric_rvs(many_rates, 5)
cursors = np.r_[[0], n_trials[:-1]].cumsum()

# Recover only the rates that were actually used, to compare with the MCMC result
true_rates = [many_rates[cursor:cursor + n] for n, cursor in zip(n_trials, cursors)]

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)

    rates = pm.Beta(
        "rates",
        alpha=alpha,
        beta=beta,
        size=(n_trials.sum(), ),
    )
    
    logp_array, _ = aesara.scan(vargeometric,
                            sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
                            non_sequences=rates)

    likelihood = pm.Potential('likelihood', logp_array.sum())

    trace = pm.sample(cores=6)

Here’s what I got as results.

Thanks a lot for this. I managed to make it work using DensityDist, since the number of trials is actually observed and can therefore be passed as observed data. The trick was to wrap the scan in a simple function, like so:

n_trials = np.asarray([len(p) for p in probabilities])
cursors = np.r_[[0], n_trials[:-1]].cumsum()


def vargeometric(n, cursor, rates):
    return pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - rates[cursor:cursor + n - 1]).sum()

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)

    rates = pm.Beta(
        "rates",
        alpha=alpha,
        beta=beta,
        size=(n_trials.sum(), ),
    )

    def logp(value, rates):
        logp_array, _ = aesara.scan(
            vargeometric,
            sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
            non_sequences=rates
        )
        return logp_array

    likelihood = pm.DensityDist('likelihood', rates, logp=logp, observed=n_trials)

    trace = pm.sample(cores=6)

I guess I need to learn a bit more about Aesara and how it fits within PyMC, e.g. why the scan needs to be wrapped in a function rather than passed directly.
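
One thing that helped me convince myself the scan does what I expect: evaluate it outside the model with a dummy rate vector (a small sketch reusing n_trials, cursors, and vargeometric from the code above):

import numpy as np
import aesara
import aesara.tensor as at

# constant dummy rates, one per trial, just to exercise the scan
dummy_rates = at.as_tensor_variable(np.full(n_trials.sum(), 0.3))

logp_array, _ = aesara.scan(
    vargeometric,
    sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
    non_sequences=dummy_rates,
)
print(logp_array.eval())  # one log-probability per observed RV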

Thanks again for helping with this issue. Much appreciated.


Great! I’m glad it works. n_trials is also treated as observed in my example; it’s just used directly in the computation rather than declared via an observed keyword. I added the true values to the rate plots to evaluate how well the model was doing; we can only do that because it’s a synthetic example, of course. I was just curious how estimating 12 parameters from 5 data points would go.

Hi,

I’m posting this as a follow-up to the above because it’s very closely related. I’ve slightly complicated the problem by modifying the input to the scan function. Instead of iterating over a simple tensor (the Beta above), I’m now iterating over this:

rates = pm.math.invlogit(intercept + flat_probabilities * weight)

where intercept and weight are random variables. Somehow, the input to the vargeometric function seems to have a different shape, or something along those lines. The code below does not work (note that I’ve removed the Beta RV from what I had above):

import numpy as np
import aesara as ae
import aesara.tensor as at
import pymc as pm

n_trials = [len(p) for p in probabilities]
cursors = np.r_[[0], n_trials[:-1]].cumsum()

flat_probabilities = np.concatenate(probabilities)

def vargeometric(n, cursor, rates):
    return pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - rates[cursor:cursor + n - 1]).sum()

with pm.Model() as model:
    weight = pm.Normal('weight', mu=0, sigma=1)
    intercept = pm.Normal('intercept', mu=0, sigma=1)

    rates = pm.math.invlogit(intercept + flat_probabilities * weight)

    def logp(_, probabilities):
        trials, _ = ae.scan(
            vargeometric,
            sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
            non_sequences=[probabilities],
        )
        return trials

    likelihood = pm.DensityDist('likelihood', rates, logp=logp, observed=n_trials)

    posterior = pm.sample()

The error is:

ERROR (aesara.graph.opt): Optimization failure due to: local_IncSubtensor_serialize
ERROR (aesara.graph.opt): node: Elemwise{add,no_inplace}(IncSubtensor{Inc;int64}.0, IncSubtensor{Inc;int64:int64:}.0)
ERROR (aesara.graph.opt): TRACEBACK:
ERROR (aesara.graph.opt): Traceback (most recent call last):
  File "/home/hdemers/.virtualenvs/grubhub3.7/lib/python3.7/site-packages/aesara/graph/opt.py", line 1850, in process_node
    replacements = lopt.transform(fgraph, node)
  File "/home/hdemers/.virtualenvs/grubhub3.7/lib/python3.7/site-packages/aesara/graph/opt.py", line 1055, in transform
    return self.fn(fgraph, node)
  File "/home/hdemers/.virtualenvs/grubhub3.7/lib/python3.7/site-packages/aesara/tensor/subtensor_opt.py", line 1216, in local_IncSubtensor_serialize
    assert mi.owner.inputs[0].type.is_super(tip.type)
AssertionError

AssertionError: 
Apply node that caused the error: for{cpu,scan_fn&grad_of_scan_fn}(TensorConstant{5}, TensorConstant{[ 3  9 11 12 13]}, TensorConstant{[ 0  4 10 12 13]}, TensorConstant{[13 12 11  9  3]}, TensorConstant{[1 1 2 6 4]}, TensorConstant{[13 12 10  4  0]}, TensorConstant{(2, 14) of 0.0}, TensorConstant{5}, Elemwise{Composite{sigmoid((i0 + (i1 * i2)))}}.0)
Toposort index: 6
Inputs types: [TensorType(int64, ()), TensorType(int64, (5,)), TensorType(int64, (5,)), TensorType(int64, (5,)), TensorType(int64, (5,)), TensorType(int64, (5,)), TensorType(float64, (2, 14)), TensorType(int64, ()), TensorType(float64, (None,))]

If I modify the vargeometric function to be:

def vargeometric(n, cursor, rates):
    return pm.math.log(rates) + pm.math.log(1 - rates).sum()

then things work properly. The only difference is the slicing of rates. That’s why I think rates must somehow have a different shape, or something along those lines.

Any ideas what I’m doing wrong?

To quickly test the above, you may use the following values for probabilities:

probabilities = [
    [0.06380809748241796, 0.005120440394294163, 0.00036320674018044583, 0.5013973897482982],
    [0.2374231291931564, 0.007974766903374373, 0.11051142865327004, 0.1323143357203916, 2.3093956184704353e-05, 0.624579783931357],
    [0.0014815283706698023, 0.01916080838797347],
    [0.6965125565677626],
    [0.5725363781140356],
]
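
In case it helps, here’s how I’ve been trying to inspect what vargeometric actually receives, using aesara.printing.Print as a run-time debugging aid (a sketch only; I’m not sure the print even fires before the optimizer error):

from aesara.printing import Print

def vargeometric_debug(n, cursor, rates):
    # same log-probability as vargeometric, but print the shape of the slice
    # each time a scan step executes
    block = Print('rates slice', attrs=('shape',))(rates[cursor:cursor + n - 1])
    return pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - block).sum()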

I think it might be related to the fact that you are constructing rates from a non-random variable (flat_probabilities). In effect, you now have two observed quantities, so you would need to adjust the logp function to reflect that. When I replaced the given flat_probabilities with random variables, I was able to sample the model again.
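
To be concrete about what I tried (a rough sketch only, with my own placeholder priors, reusing vargeometric, n_trials, cursors and flat_probabilities from your post): draw the per-trial probabilities as latent Beta variables instead of baking in the fixed NumPy array.

import aesara
import aesara.tensor as at
import pymc as pm

with pm.Model() as model:
    weight = pm.Normal('weight', mu=0, sigma=1)
    intercept = pm.Normal('intercept', mu=0, sigma=1)

    # latent, unobserved probabilities in place of the fixed flat_probabilities array
    latent_probs = pm.Beta('latent_probs', alpha=1, beta=1, size=(len(flat_probabilities),))
    rates = pm.math.invlogit(intercept + latent_probs * weight)

    def logp(_, probs):
        trials, _ = aesara.scan(
            vargeometric,
            sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
            non_sequences=[probs],
        )
        return trials

    likelihood = pm.DensityDist('likelihood', rates, logp=logp, observed=n_trials)
    posterior = pm.sample()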

Interesting. And many thanks for following up.

Why does having two observed values cause this behavior? Or, in other words, how would I need to change logp to account for it?

In the following, flat_probabilities is passed as observed to a Beta. And you’re right: if I don’t observe flat_probabilities, it works.

n_trials = np.array([len(p) for p in probabilities])
cursors = np.r_[[0], n_trials[:-1]].cumsum()

flat_probabilities = np.concatenate(probabilities)

def vargeometric(n, cursor, rates):
    return pm.math.log(rates[cursor + n - 1]) + pm.math.log(1 - rates[cursor:cursor + n - 1]).sum()

with pm.Model() as model:
    alpha = pm.HalfNormal("alpha", sigma=1)
    beta = pm.HalfNormal("beta", sigma=1)
    weight = pm.Normal('weight', mu=0, sigma=1)
    intercept = pm.Normal('intercept', mu=0, sigma=1)

    observed_rates = pm.Beta(
        "observed_rates",
        alpha=alpha,
        beta=beta,
        observed=flat_probabilities,  ## Commenting out this line works
        size=(n_trials.sum(), ),
    )

    rates = pm.math.invlogit(intercept + observed_rates * weight)

    def logp(_, probabilities):
        trials, _ = ae.scan(
            vargeometric,
            sequences=[at.as_tensor_variable(n_trials), at.as_tensor_variable(cursors)],
            non_sequences=[probabilities],
        )
        return trials

    likelihood = pm.DensityDist('rates', rates, logp=logp, observed=n_trials)

    posterior = pm.sample()