Hi,
I’m trying to model several “geometric-like” random variables, d_i, where each trial’s probability p_j is different and sampled from a Beta distribution. The probability of such a “geometric” random variable is given by:
where n is the number of trials until a first success. In my case, this number is observed.
I have several such random variables d_i, each with many p_j sampled from the same beta distribution. Thus, I’m trying to write a model to fit the parameters of the beta distribution, a and b:
This process can be simulated with the following code:
from numpy.random import default_rng
rng = default_rng()
n_rv = 5
probabilities = []
a = 0.4
b = 1.6
for i in range(n_rv):
ps = []
while True:
p = rng.beta(a, b)
ps.append(p)
is_success = rng.binomial(1, p)
if is_success:
break
probabilities.append(ps)
print(f"number of trial for each RV: {[len(rv) for rv in probabilities]}")
My main problem is the fact that the number of trials until a first success is different for each RV. And I don’t know how to model that using PyMC. The following work, but that’s modeling a single RV:
import pymc as pm
def vargeometric_logp(_, probabilities):
return pm.math.log(probabilities[-1]) + (pm.math.log(1 - probabilities[:-1])).sum()
n_trials = [len(probabilities[0])]
with pm.Model() as model:
alpha = pm.HalfNormal("alpha", sigma=1)
beta = pm.HalfNormal("beta", sigma=1)
rates = pm.Beta(
"rates",
alpha=alpha,
beta=beta,
size=n_trials[0],
)
likelihood = pm.DensityDist(
"n_trials", rates, logp=vargeometric_logp, observed=n_trials
)
posterior = pm.sample(return_inferencedata=True)
The only way I found of modeling multiple RVs is with this extremelly inefficient code where I create a list of RVs:
n_trials = [len(p) for p in probabilities]
with pm.Model() as model:
alpha = pm.HalfNormal("alpha", sigma=1)
beta = pm.HalfNormal("beta", sigma=1)
rates = [
pm.Beta(
f"rates_{index}",
alpha=alpha,
beta=beta,
size=n
)
for index, n in enumerate(n_trials)
]
likelihoods = [
pm.DensityDist(
f"n_trials_{index}", probs, logp=vargeometric_logp, observed=[n]
)
for index, (n, probs) in enumerate(zip(n_trials, rates))
]
posterior = pm.sample(return_inferencedata=True)
How can I improve on this? Since each row of probabilities
has a different length, I’m having a hard time modelling this. As the title implies, I’ve considered ragged matrices, but I don’t know how to use these within PyMC, as they don’t seem to be supported.
Thanks in advance!