Modelling Repeated Experiments until Success

Hey everyone!

I have some trouble formulating a model in PyMC. I’m trying to model a time-series that is obtained from an experiment. The dynamics are straightforward to generate. What’s giving me issues is the fact that in the experiment, some of the tries are not actually observed.

Given a trajectory (time-series), I can give a probability that the experiment will be discarded and repeated. So the experiment will always come up with a time-series, but it is not possible to tell how many time-series have been “generated” and rejected before that, because they were not recognized as proper data sets.

I need to model this rejection somehow, as the observed time-series are biased towards those that have a low probability of being rejected. I know how to do that without PyMC: in that case I numerically integrate over all possible trajectories to determine the marginal probability of rejecting a time-series. The likelihood becomes p(time-series | not rejected) = p(time-series, not rejected) / (1 - p(rejected)).
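For concreteness, here is a minimal sketch of that correction outside PyMC. The random-walk dynamics and the rejection rule are just illustrative stand-ins for the real experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_trajectory(theta, T=50):
    # toy dynamics (illustrative only): Gaussian random walk with drift theta
    return np.cumsum(theta + rng.normal(size=T))

def p_reject(ts):
    # illustrative per-trajectory rejection probability:
    # trajectories that stay low are more likely to be discarded
    return 1.0 / (1.0 + np.exp(ts.max()))

def logp_trajectory(ts, theta):
    # log-density of the random-walk increments under the dynamics
    increments = np.diff(ts, prepend=0.0)
    return -0.5 * np.sum((increments - theta) ** 2 + np.log(2 * np.pi))

def corrected_loglik(ts_obs, theta, n_mc=5_000):
    # Monte-Carlo estimate of the marginal rejection probability p(rejected)
    p_rej = np.mean([p_reject(simulate_trajectory(theta)) for _ in range(n_mc)])
    # log p(ts | not rejected) = log p(ts) + log p(accept | ts) - log(1 - p(rejected))
    return (logp_trajectory(ts_obs, theta)
            + np.log1p(-p_reject(ts_obs))
            - np.log1p(-p_rej))
```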

I’m wondering if there’s something I’m missing with PyMC: can I formulate a model that directly generates the time-series under this observation bias? I’m currently not seeing how this can be done in PyMC without manually calculating the 1/(1 - p(rejected)) correction.

If you can write down the data generating process, you can use pm.CustomDist to automatically infer the log probability of the data, given that process. That would take the correction into account.
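For example (a minimal sketch, with a Gaussian standing in for your actual process):

```python
import numpy as np
import pymc as pm

def dist(mu, sigma, size):
    # generative graph built from pytensor ops / .dist() variables;
    # replace with your actual data generating process
    return pm.Normal.dist(mu, sigma, size=size)

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.CustomDist("obs", mu, sigma, dist=dist, observed=np.array([0.1, -0.4, 0.7]))
    idata = pm.sample()
```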

It sounds like it’s a truncation problem though? Maybe you could get away with just using pm.Truncated on the time series distribution you’re interested in.
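Something like this, assuming (just as an illustration) a Normal base distribution and rejection of everything below zero:

```python
import numpy as np
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    base = pm.Normal.dist(mu, 1)  # the distribution you're interested in
    # conditions the likelihood on obs > 0, normalizing by 1 - P(obs <= 0)
    pm.Truncated("obs", base, lower=0, observed=np.array([0.3, 1.1, 0.7]))
```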

Unfortunately it’s not a truncation; I tried that initially. The data likelihood factorizes into the physical process and the rejection separately. Also, the time-series are of varying length, and I need to marginalize over all possible lengths. You really need to calculate p = p(accept) + p(accept)*p(reject) + p(accept)*p(reject)^2 + … to get the correct answer. This geometric series sums to p(accept) / (1 - p(reject)), which is exactly the 1/(1 - p(rejected)) correction from above.

If I understand pm.CustomDist correctly, this would then just mean running a simple Markov chain sampler that doesn’t use gradients, right? In that case, I already have that implemented myself. I think PyMC might unfortunately not be the right tool for this.

CustomDist provides gradients as long as you use pytensor operations to implement either the generative graph (via the dist argument) or the log-probability directly (via the logp function). You don’t need to provide both, and dist is preferred because you get everything: logp and forward sampling. Examples here.
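As a concrete toy illustration of a logp-only CustomDist with the acceptance correction baked in, written entirely in pytensor ops so NUTS gets gradients: take a scalar x ~ Normal(mu, 1) that is accepted with probability Phi(x); the marginal acceptance probability then has the closed form Phi(mu / sqrt(2)). The Gaussian and the Phi(x) acceptance rule are assumptions for the sketch; your time-series case would replace both pieces, but the structure is the same:

```python
import numpy as np
import pymc as pm
import pytensor.tensor as pt

def std_normal_logcdf(z):
    # log Phi(z) via the complementary error function
    return pt.log(0.5 * pt.erfc(-z / pt.sqrt(2.0)))

def logp(value, mu):
    # log p(x | accepted) = log N(x; mu, 1) + log Phi(x) - log Phi(mu / sqrt(2))
    log_process = -0.5 * (value - mu) ** 2 - 0.5 * pt.log(2 * np.pi)
    log_accept = std_normal_logcdf(value)  # per-draw acceptance probability
    log_marginal_accept = std_normal_logcdf(mu / pt.sqrt(2.0))  # 1 - p(rejected)
    return log_process + log_accept - log_marginal_accept

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.CustomDist("x", mu, logp=logp, observed=np.array([0.5, 1.2, 2.0]))
    idata = pm.sample()  # NUTS, with gradients through the correction
```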

Integrating over the varying lengths should be possible in principle, but in practice working with ragged outputs is hard. That’s assuming the output of the distribution you’re interested in is an entire time series, and not some summary of it. Hard to say without knowing details.
