Modelling Repeated Experiments until Success

Canonical · March 12, 2024, 8:37pm

Hey everyone!

I’m having some trouble formulating a model in PyMC. I’m trying to model a time-series that is obtained from an experiment. The dynamics are straightforward to generate. What’s giving me issues is that some of the tries in the experiment are never actually observed.

Given a trajectory (time-series), I can give a probability that the experiment will be discarded and repeated. So, the experiment will always come up with a time-series, but it is not possible to tell how many time-series have been “generated” and rejected because they were not recognized as a proper data set.

I need to model this rejection somehow, as the observed time-series are biased towards those that have a low probability of being missed. I know how to do that without PyMC: in that case I numerically integrate over all possible trajectories to determine the marginal probability of missing a time-series. The likelihood becomes p(time-series, not rejected) / (1 - p(rejected)).
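
For reference, a sketch of that likelihood in symbols. The notation is assumed here for illustration and does not come from any existing code: x is one trajectory, θ the model parameters, and a(x) the probability that a generated trajectory is accepted rather than discarded. The integral is understood to include a sum over trajectory lengths.

```latex
\begin{aligned}
p(\text{rejected} \mid \theta)
  &= \int p(x' \mid \theta)\,\bigl(1 - a(x')\bigr)\,\mathrm{d}x' \\
p(x \mid \text{observed}, \theta)
  &= \sum_{k=0}^{\infty} p(\text{rejected} \mid \theta)^{k}\; p(x \mid \theta)\, a(x)
   = \frac{p(x \mid \theta)\, a(x)}{1 - p(\text{rejected} \mid \theta)}
\end{aligned}
```

On the log scale this is log p(x | θ) + log a(x) - log(1 - p(rejected | θ)), so the correction is one extra term per observed time-series, and that term depends only on θ.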

I’m wondering if there’s something I’m missing with PyMC and if I can formulate a model that directly generates the time-series under the observation-bias. I’m currently not seeing how this can be done in PyMC without manually calculating the 1/(1-p(rejected)) correction.

jessegrabowski · March 13, 2024, 11:53am

If you can write down the data generating process, you can use pm.CustomDist to automatically infer the log probability of data, given that process. That would take the correction into consideration.

It sounds like it’s a truncation problem though? Maybe you could get away with just using pm.Truncated on the time series distribution you’re interested in.

Canonical · March 13, 2024, 1:13pm

Unfortunately it’s not a truncation, I tried that initially. The data likelihood factorizes into the physical process and the rejection separately. Also, the time series are of varying length and I need to marginalize over all possible lengths. You really need to calculate p = p(accept) + p(accept)*p(reject) + p(accept)*p(reject)^2 + … to get the correct answer.
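
To make that series concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption for illustration: the three "trajectories", their probabilities `p_x`, and the acceptance probabilities `a_x` are made up, and plain Python stands in for the real numerical integration. It sums the series term by term, compares it with the closed form p(x) a(x) / (1 - p(reject)), and checks both against a repeat-until-accepted simulation.

```python
import random

# Hypothetical toy setup: three possible trajectories with generation
# probabilities p(x) and acceptance probabilities a(x). Values are made up.
p_x = {"short": 0.5, "medium": 0.3, "long": 0.2}
a_x = {"short": 0.2, "medium": 0.6, "long": 0.9}

# Marginal probability that a single try is rejected (sum over all trajectories).
p_reject = sum(p_x[x] * (1.0 - a_x[x]) for x in p_x)


def series_likelihood(x, n_terms=200):
    """Sum p(x) a(x) * p_reject**k for k = 0 .. n_terms - 1 unseen rejected tries."""
    joint = p_x[x] * a_x[x]
    return sum(joint * p_reject**k for k in range(n_terms))


def closed_form_likelihood(x):
    """Closed form of the geometric series: p(x) a(x) / (1 - p_reject)."""
    return p_x[x] * a_x[x] / (1.0 - p_reject)


def simulate(n_experiments, seed=0):
    """Repeat each experiment until a try is accepted; return observed frequencies."""
    rng = random.Random(seed)
    names = list(p_x)
    weights = [p_x[x] for x in names]
    counts = dict.fromkeys(names, 0)
    for _ in range(n_experiments):
        while True:
            x = rng.choices(names, weights=weights)[0]
            if rng.random() < a_x[x]:  # accepted: this is the recorded time-series
                counts[x] += 1
                break
    return {x: counts[x] / n_experiments for x in names}


if __name__ == "__main__":
    freqs = simulate(100_000)
    for x in p_x:
        print(x, series_likelihood(x), closed_form_likelihood(x), freqs[x])
```

In this toy case the "short" trajectory is generated most often but observed least often, which is the bias the correction has to account for.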

If I understand pm.CustomDist correctly, this would then just mean running a simple Markov Chain sampler that’s not using the gradients, right? In that case, I already have that implemented myself. I think PyMC might unfortunately not be the right tool for this.

jessegrabowski · March 13, 2024, 1:33pm

CustomDist provides gradients as long as you use pytensor operations to implement either the generative graph (via the dist argument) or the log probability (via the logp argument). You don’t need to provide both, and dist is preferred because you get everything – logp and forward sampling. Examples here.

Integrating over the varying lengths should be possible in principle, but in practice working with ragged outputs is hard. That’s assuming the output of the distribution you’re interested in is an entire time series, and not some summary of it. Hard to say without knowing details.
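
One common way around ragged data is to pad every series to the same length and carry a mask, so the log-likelihood is a sum over a rectangular array. The sketch below shows the idea in plain Python rather than pytensor, and it assumes a Gaussian random walk as stand-in dynamics; the function names are hypothetical and nothing here is specific to the model in this thread.

```python
import math


def gaussian_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2.0 * math.pi * sd**2) - (x - mean) ** 2 / (2.0 * sd**2)


def pad_series(series_list, pad_value=0.0):
    """Pad ragged series to a common length and return (padded, mask)."""
    max_len = max(len(s) for s in series_list)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in series_list]
    mask = [[1.0] * len(s) + [0.0] * (max_len - len(s)) for s in series_list]
    return padded, mask


def masked_random_walk_logp(padded, mask, sd):
    """Sum the log density of random-walk increments, skipping padded entries."""
    total = 0.0
    for row, m in zip(padded, mask):
        for t in range(1, len(row)):
            # m[t] is 1.0 only if step t is real data, so padding contributes 0.
            total += m[t] * gaussian_logpdf(row[t] - row[t - 1], 0.0, sd)
    return total


def ragged_random_walk_logp(series_list, sd):
    """Reference implementation that loops over the ragged series directly."""
    return sum(
        gaussian_logpdf(s[t] - s[t - 1], 0.0, sd)
        for s in series_list
        for t in range(1, len(s))
    )


if __name__ == "__main__":
    data = [[0.0, 0.3, 0.1], [0.0, -0.2], [0.0, 0.5, 0.9, 1.0]]
    padded, mask = pad_series(data)
    print(masked_random_walk_logp(padded, mask, sd=0.5))
    print(ragged_random_walk_logp(data, sd=0.5))
```

The masked version gives the same number as the ragged loop, but its fixed shape is what makes it practical to express with array operations. It does not by itself handle the marginalization over unobserved lengths.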
