Modelling a time series of cumulative maxima

Thank you again so much!

Let me see if I understand everything correctly, please bear with all my questions!

The marginal pdf

The pdf you provided for Y_n makes sense to me.

One easy way to see it is that

F_{Y_n}(o) = P(\max X_{1:n} \le o) = \prod_{1\le i \le n} P(X_i \le o) = [F_X(o)]^{n}

hence the pdf is f_{Y_n}(o) = F'_{Y_n}(o) = n [F_X(o)]^{n-1} f_X(o)
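As a sanity check on this formula, here is a quick Monte Carlo comparison of the empirical cdf of Y_n against [F_X(o)]^n for standard Normal X_i (plain numpy, independent of PyMC3; the variable names are just illustrative):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, trials = 5, 200_000

# draw `trials` independent samples of Y_n = max(X_1, ..., X_n)
X = rng.standard_normal((trials, n))
Yn = X.max(axis=1)

# standard Normal cdf via the error function
Phi = lambda o: 0.5 * (1 + erf(o / sqrt(2)))

for o in (0.0, 1.0, 2.0):
    empirical = (Yn <= o).mean()
    analytic = Phi(o) ** n  # [F_X(o)]^n
    print(f"o={o}: empirical={empirical:.4f}  analytic={analytic:.4f}")
```

The two columns agree to Monte Carlo error, which matches the derivation above.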

The joint pdf

I am a bit confused: I am not sure whether you meant that my math is wrong because of a specific mistake, or because what we need is the pdf instead of the cdf.

If the latter, can’t we derive the joint pdf by taking the partial derivatives of the cdf as usual?

Mixing random and continuous variables

According to this NIPS paper, NUTS cannot be applied to mixed random variables.
About Metropolis I am confused, though this post says it should work.

I am actually quite confused about which sampling methods are implemented in PyMC3 and what I am doing when calling e.g. pm.sample(1000, step=[pm.Metropolis(), pm.NUTS()]). Is that sampling first with Metropolis and then with NUTS, or doing a combination of both?

I am also confused about whether the distribution we want to model is mixed. Clearly the conditional distribution Y_n | Y_{n-1} is, as you show in your answer. But what about the joint distribution of Y_{1:n}? And if it isn't, and we can find the pdf or approximate it numerically, could we get it working in PyMC3?
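To spell out why the conditional is mixed (my reading of your answer, using Y_n = max(Y_{n-1}, X_n)):

```latex
P(Y_n \le y \mid Y_{n-1} = x) =
\begin{cases}
0 & y < x \\
F_X(y) & y \ge x
\end{cases}
```

so conditional on Y_{n-1} = x there is an atom of mass F_X(x) at y = x (no new record) plus a continuous density f_X(y) on y > x (a new record).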

The code

I tried implementing your suggestion as follows:

import numpy as np
import pymc3 as pm

dist = pm.Normal.dist
N = len(obss)
# log f_{Y_n}(y) = log N + (N - 1) log F_X(y) + log f_X(y)
logp = lambda mu, sigma, value: np.log(N) + (N - 1) * dist(mu, sigma).logcdf(value).sum() + dist(mu, sigma).logp(value)

with pm.Model() as model:
  mu = pm.Normal('mu', mu=10, sigma=5)
  sigma = pm.Exponential('sigma', lam=1)

  # This shows the model only the last observation. N is the number of X_i that 
  # you've drawn leading up to Y_n, which will show up in your custom distribution definition
  Yn = pm.DensityDist("Yn", logp, observed= {'mu':mu, 'sigma': sigma, 'value':obss[-1]})

  trace = pm.sample(1000, step = [pm.Metropolis(), pm.NUTS()])
  pm.traceplot(trace, ['mu', 'sigma', 'Yn'])

However, I am getting an error: MissingInputError: Undeclared input. What am I doing wrong?

Assuming fake independence
I wonder whether we could get sensible-enough results by pretending that each Y_i is independent and feeding that into the model. This would bias the parameters upwards, but maybe it is manageable. What are your thoughts?
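To make the suggestion concrete, here is a sketch of that pseudo-likelihood in plain numpy (assuming Normal attempts; fake_indep_loglik and the helper names are just illustrative). Each Y_i is scored as if it were an independent draw of the max of i attempts, i.e. log f_{Y_i}(y_i) = log i + (i-1) log F_X(y_i) + log f_X(y_i), summed over i:

```python
import numpy as np
from math import erf, sqrt

def normal_logpdf(y, mu, sigma):
    y = np.asarray(y, dtype=float)
    return -0.5 * np.log(2 * np.pi) - np.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2

def normal_logcdf(y, mu, sigma):
    # erf-based Normal cdf; accurate enough away from the extreme left tail
    z = (np.asarray(y, dtype=float) - mu) / sigma
    return np.log(0.5 * (1 + np.vectorize(erf)(z / sqrt(2))))

def fake_indep_loglik(ys, mu, sigma):
    # pretend Y_i is an independent max of i attempts:
    # sum_i [ log i + (i - 1) log F_X(y_i) + log f_X(y_i) ]
    i = np.arange(1, len(ys) + 1)
    return np.sum(np.log(i)
                  + (i - 1) * normal_logcdf(ys, mu, sigma)
                  + normal_logpdf(ys, mu, sigma))
```

For a single observation this reduces to the ordinary Normal log-likelihood, and the same expression could presumably be dropped into a DensityDist once rewritten with theano ops instead of numpy.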

The data
My eventual goal is to model the world records of speedrunning competitions (where people attempt to beat a videogame as fast as possible) :slight_smile: The code to scrape the data is here, in case you are curious.

The exercise in the OP was a warm-up before I think about what distribution to use for each attempt, what priors to use, and how to model a minimum instead of a maximum.
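For the record, my understanding is that the minimum case mirrors the maximum derivation above (same iid assumptions on the X_i):

```latex
F_{\min X_{1:n}}(o) = 1 - \prod_{1 \le i \le n} P(X_i > o) = 1 - [1 - F_X(o)]^{n},
\qquad
f_{\min X_{1:n}}(o) = n\,[1 - F_X(o)]^{n-1} f_X(o)
```

so only the survival function 1 - F_X takes the place of the cdf in the record-time likelihood.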