Apologies in advance for this wall of text, but I think it’s a pretty nice solution to an awesome problem. The first section responds to your questions; the second section shows my proposed solution.
Q&A
I am a bit confused because I am not sure whether you meant that my math is wrong because there is a specific mistake, or because what we need is the pdf instead of the cdf.
I took a second look and I stand corrected: I actually think your math is correct (although I only checked the expression for P(Y_2 \leq o_2 | Y_1 \leq o_1)).
If the latter, can’t we derive the joint pdf by taking the partial derivatives of the cdf as usual?
I think this is where the problem comes up. You end up taking derivatives of mins/maxes and the result gets very complicated. For a custom distribution you ultimately need the log pdf and not the CDF, so even granting that your formula is correct, I’m not sure how you would proceed.
An additional comment: while it might be possible to somehow figure out a formula for the joint CDF and then take partial derivatives, one of the observations I made was that you don’t actually need the full joint PDF/CDF, since
Y_{n+1} | \big( Y_{n} = y_n, Y_{n-1} = y_{n-1}, \ldots, Y_1 = y_1 \big) = Y_{n+1} | Y_n = y_n.
Because of the conditional independence structure of the sequence, you only need to know the PDFs/CDFs of the n conditional distributions f_{Y_{n+1}|Y_n} corresponding to Y_{n+1} | Y_n = y_n, rather than the full joint pdf f_{Y_{1:n+1}}. This greatly simplifies things compared to trying to work out the full joint pdf/cdf.
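Concretely, the joint log-density then factorizes into a sum of one-step conditional terms,

\log f_{Y_{1:n+1}}(y_1, \ldots, y_{n+1}; \theta) = \log f_{Y_1}(y_1; \theta) + \sum_{k=1}^{n} \log f_{Y_{k+1} | Y_k}(y_{k+1} \mid y_k; \theta),

so writing down the likelihood only requires those one-step conditional densities.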
According to this NIPS paper NUTS cannot be applied to mixed random variables.
I actually found this same paper when I was looking into this. I think there’s a bit of terminology overloading going on, however. In this paper by “mixed” they are referring to random vectors where some components are discrete and some components are continuous, and not “mixed” as in a single component is both discrete & continuous. You could potentially have a vector of mixed random variables, where each component was both continuous and discrete.
I am actually very confused about what different methods are implemented in PyMC3 and what I am doing when calling e.g. pm.sample(1000, step = [pm.Metropolis(), pm.NUTS()]). Is that sampling first with Metropolis and then with NUTS, or doing a combination of both?
From briefly looking at the source, I think this tries to assign a different step method to each variable of the posterior, in the order you provide them in the list. If you don’t provide as many step methods as there are random variables, it picks the best one for each remaining variable according to whether it is discrete or continuous. I might be wrong about what precisely is going on here, but my advice is not to mess with this.
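That said, if you do want to be explicit about which sampler handles which variable, I believe the clearer pattern in PyMC3 is to pass vars to each step method yourself rather than relying on list order. A minimal sketch with made-up variables (not your model):

```python
import pymc3 as pm

with pm.Model() as model:
    # toy model with one continuous and one discrete variable
    mu = pm.Normal("mu", mu=0.0, sd=1.0)
    k = pm.Poisson("k", mu=3.0)

    # tell each step method explicitly which variables it owns
    step_cont = pm.NUTS(vars=[mu])
    step_disc = pm.Metropolis(vars=[k])

    trace = pm.sample(1000, step=[step_cont, step_disc])
```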
Assuming fake independence
I don’t think this is a good idea but you could try it.
My Solution
I’m not entirely convinced that the approach I’m about to detail is correct, but I think it’s within an epsilon-ball of being correct. So I’ll go ahead and explain.
This post on stats StackExchange gives an excellent response to someone asking about the likelihood function of a mixed random variable. Translating this into our problem, we can express the CDF of the conditioned random variable Y_{n+1} | (Y_n = y_n, \theta) as
F_{Y_{n+1}|Y_n = y_n, \theta}(x) = F_{a, \theta}(x) + F_{d, \theta}(x) = \lambda(\theta) \int_{-\infty}^x f_a(t; \theta) \, dt + (1 - \lambda(\theta)) \sum_{t \leq x} f_d(t; \theta).
Here \theta is referring to the collection [\mu, \sigma] of the parameters of the underlying Normal distribution for each X_n. The first term in the sum above contains the contribution to the CDF from the continuous portion of the data, while the second term contains the contribution from the discrete part. The term \lambda(\theta) is the mixing probability, which you interpret as the probability that an observation came from the continuous part as opposed to the discrete part.
In the above, the density f_a(t;\theta) is just the density I mentioned in my initial answer:
f_a(t; \theta) = f_Z(t; \theta) = \frac{f_X(t; \theta)}{1 - F_X(y_n; \theta)} \, \chi_{t > y_n}
where \chi is just the indicator function. The discrete probability mass function f_d(t; \theta) is just the pmf of the “dumb” random variable that puts all of its mass on y_n. The mixing probability \lambda is actually just
\lambda(y_n, \theta) = 1 - F_X(y_n; \theta)
which was the probability of the biased coin landing on tails from my response.
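To make those two ingredients concrete, here is a rough scipy sketch of \log f_a and \lambda for the Normal case (the function names and signatures are mine, and y_prev plays the role of y_n):

```python
import numpy as np
from scipy import stats

def log_f_a(t, y_prev, mu, sigma):
    """log f_Z(t) = log f_X(t) - log(1 - F_X(y_prev)) for t > y_prev, and -inf otherwise."""
    if t <= y_prev:
        return -np.inf  # the indicator chi_{t > y_prev}
    return (stats.norm.logpdf(t, loc=mu, scale=sigma)
            - stats.norm.logsf(y_prev, loc=mu, scale=sigma))

def lam(y_prev, mu, sigma):
    """Mixing probability lambda(y_prev, theta) = P(X > y_prev) = 1 - F_X(y_prev)."""
    return stats.norm.sf(y_prev, loc=mu, scale=sigma)
```

Using logsf rather than log(1 - cdf) avoids losing precision when y_prev sits far out in the right tail.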
Now, that StackExchange answer proposed that the likelihood function could be computed as
\mathcal{L}(\mathcal{Y}; \theta) = \prod_{i \in C} \lambda(\theta) f_a(y_i; \theta) \,\, \prod_{i \in D} (1 - \lambda(\theta)) f_d(y_i; \theta)
where we split the data \mathcal{Y} = \{y_1, \ldots, y_n\} into two sets: C is the set of indices i for which the observation y_i does not equal y_{i-1} (we flipped tails and drew from the continuous distribution), and D is the set of indices i for which y_i = y_{i-1} (we flipped heads and stuck with the previous maximum y_{i-1}). Note that you always know whether y_i came from the discrete or the continuous draw just by looking at whether the running maximum changed. The explicit steps to do this are:
- Write a function for the logarithm of the density f_Z(z; \theta) I gave above.
- Assemble the list of \lambda(y_{i-1}, \theta)'s (one for each i \in C) and the list of (1 - \lambda(y_{i-1}, \theta))'s (one for each i \in D). These are two lists whose lengths sum to the number of observations. Take the logarithms of both lists, sum everything together, and call the result log_lambdas_sum.
- Write a master logp function that computes the logarithm of the likelihood I gave above. The log probability ends up being log_lambdas_sum plus the sum of \log f_a(y_i; \theta) over i \in C. We can ignore the discrete pmf terms, since for every i \in D the pmf f_d(y_i; \theta) is just 1, so they drop out. (There is a rough sketch of this right after the list.)
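To make those steps concrete, here is a rough numpy/scipy sketch of the master logp. It is a plain-Python illustration of the recipe above (all function and variable names are mine), not PyMC3-ready code:

```python
import numpy as np
from scipy import stats

def logp_running_max(y, mu, sigma):
    """Rough log-likelihood of the observed running maxima y_1, ..., y_n."""
    y = np.asarray(y, dtype=float)
    y_prev, y_curr = y[:-1], y[1:]
    moved = y_curr > y_prev  # True -> index in C (max increased), False -> index in D (max stayed)

    # log_lambdas_sum: log lambda(y_{i-1}) for i in C plus log(1 - lambda(y_{i-1})) for i in D
    log_lambdas_sum = (stats.norm.logsf(y_prev[moved], loc=mu, scale=sigma).sum()
                       + stats.norm.logcdf(y_prev[~moved], loc=mu, scale=sigma).sum())

    # continuous contributions: log f_a(y_i) = log f_X(y_i) - log(1 - F_X(y_{i-1})) for i in C
    log_f_a_sum = (stats.norm.logpdf(y_curr[moved], loc=mu, scale=sigma)
                   - stats.norm.logsf(y_prev[moved], loc=mu, scale=sigma)).sum()

    # y_1 = X_1 is an ordinary Normal draw, so it contributes the plain Normal log-density
    return stats.norm.logpdf(y[0], loc=mu, scale=sigma) + log_lambdas_sum + log_f_a_sum
```

To plug this into PyMC3 you would re-express the scipy calls with theano ops and wrap the result in something like pm.DensityDist, so that NUTS can differentiate the logp with respect to \mu and \sigma.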
Summary
I was probably too lazy and garbled some of the notation above, but hopefully you get a sense of how to calculate the likelihood. It turns out we can decompose the likelihood function for the mixed random variable into its discrete and continuous parts, and then evaluate the total likelihood by evaluating the individual likelihoods on the two partitioned sets of the data and weighting by the \lambda factors. I am working on a notebook that implements this but can’t promise when I’ll get around to finishing it, so if you can see the light at the end of the tunnel I encourage you to try this yourself!