Apologies in advance for this wall of text, but I think it’s a pretty nice solution to an awesome problem. The first section responds to your questions; the second section shows my proposed solution.
Q&A
I am a bit confused because I am not sure whether you meant that my math is wrong because there is a specific mistake, or because what we need is the pdf instead of the cdf.
I took a second look and I stand corrected: I actually think your math is correct (although I only checked the expression for P(Y_2 \leq o_2 | Y_1 \leq o_1)).
If the latter, can’t we derive the joint pdf by taking the partial derivatives of the cdf as usual?
I think this is where the problem comes up. You end up taking derivatives of mins/maxes and the result gets very complicated. For a custom distribution you ultimately need the log pdf and not the CDF, so even granting that your formula is correct, I’m not sure how you would proceed.
An additional comment: while it might be possible to somehow figure out a formula for the joint CDF and then take partial derivatives, one of the observations I made was that you don’t actually need the full joint PDF/CDF, since
Y_{n+1} | \big( Y_{n} = y_n, Y_{n-1} = y_{n-1}, \ldots, Y_1 = y_1 \big) = Y_{n+1} | Y_n = y_n.
Because of the conditional independence structure of the sequence, you only need to know the PDFs/CDFs of the n conditional distributions f_{Y_{n+1}|Y_n} corresponding to Y_{n+1} | Y_n = y_n, rather than the full joint pdf f_{Y_{1:n+1}}. This greatly simplifies things compared to trying to work out the full joint pdf/cdf.
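Concretely, the joint log-density then factorizes into a sum of one-step conditional terms,

\log f_{Y_{1:n+1}}(y_1, \ldots, y_{n+1}; \theta) = \log f_{Y_1}(y_1; \theta) + \sum_{k=1}^{n} \log f_{Y_{k+1} | Y_k}(y_{k+1} \mid y_k; \theta),

so writing down the likelihood only requires those one-step conditional densities.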
According to this NIPS paper NUTS cannot be applied to mixed random variables.
I actually found this same paper when I was looking into this. I think there’s a bit of terminology overloading going on, however. In this paper by “mixed” they are referring to random vectors where some components are discrete and some components are continuous, and not “mixed” as in a single component is both discrete & continuous. You could potentially have a vector of mixed random variables, where each component was both continuous and discrete.
I am actually very confused about what different methods are implemented in PyMC3 and what I am doing when calling e.g. pm.sample(1000, step = [pm.Metropolis(), pm.NUTS()]). Is that sampling first with Metropolis and then with NUTS, or doing a combination of both?
From briefly looking at the source, I think this tries to assign a different step method to each variable of the posterior, in the order you provide them in the list. If you don’t provide as many step methods as there are random variables, it picks the best one for each remaining variable according to whether it is discrete or continuous. I might be wrong about what precisely is going on here, but my advice is not to mess with this.
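That said, if you do want to be explicit about which sampler handles which variable, I believe the clearer pattern in PyMC3 is to pass vars to each step method yourself rather than relying on list order. A minimal sketch with made-up variables (not your model):

```python
import pymc3 as pm

with pm.Model() as model:
    # toy model with one continuous and one discrete variable
    mu = pm.Normal("mu", mu=0.0, sd=1.0)
    k = pm.Poisson("k", mu=3.0)

    # tell each step method explicitly which variables it owns
    step_cont = pm.NUTS(vars=[mu])
    step_disc = pm.Metropolis(vars=[k])

    trace = pm.sample(1000, step=[step_cont, step_disc])
```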
Assuming fake independence
I don’t think this is a good idea but you could try it.
My Solution
I’m not entirely convinced that the approach I’m about to detail is correct, but I think it’s within an epsilon-ball of being correct. So I’ll go ahead and explain.
This post on stats StackExchange gives an excellent response to someone asking about the likelihood function of a mixed random variable. Translating this into our problem, we can express the CDF of the conditioned random variable Y_{n+1} | (Y_n = y_n, \theta) as
F_{Y_{n+1}|Y_n = y_n, \theta}(x) = F_{a, \theta}(x) + F_{d, \theta}(x) = \lambda(\theta) \int_{-\infty}^x f_a(t; \theta) \, dt + (1 - \lambda(\theta)) \sum_{t \leq x} f_d(t; \theta).
Here \theta is referring to the collection [\mu, \sigma] of the parameters of the underlying Normal distribution for each X_n. The first term in the sum above contains the contribution to the CDF from the continuous portion of the data, while the second term contains the contribution from the discrete part. The term \lambda(\theta) is the mixing probability, which you interpret as the probability that an observation came from the continuous part as opposed to the discrete part.
In the above, the density f_a(t;\theta) is just the density I mentioned in my initial answer:
f_a(t; \theta) = f_Z(t; \theta) = \frac{f_X(t; \theta)}{1 - F_X(y_n; \theta)} \, \chi_{t > y_n}
where \chi is just the indicator function. The discrete probability mass function f_d(t; \theta) is just the pmf of the “dumb” random variable that puts all of its mass on y_n. The mixing probability \lambda is actually just
\lambda(y_n, \theta) = 1 - F_X(y_n; \theta)
which was the probability of the biased coin landing on tails from my response.
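To make those two ingredients concrete, here is a rough scipy sketch of \log f_a and \lambda for the Normal case (the function names and signatures are mine, and y_prev plays the role of y_n):

```python
import numpy as np
from scipy import stats

def log_f_a(t, y_prev, mu, sigma):
    """log f_Z(t) = log f_X(t) - log(1 - F_X(y_prev)) for t > y_prev, and -inf otherwise."""
    if t <= y_prev:
        return -np.inf  # the indicator chi_{t > y_prev}
    return (stats.norm.logpdf(t, loc=mu, scale=sigma)
            - stats.norm.logsf(y_prev, loc=mu, scale=sigma))

def lam(y_prev, mu, sigma):
    """Mixing probability lambda(y_prev, theta) = P(X > y_prev) = 1 - F_X(y_prev)."""
    return stats.norm.sf(y_prev, loc=mu, scale=sigma)
```

Using logsf rather than log(1 - cdf) avoids losing precision when y_prev sits far out in the right tail.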
Now, that StackExchange answer proposed that the likelihood function could be computed as
\mathcal{L}(\mathcal{Y}; \theta) = \prod_{i \in C} \lambda(\theta) f_a(y_i; \theta) \,\, \prod_{i \in D} (1 - \lambda(\theta)) f_d(y_i; \theta)
where we split the data \mathcal{Y} = \{y_1, \ldots, y_n\} into two sets: C is the set of indices i for which the observation y_i does not equal y_{i-1} (we flipped tails and drew from the continuous distribution), and D is the set of indices i for which y_i = y_{i-1} (we flipped heads and stuck with the previous maximum y_{i-1}). Note that you always know whether y_i came from the discrete or the continuous draw just by looking at whether the running maximum changed. The explicit steps to do this are:
- Write a function for the logarithm of the density f_Z(z; \theta) I gave above.
- Assemble the list of \lambda(y_{i-1}, \theta)'s (one for each i \in C) and the list of (1 - \lambda(y_{i-1}, \theta))'s (one for each i \in D). These are two lists whose lengths sum to the number of observations. Take the logarithms of both lists, sum everything together, and call the result log_lambdas_sum.
- Write a master logp function that computes the logarithm of the likelihood I gave above. The log probability ends up being log_lambdas_sum plus the sum of \log f_a(y_i; \theta) over i \in C. We can ignore the discrete pmf terms, since for every i \in D the pmf f_d(y_i; \theta) is just 1, so they drop out. (There is a rough sketch of this right after the list.)
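To make those steps concrete, here is a rough numpy/scipy sketch of the master logp. It is a plain-Python illustration of the recipe above (all function and variable names are mine), not PyMC3-ready code:

```python
import numpy as np
from scipy import stats

def logp_running_max(y, mu, sigma):
    """Rough log-likelihood of the observed running maxima y_1, ..., y_n."""
    y = np.asarray(y, dtype=float)
    y_prev, y_curr = y[:-1], y[1:]
    moved = y_curr > y_prev  # True -> index in C (max increased), False -> index in D (max stayed)

    # log_lambdas_sum: log lambda(y_{i-1}) for i in C plus log(1 - lambda(y_{i-1})) for i in D
    log_lambdas_sum = (stats.norm.logsf(y_prev[moved], loc=mu, scale=sigma).sum()
                       + stats.norm.logcdf(y_prev[~moved], loc=mu, scale=sigma).sum())

    # continuous contributions: log f_a(y_i) = log f_X(y_i) - log(1 - F_X(y_{i-1})) for i in C
    log_f_a_sum = (stats.norm.logpdf(y_curr[moved], loc=mu, scale=sigma)
                   - stats.norm.logsf(y_prev[moved], loc=mu, scale=sigma)).sum()

    # y_1 = X_1 is an ordinary Normal draw, so it contributes the plain Normal log-density
    return stats.norm.logpdf(y[0], loc=mu, scale=sigma) + log_lambdas_sum + log_f_a_sum
```

To plug this into PyMC3 you would re-express the scipy calls with theano ops and wrap the result in something like pm.DensityDist, so that NUTS can differentiate the logp with respect to \mu and \sigma.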
Summary
I was probably too lazy and garbled some of the notation above, but hopefully you get a sense of how to calculate the likelihood. It turns out we can decompose the likelihood function for the mixed random variable into its discrete and continuous parts, and then evaluate the total likelihood by evaluating the individual likelihoods on the two partitioned sets of the data and weighting by the \lambda factors. I am working on a notebook that implements this but can’t promise when I’ll get around to finishing it, so if you can see the light at the end of the tunnel I encourage you to try this yourself!