I recently tried figuring out how the ordered transform works and was rather horrified by what I saw here.
If I’m correct, it creates order by
a) sampling the lowest value from the given distribution
b) creating the rest of the values by adding cumsum(exp(v_i)) to that first value.
The problem is that this does not result in a distribution anywhere close to the given one, for most distributions.
I get this is a pragmatic way to create an ordered variable, but is there really no reasonable way of preserving the distribution or even getting close to doing so? Or am I fundamentally misunderstanding something?
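For reference, this is roughly what I understand the current backward map to be doing (a sketch from memory, not the actual implementation):

x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], value[..., 0])            # first element kept as-is
x = pt.set_subtensor(x[..., 1:], pt.exp(value[..., 1:]))  # the rest become positive increments
return pt.cumsum(x, axis=-1)                              # cumulative sum enforces the ordering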
As a first guess, I would assume something like this might get at least closer to preserving the original distribution:
x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], value[..., 0])
x = pt.set_subtensor(x[..., 1:], pt.abs(value[..., 1:] - value[..., :-1]))
return pt.cumsum(x, axis=-1)
Granted, abs is not differentiable at 0, but that single kink generally does not cause any issues, at least in my experience. And it should definitely be closer to distribution-preserving, as it does not balloon the values onto the exponential scale.
Edit: After some experimentation and simulation, the following looks pretty good for smaller values (<5), at least for the Normal, LogNormal and Exponential distributions I tested it with:
x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], pt.min(value, axis=-1))
x = pt.set_subtensor(x[..., 1:], pt.abs(value[..., 1:] - value[..., :-1]) / 2)
return pt.cumsum(x, axis=-1)
Not really, because ordering “distorts” several univariate RVs into a “strange” multivariate RV. Similarly, if you have a distribution over the whole real line and apply a “positive” (log) constraint, you can never respect the original distribution. That’s why ordered is never a default transform.
You may find this other material useful:
In that Colab you can see what 3 non-IID ordered Normals look like:
It’s clear that they can’t correspond to any marginal normals
If you apply an ordering constraint and its Jacobian, then apply a standard normal prior, you get the same distributions marginally as if you generate standard normal variates and sort them. This is what’s known as an order statistic and the order statistics do not have the same distribution as the variables that were sorted to generate them.
The Wikipedia article works out how sorting uniform(0, 1) variates leads to order statistics with beta distributions.
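For example, a quick simulation sketch (plain numpy/scipy, nothing PyMC-specific) shows the k-th of n sorted uniform(0, 1) draws matching a Beta(k, n - k + 1):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 5, 2                                           # n uniforms, look at the k-th smallest
u = np.sort(rng.uniform(size=(100_000, n)), axis=-1)
kth = u[:, k - 1]

# The k-th order statistic of n iid uniform(0, 1) draws is Beta(k, n - k + 1)
print(kth.mean(), stats.beta(k, n - k + 1).mean())  # both around k / (n + 1) = 0.33
print(kth.std(), stats.beta(k, n - k + 1).std())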
Yes, I get that you get different distributions if you order the variables, and I think I've even encountered the derivation for discrete distributions.
But this is kind of what I meant, i.e. I'd expect the distribution of pm.Normal(transform=ordered) to match the distribution of normals after they were ordered.
It is very clear the current solution is very far off from that. I'm just wondering if it is possible to have a better transform that at least comes close?
It’s the same for IID univariate RVs. The general order statistic is not something you can implement just via a transform; you need to actually define a new multivariate distribution, with the logp obtained from order statistics.
You should get the same “bad” draws with the abs or exp in the cumsum, because we add the respective jacobian correction for the cumsum of exp. Can you share the code where you are not getting the same?
Edit: Note that to draw from this you have to use pm.sample, since transforms play no role in pm.sample_prior_predictive or pm.draw
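For example, something along these lines (a sketch; the initval just gives the sampler an already ordered starting point):

import pymc as pm
from pymc.distributions.transforms import ordered

with pm.Model():
    x = pm.Normal("x", size=3, transform=ordered, initval=[-1.0, 0.0, 1.0])
    idata = pm.sample(random_seed=0)                    # posterior draws of x come out ordered
    prior = pm.sample_prior_predictive(random_seed=0)   # transform ignored: plain unordered Normal draws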
So you can see that alpha and beta have the same distribution whether you constrain the variable or sort an unconstrained variable. This is with 4K draws in 4 chains by default.
Note the pattern of standard deviations with the central element most constrained. It has a mean of 0 being in the middle, but its standard deviation is only 0.5 or so.
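A quick way to see that pattern (plain numpy sketch; 5 elements assumed here just for illustration):

import numpy as np

rng = np.random.default_rng(0)
sorted_draws = np.sort(rng.standard_normal((100_000, 5)), axis=-1)

print(sorted_draws.mean(axis=0).round(2))  # symmetric means, middle element near 0
print(sorted_draws.std(axis=0).round(2))   # middle element has the smallest std, around 0.5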
Ok, now I am completely lost. How can these two very different transforms end up giving the same result? I guess I need a basic explanation of how transforms work, because if I just took the pm.Normal myself and ran it through those two transformations, the results would definitely be different. So I am missing something, I guess. Can you maybe elaborate a bit on how this code being in the transform differs from just being done in the model?
Did you change the log_jac_det of the transform when you changed the forward and backward methods?
If two transforms map to the same constrained space and have the right log jacobian determinant correction the draws will be identical. For instance, to map from the real line to the positives you can use the log transform or the log_exp_m1, and you should get identical results, subject to sampler convergence: pymc/pymc/distributions/transforms.py at main · pymc-devs/pymc · GitHub
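For example (a sketch; both models should recover the same posterior for x up to sampling noise):

import pymc as pm
from pymc.distributions import transforms

with pm.Model():
    x = pm.HalfNormal("x", transform=transforms.log)
    idata_log = pm.sample(random_seed=0)

with pm.Model():
    x = pm.HalfNormal("x", transform=transforms.log_exp_m1)
    idata_lem1 = pm.sample(random_seed=0)

# Both transforms map the real line to the positives and carry their own
# jacobian correction, so the draws of x agree up to sampler convergence.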
The transform has 2 purposes: 1. map to a constrained space and 2. account for the distortion effect so the original density is still respected.
In the case of the ordered transform, as @bob-carpenter mentioned, you should get the same draws whether you use the transform or generate forward draws and sort them. The same happens with Stan.
(Re: which one to use, NUTS may not like sorting the forward draws so I wouldn’t suggest it).
You can see that in my Colab notebook where I compare sorting forward draws with sampling from the prior with pm.sample. If you don’t have IID variables, ordering is more complex than just sorting and is equivalent to rejection sampling of the forward draws that aren’t sorted. That’s shown in the later examples in the notebook.
The fact that it is equivalent to rejection sampling means it is not doing anything other than enforcing the constraint. The cumsum of exp is just a handy trick to enforce the constraint in a continuous differentiable way.
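To make "equivalent to rejection sampling" concrete, here is a plain numpy sketch for the IID case: keeping only the forward draws that happen to come out ordered gives the same distribution as just sorting them.

import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_normal((300_000, 3))

accepted = draws[np.all(np.diff(draws, axis=-1) > 0, axis=-1)]  # reject draws that aren't already ordered
sorted_draws = np.sort(draws, axis=-1)                          # or simply sort every draw

print(accepted.mean(axis=0), sorted_draws.mean(axis=0))  # agree up to Monte Carlo error
print(accepted.std(axis=0), sorted_draws.std(axis=0))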
The implications of the constraint are perhaps the most unintuitive aspect? You seem to want something that behaves differently than just sorting/rejection sampling. You will have to specify what you want precisely, but it doesn’t look like it can be achieved by writing a different transform that simply maps to an ordered space.
Going back to your transform, it doesn't look like it's invertible? The abs destroys the information about which item in each successive pair was larger?
I think you misunderstand me a bit. I do not have a specific problem I’m encountering.
I just looked at the transform and I can’t figure out how it can work without distorting the distribution.
Because to my very limited understanding, transforms are basically equivalent to just doing the same thing afterwards in the model, i.e. I would expect
o = pm.Normal('v',size=10,transform=ordered)
to behave the same as
u = pm.Normal('v',size=10)
v = pt.zeros(10)
v = pt.set_subtensor(v[0],u[0])
v = pt.set_subtensor(v[1:], pt.exp(u[1:]))
v = pt.cumsum(v)
And I was a bit taken aback because the latter would definitely not be what I expect when I write the former.
I'm now reading between the lines that there is in fact a difference, but how exactly it is different still eludes me. Or how the determinant of the Jacobian comes into play, for that matter.
The reason I'm asking is actually rather pragmatic: I would like to write an ordered transform that allows ordering along some other dimension (similar to how zerosum_transform works), because in an ideal world I would actually like to have an ordered set of multivariate normals. But I'm increasingly getting the feeling I'm a bit out of my depth here.
Most transforms should do nothing; they are an implementation detail for NUTS to sample on the real line while respecting constrained priors. For positive distributions we use the log transform, so NUTS samples on the log space and we back-transform to the positive line, and ADD a jacobian correction term (which you are missing in your mental model of transforms), so that it all ends up as if we sampled the constrained distribution directly.
I have a small notebook exploring interval transform for uniform distribution: Google Colab
For those 99% of cases it's just a technical, sampler-specific implementation detail.
However, the same trick was exploited to allow the sampler to sample on a space that is ordered after constraining. This however can't be (and isn't) used as the default transform of any distribution, because there's no common ordered distribution out there that users would be interested in.
However, nothing stops users from applying a transform for a specific space to a distribution (or set of distributions) that is not naturally aligned with that space. You can, for example, use a log transform on a normal distribution, and you will obviously not get a normal prior, since negative values will never be proposed (it will be a half normal instead).
The same thing is happening here with ordered, which is being applied to a set of distributions that do not live in an ordered space (they don’t even live in a multivariate space to begin with). This can be useful, so we don’t forbid it, but it’s also a common source of confusion.
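A sketch of that last example (the initval just keeps the starting point positive):

import pymc as pm
from pymc.distributions.transforms import log

with pm.Model():
    x = pm.Normal("x", 0, 1, transform=log, initval=1.0)
    idata = pm.sample(random_seed=0)

# Every draw of x is positive; the marginal matches a HalfNormal(1), not a Normal(0, 1).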
They are not. A log transform on a HalfNormal distribution:
x = pm.HalfNormal("x", transform=log)
Is equivalent to:
x_log__ = pm.Flat("x_log__")  # just a trick to get a parameter without a prior associated with it
x = pm.Deterministic("x", pm.math.exp(x_log__))
log_jac_det = x_log__  # = log(x); the log|d exp(u)/du| correction for the backward exp map
x_logp = pm.logp(pm.HalfNormal.dist(), x)
pm.Potential("x_logp", x_logp + log_jac_det)
It is not equivalent to
x_ = pm.HalfNormal("x_")
x = pm.Deterministic("x", pm.math.exp(x_))
This example helps a lot, thank you. And it makes your previous comments make sense too.
Basically, if I understand correctly, the idea of the transform is to have an invertible map from an unconstrained space to the actual constrained sample space, then have the sampler work in the unconstrained space while still tracking probabilities in the constrained space.
I have to admit I'm still a bit fuzzy on how exactly the Jacobian determinant plays into it, but I get that it accounts for the change in geometry that the map brings about.
The Jacobian is from the multidimensional change-of-variables formula. Determinants measure volume, so the determinant of a Jacobian measures change-in-volume.
If X \in \mathbb{R}^N is a random variable and Y = f(X) \in \mathbb{R}^N, then
p_Y(y) = p_X(f^{-1}(y)) \cdot | J_{f^{-1}}(y) |,
where J_{f^{-1}}(y) = \frac{\partial}{\partial y} f^{-1}(y), and | A | is the absolute determinant of A.
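For example, with the log transform discussed earlier in the thread: if X \sim \textrm{normal}(0, 1) and Y = f(X) = \exp(X), then f^{-1}(y) = \log y and | J_{f^{-1}}(y) | = 1 / y, so
p_Y(y) = p_X(\log y) \cdot \frac{1}{y},
which is exactly the standard lognormal density.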