I recently tried figuring out how the ordered transform works and was rather horrified by what I saw here.
If I’m correct, it creates order by
a) sampling the lowest value from the given distribution
b) creating the rest of the values by adding cumsum(exp(v_i)) to that first value.
The problem is that this does not result in a distribution anywhere close to the given one, for most distributions.
I get this is a pragmatic way to create an ordered variable, but is there really no reasonable way of preserving the distribution or even getting close to doing so? Or am I fundamentally misunderstanding something?
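For reference, this is roughly what I understand the current backward map to be doing (a sketch from memory, not the actual implementation):

x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], value[..., 0])            # first element kept as-is
x = pt.set_subtensor(x[..., 1:], pt.exp(value[..., 1:]))  # the rest become positive increments
return pt.cumsum(x, axis=-1)                              # cumulative sum enforces the ordering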
As a first guess, I would assume something like this might get at least closer to preserving the original distribution:
x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], value[..., 0])
x = pt.set_subtensor(x[..., 1:], pt.abs(value[..., 1:] - value[..., :-1]))
return pt.cumsum(x, axis=-1)
Granted, abs is not differentiable at 0, but that single kink generally does not cause any issues, at least in my experience. And it should definitely be closer to distribution-preserving, as it does not balloon the values onto the exponential scale.
Edit: After some experimentation and simulation, the following looks pretty good for smaller values (<5), at least for the Normal, LogNormal and Exponential distributions I tested it with:
x = pt.zeros_like(value)
x = pt.set_subtensor(x[..., 0], pt.min(value, axis=-1))
x = pt.set_subtensor(x[..., 1:], pt.abs(value[..., 1:] - value[..., :-1]) / 2)
return pt.cumsum(x, axis=-1)
Not really, because ordering “distorts” several univariate RVs into a “strange” multivariate RV. Similarly, if you have a distribution over the whole real line and apply a “positive” (log) constraint, you can never respect the original distribution. That’s why ordered is never a default transform.
You may find this other material useful:
In that Colab you can see what 3 non-IID ordered Normals look like:
It’s clear that they can’t correspond to any marginal normals
If you apply an ordering constraint and its Jacobian, then apply a standard normal prior, you get the same distributions marginally as if you generate standard normal variates and sort them. This is what’s known as an order statistic and the order statistics do not have the same distribution as the variables that were sorted to generate them.
The Wikipedia article works out how sorting uniform(0, 1) variates leads to order statistics with beta distributions.
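For example, a quick simulation sketch (plain numpy/scipy, nothing PyMC-specific) shows the k-th of n sorted uniform(0, 1) draws matching a Beta(k, n - k + 1):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 5, 2                                           # n uniforms, look at the k-th smallest
u = np.sort(rng.uniform(size=(100_000, n)), axis=-1)
kth = u[:, k - 1]

# The k-th order statistic of n iid uniform(0, 1) draws is Beta(k, n - k + 1)
print(kth.mean(), stats.beta(k, n - k + 1).mean())  # both around k / (n + 1) = 0.33
print(kth.std(), stats.beta(k, n - k + 1).std())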
Yes, I get that you get different distributions if you order the variables, and I think I've even encountered the derivation for discrete distributions.
But this is kind of what I meant, i.e. I'd expect the distribution of pm.Normal(transform=ordered) to match the distribution of normals after they were ordered.
It is very clear the current solution is very far off from that. I'm just wondering if it is possible to have a better transform that at least comes close?
It’s the same for IID univariate RVs. The general order statistic is not something you can implement just via a transform; you need to actually define a new multivariate distribution, with the logp obtained from order statistics.
You should get the same “bad” draws with the abs or exp in the cumsum, because we add the respective jacobian correction for the cumsum of exp. Can you share the code where you are not getting the same?
Edit: Note that to draw from this you have to use pm.sample, since transforms play no role in pm.sample_prior_predictive or pm.draw
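For example, something along these lines (a sketch; the initval just gives the sampler an already ordered starting point):

import pymc as pm
from pymc.distributions.transforms import ordered

with pm.Model():
    x = pm.Normal("x", size=3, transform=ordered, initval=[-1.0, 0.0, 1.0])
    idata = pm.sample(random_seed=0)                    # posterior draws of x come out ordered
    prior = pm.sample_prior_predictive(random_seed=0)   # transform ignored: plain unordered Normal draws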
So you can see that alpha and beta have the same distribution whether you constrain the variable or sort an unconstrained variable. This is with 4K draws in 4 chains by default.
Note the pattern of standard deviations with the central element most constrained. It has a mean of 0 being in the middle, but its standard deviation is only 0.5 or so.
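A quick way to see that pattern (plain numpy sketch; 5 elements assumed here just for illustration):

import numpy as np

rng = np.random.default_rng(0)
sorted_draws = np.sort(rng.standard_normal((100_000, 5)), axis=-1)

print(sorted_draws.mean(axis=0).round(2))  # symmetric means, middle element near 0
print(sorted_draws.std(axis=0).round(2))   # middle element has the smallest std, around 0.5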
Ok, now I am completely lost. How can these two very different transforms end up giving the same result? I guess I need a basic explanation of how transforms work, because if I just took the pm.Normal myself and ran it through those two transformations, the results would definitely be different. So I am missing something, I guess. Can you maybe elaborate a bit on how this code being in the transform differs from just being done in the model?
Did you change the log_jac_det of the transform when you changed the forward and backward methods?
If two transforms map to the same constrained space and have the right log jacobian determinant correction the draws will be identical. For instance, to map from the real line to the positives you can use the log transform or the log_exp_m1, and you should get identical results, subject to sampler convergence: pymc/pymc/distributions/transforms.py at main · pymc-devs/pymc · GitHub
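For example (a sketch; both models should recover the same posterior for x up to sampling noise):

import pymc as pm
from pymc.distributions import transforms

with pm.Model():
    x = pm.HalfNormal("x", transform=transforms.log)
    idata_log = pm.sample(random_seed=0)

with pm.Model():
    x = pm.HalfNormal("x", transform=transforms.log_exp_m1)
    idata_lem1 = pm.sample(random_seed=0)

# Both transforms map the real line to the positives and carry their own
# jacobian correction, so the draws of x agree up to sampler convergence.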
The transform has 2 purposes: 1. map to a constrained space and 2. account for the distortion effect so the original density is still respected.
In the case of the ordered transform, as @bob-carpenter mentioned, you should get the same draws whether you use the transform or generate forward draws and sort them. The same happens with Stan.
(Re: which one to use, NUTS may not like sorting the forward draws so I wouldn’t suggest it).
You can see that in my Colab notebook where I compare sorting forward draws with sampling from the prior with pm.sample. If you don’t have IID variables, ordering is more complex than just sorting and is equivalent to rejection sampling of the forward draws that aren’t sorted. That’s shown in the later examples in the notebook.
The fact that it is equivalent to rejection sampling means it is not doing anything other than enforcing the constraint. The cumsum of exp is just a handy trick to enforce the constraint in a continuous differentiable way.
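To make "equivalent to rejection sampling" concrete, here is a plain numpy sketch for the IID case: keeping only the forward draws that happen to come out ordered gives the same distribution as just sorting them.

import numpy as np

rng = np.random.default_rng(0)
draws = rng.standard_normal((300_000, 3))

accepted = draws[np.all(np.diff(draws, axis=-1) > 0, axis=-1)]  # reject draws that aren't already ordered
sorted_draws = np.sort(draws, axis=-1)                          # or simply sort every draw

print(accepted.mean(axis=0), sorted_draws.mean(axis=0))  # agree up to Monte Carlo error
print(accepted.std(axis=0), sorted_draws.std(axis=0))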
The implications of the constraint are perhaps the most unintuitive aspect? You seem to want something that behaves differently than just sorting/rejection sampling. You will have to specify what you want precisely, but it doesn’t look like it can be achieved by writing a different transform that simply maps to an ordered space.
Going back to your transform, it doesn't look like it's invertible? The abs destroys the information about which item in each successive pair was larger?
I think you misunderstand me a bit. I do not have a specific problem I’m encountering.
I just looked at the transform and I can’t figure out how it can work without distorting the distribution.
Because to my very limited understanding, transforms are basically equivalent to just doing the same thing afterwards in the model, i.e. I would expect
o = pm.Normal('v',size=10,transform=ordered)
to behave the same as
u = pm.Normal('v',size=10)
v = pt.zeros(10)
v = pt.set_subtensor(v[0],u[0])
v = pt.set_subtensor(v[1:], pt.exp(u[1:]))
v = pt.cumsum(v)
And I was a bit taken aback because the latter would definitely not be what I expect when I write the former.
I'm now reading between the lines that there is in fact a difference, but how exactly it is different still eludes me. Or how the determinant of the Jacobian comes into play, for that matter.
The reason I'm asking is actually rather pragmatic: I would like to write an ordered transform that allows ordering along some other dimension (similar to how zerosum_transform works), because in an ideal world I would actually like to have an ordered set of multivariate normals. But I'm increasingly getting the feeling I'm a bit out of my depth here.
Most transforms should do nothing; they are an implementation detail for NUTS to sample on the real line while respecting constrained priors. For positive distributions we use the log transform, so NUTS samples on the log space and we back-transform to the positive line, and ADD a jacobian correction term (which you are missing in your mental model of transforms), so that it all ends up as if we sampled the constrained distribution directly.
I have a small notebook exploring interval transform for uniform distribution: Google Colab
For those 99% of cases it's just a technical, sampler-specific implementation detail.
However, the same trick was exploited to allow the sampler to sample on a space that is ordered after constraining. This however can't be (and isn't) used as the default transform of any distribution, because there's no common ordered distribution out there that users would be interested in.
However, nothing stops users from applying a transform for a specific space to a distribution (or set of distributions) that is not naturally aligned with that space. You can, for example, use a log transform on a normal distribution, and you will obviously not get a normal prior, since negative values will never be proposed (it will be a half normal instead).
The same thing is happening here with ordered, which is being applied to a set of distributions that do not live in an ordered space (they don’t even live in a multivariate space to begin with). This can be useful, so we don’t forbid it, but it’s also a common source of confusion.
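A sketch of that last example (the initval just keeps the starting point positive):

import pymc as pm
from pymc.distributions.transforms import log

with pm.Model():
    x = pm.Normal("x", 0, 1, transform=log, initval=1.0)
    idata = pm.sample(random_seed=0)

# Every draw of x is positive; the marginal matches a HalfNormal(1), not a Normal(0, 1).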
They are not. A log transform on a HalfNormal distribution:
x = pm.HalfNormal("x", transform=log)
Is equivalent to:
x_log__ = pm.Flat("x_log__")  # just a trick to get a parameter without a prior associated with it
x = pm.Deterministic("x", pm.math.exp(x_log__))
log_jac_det = x_log__  # = log(x); the log|d exp(u)/du| correction for the backward exp map
x_logp = pm.logp(pm.HalfNormal.dist(), x)
pm.Potential("x_logp", x_logp + log_jac_det)
It is not equivalent to
x_ = pm.HalfNormal("x_")
x = pm.Deterministic("x", pm.math.exp(x_))
This example helps a lot, thank you. And it makes your previous comments make sense too.
Basically, if I understand correctly, the idea of the transform is to have an invertible map from an unconstrained space to the actual constrained sample space, then have the sampler work in the unconstrained space while still tracking probabilities in the constrained space.
I have to admit I'm still a bit fuzzy on how exactly the Jacobian determinant plays into it, but I get that it accounts for the change in geometry that the map brings about.
The Jacobian is from the multidimensional change-of-variables formula. Determinants measure volume, so the determinant of a Jacobian measures change-in-volume.
If X \in \mathbb{R}^N is a random variable and Y = f(X) \in \mathbb{R}^N, then
p_Y(y) = p_X(f^{-1}(y)) \cdot | J_{f^{-1}}(y) |,
where J_{f^{-1}}(y) = \frac{\partial}{\partial y} f^{-1}(y), and | A | is the absolute determinant of A.
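For example, with the log transform discussed earlier in the thread: if X \sim \textrm{normal}(0, 1) and Y = f(X) = \exp(X), then f^{-1}(y) = \log y and | J_{f^{-1}}(y) | = 1 / y, so
p_Y(y) = p_X(\log y) \cdot \frac{1}{y},
which is exactly the standard lognormal density.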