Complex ordering mixture models

Hi there.

I'm hoping to impose a more complex set of ordering relations than I think the ordered transform is prepared to handle.

First I'd like to figure out what exactly is going on in the forward call of the ordered transform. To my understanding we're subtracting the last column's value (the largest?) from every other column. I'm not entirely sure how this prevents in-between reorderings, or entirely what's going on here, but I'm a pymc newbie too.

Ultimately, what I want is a set of Mx2 normal distributions where the following ordering relations hold:

u[m,0] < u[m, 1]
u[m,0] < u[m+1, 0]
u[m,1] < u[m+1,1]

No ordering between u[m, 1] and u[m+1, 0] is guaranteed.
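
To pin down what I mean, here's a quick numpy check of those relations on an Mx2 array (just a sketch to make the constraints concrete):

import numpy as np

def satisfies_ordering(u):
    # u is an (M, 2) array
    within_rows = np.all(u[:, 0] < u[:, 1])     # u[m, 0] < u[m, 1]
    down_col0 = np.all(u[:-1, 0] < u[1:, 0])    # u[m, 0] < u[m + 1, 0]
    down_col1 = np.all(u[:-1, 1] < u[1:, 1])    # u[m, 1] < u[m + 1, 1]
    return bool(within_rows and down_col0 and down_col1)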

I'm not sure what to try beyond looking at ordered's implementation; the general take on potentials is that they won't work well for this, especially with this many relationships.

import warnings

import pytensor.tensor as pt
from pymc.logprob.transforms import Transform  # Transform base class in recent pymc versions

class Ordered(Transform):
    name = "ordered"

    def __init__(self, ndim_supp=None):
        if ndim_supp is not None:
            warnings.warn("ndim_supp argument is deprecated and has no effect", FutureWarning)

    def backward(self, value, *inputs):
        x = pt.zeros(value.shape)
        x = pt.set_subtensor(x[..., 0], value[..., 0])
        x = pt.set_subtensor(x[..., 1:], pt.exp(value[..., 1:]))
        return pt.cumsum(x, axis=-1)

    def forward(self, value, *inputs):
        y = pt.zeros(value.shape)
        y = pt.set_subtensor(y[..., 0], value[..., 0])
        y = pt.set_subtensor(y[..., 1:], pt.log(value[..., 1:] - value[..., :-1]))
        return y

    def log_jac_det(self, value, *inputs):
        return pt.sum(value[..., 1:], axis=-1)

The backward pass is the one that ensures you get ordered draws from unconstrained draws. The forward takes you from the constrained space to the unconstrained one.
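
A quick numpy sketch of the same arithmetic (not the pymc code itself, just the idea) shows why the backward output is always ordered:

import numpy as np

value = np.random.randn(5)        # unconstrained draw
x = value.copy()
x[1:] = np.exp(value[1:])         # every entry after the first becomes positive
draw = np.cumsum(x)               # running sum of positive increments

assert np.all(np.diff(draw) > 0)  # strictly increasing, i.e. ordered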

ok so forward recovers the values from the cumsum by taking the difference between neighbors - not sure how I missed that, but I think some others have too. It's tight code though.

The fun bit is that we exp everything but the top-left entry in each of the value submatrices.

Will pymc deal with a 2d shape of RV tensors or will I need to reshape it back/forth from a 2d version?

It looks like what I want to do is cumsum up column 0 (in the 2d interpretation of the subtensor tiles), then add that to column 1. There I realize I have a few different options, or rather nothing works in exactly the desired way: cumsum column 1 and leave that value alone, or take the cumsum of column 1 and also cumsum across the rows. It probably makes the most sense to cumsum just column 0 and add it to the exp(value[…]) of column 1 (no cumsum)… ah, but then it really seems like there's no ordering preserved between the column 1 rv and the next row's column 0.

So like (if 2d subtensor shapes are allowed), with x the already-exp'd positive increments:

x_r = np.cumsum(x, axis=-2)
x_r[..., :, 1] = 0
x_c = np.cumsum(x, axis=-1)

fwdresult = x_r + x_c

should satisfy:
u[m,0] < u[m, 1]
u[m,0] < u[m+1, 0]

But I don’t get:
u[m,1] < u[m+1,1]

fwdresult = np.cumsum(x, axis=-2)

Should satisfy
u[m,0] < u[m+1, 0]
u[m,1] < u[m+1,1]

But I don’t get:

u[m,0] < u[m, 1]

Seems like there is one more trick needed to get that 3rd relation, but I only see adding the same count of exp(value) terms to both the u[m, 1] entry and the u[m+1, 0] entry, so they'd end up in basically the same neighborhood.

Maybe @lucianopaz or @aseyboldt will spot a nice solution immediately

I think something based on cumsum cannot work, unfortunately. The argument why involves some convex geometry, and that's not really my field, so take this with a grain of salt…

We can find out a bit more about the space of vectors u that fulfill your requirements.
First, we flatten u into a one-dimensional vector x (i.e. just u.ravel()). We can then find a matrix A such that x is valid if and only if Ax > 0. The matrix A has one row per constraint, and each row contains one -1 and one 1:

import numpy as np
from scipy import sparse

N = 50
idxs = np.arange(2 * N, dtype=int).reshape((N, 2))

# Find the indices of `x` for each inequality
ineqs = []
ineqs.extend(zip(idxs[:, 0], idxs[:, 1]))       # u[m, 0] < u[m, 1]
ineqs.extend(zip(idxs[:-1, 0], idxs[1:, 0]))    # u[m, 0] < u[m + 1, 0]
ineqs.extend(zip(idxs[:-1, 1], idxs[1:, 1]))    # u[m, 1] < u[m + 1, 1]

# Collect all inequalities in a sparse matrix: row i encodes x[b] - x[a] > 0
A = sparse.lil_array((len(ineqs), 2 * N), dtype=np.int8)
for i, (a_idx, b_idx) in enumerate(ineqs):
    A[i, a_idx] = -1
    A[i, b_idx] = 1

In this case A.shape == (148, 100), so we have 148 = N + 2(N - 1) inequalities, and n = 2N = 100 dimensions.

Spaces like C = \{x\in \mathbb{R}^n\mid Ax > 0\} are called polyhedral (convex) cones and are studied in convex geometry, and there is quite a bit of literature (and software) for those. (Convex cone - Wikipedia).
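
As a tiny concrete case: for a single constraint u_0 < u_1 the cone is \{x \in \mathbb{R}^2 \mid x_1 - x_0 > 0\}, and every point in it can be written as a(1, 1) + b(0, 1) with a \in \mathbb{R} and b > 0, so it is generated by one line and one ray (the same kind of line generator shows up below).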

There are two common ways to represent those cones: one is using the matrix A, but we can also find a set of generating vectors v_i such that

C = \{ x \in \mathbb{R}^n \mid x = \sum_i \lambda_i v_i,\ \lambda_i > 0 \}

So those v_i act like a basis, and you can reach every point in C using a linear combination of the v_i with positive coefficients. But there is a crucial difference between a basis and this set of generating vectors: with a basis, there is only ever one way to reach each point in the space, while for a polyhedral cone the coefficients might not be unique, and there might be (considerably) more vectors v_i than n. We can find the minimal set of v_i using specialized software like ppl or normaliz, for instance:

import ppl

variables = [ppl.Variable(i) for i in range(2*N)]
cs = ppl.Constraint_System()

for idx1, idx2 in ineqs:
    cs.insert(variables[idx2] - variables[idx1] >= 0)

generators = ppl.C_Polyhedron(cs).minimized_generators()

line, point, *rays = generators
assert line.is_line()
assert point.is_point()
for ray in rays:
    assert ray.is_ray()

rays = np.array([ray.coefficients() for ray in rays]).astype(float)
line = np.array(line.coefficients()).astype(float)

And here rays.shape == (1324, 100), so unfortunately we have quite a few more generators than n. (We also get an additional line generator, with which ppl is telling us that we can just add a real number to all elements of x, and a point generator. I had no clue what that one is about, it looks like this: point(0/1, 0/1, 0/1, 0/1,...) (edit: it means the origin point, and uses rational numbers to show it…) and I'd be curious if someone knows more.) For each individual point x\in C we can always use only n of those generators to reach it (i.e. set all but n coefficients to zero), but which ones we are allowed to set to zero will depend on x.

Ideally we would like to find a linear bijective function that maps C to \mathbb{R}_+^n, that we can use as a transformation during sampling. (cumsum is linear, so what you are trying to do with that should be a special case of the linear transformation.) But because the number of generators is larger than n, such a linear function can't exist (if it did, then the image of the basis vectors e_i would give a smaller generating set than the minimal one we found).

So I guess that leaves a few options to solve this problem:

  1. Simply don't map the parameter space to something that is unconstrained. You can use a custom dist or a potential to set the logp to -inf if a draw is outside the allowed parameter space (something like pm.Potential("constrain_u", pt.switch(pt.all(pt.ge(A @ u.ravel(), 0)), 0, -np.inf)); see the sketch after this list) and provide a starting point for the sampler that is valid. You then pretty much have to pray that the posterior is far away from the border of the valid parameter space, because otherwise the sampler will diverge.
  2. Use the generating set of vectors (especially if N is not too big). This is an overparametrization, so if your dataset constrains the parameters u very well, this will probably lead to sampling issues because of posterior correlations. You can get draws from u using something like this (but you should probably do some prior predictive sampling to figure out good priors on lam and const):
lam_log = pm.Normal("lam_log", shape=len(rays))
lam = pt.exp(lam_log)
const = pm.Normal("const")
u = ((rays.T @ lam) + const * line).reshape((N, 2))

# or plain numpy to get a feel for it (the output shown is from a smaller run, N = 5):
lam = np.exp(np.random.randn(len(rays)))
const = np.random.randn()

(rays.T @ lam + const * line).reshape((N, 2))
array([[-23.42348561, -19.39325472],
       [-21.19383067, -12.53969962],
       [-19.18474236,  -0.54548926],
       [-12.09226225,  11.05295243],
       [  1.92443848,  13.05133014]])
  3. Somehow figure out a non-linear transformation?
  4. Prove me wrong and show that there is a linear one after all :slight_smile:
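
Here is a slightly fuller sketch of option 1 (my illustration only, assuming u is an (N, 2) normal and A is the sparse constraint matrix from above):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as model:
    u = pm.Normal("u", 0.0, 10.0, shape=(N, 2))
    # logp contribution: 0 inside the cone, -inf outside
    inside = pt.all(pt.ge(pt.as_tensor_variable(A.toarray()) @ u.ravel(), 0))
    pm.Potential("constrain_u", pt.switch(inside, 0.0, -np.inf))

    # the initial point must satisfy the constraints, e.g. strictly increasing values
    init_u = np.arange(2 * N, dtype="d").reshape((N, 2))
    # idata = pm.sample(initvals={"u": init_u})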

Personally, I’d probably try to get away with the second if N isn’t too large, and only if that doesn’t work try to do the first.


And if anyone wants a graphical illustration of what's happening here:

No idea what that's supposed to be, but that's what gpt came up with, and it kinda looks cool I guess?

[Image: GPT-generated illustration of the matrix A and the resulting polyhedral cone: the matrix with entries highlighted next to a 3D rendering, connecting the constraints in the matrix to the shape of the cone.]


@aseyboldt the original transform is not linear either; it's a cumsum of the exponentiated values. I think the author had the same kind of process in mind when they mentioned cumsum.

Does that change the difficulty? Is there an easy non-linear bijective transform that could satisfy the constraints?

It is a linear function from \mathbb{R}^n_+ (so only positive values) to C; the exponential just makes sure we have a point in the positive orthant. I also didn't prove it formally, so maybe I'm missing something, but I'd be very surprised if there were such a function.
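
To see concretely why cumsum is linear: it's just multiplication by a lower-triangular matrix of ones (plain numpy, my illustration):

import numpy as np

n = 4
x = np.exp(np.random.randn(n))    # exp lands us in the positive orthant
L = np.tril(np.ones((n, n)))      # cumsum as a matrix
assert np.allclose(np.cumsum(x), L @ x)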

But I think I've got an idea about what a nonlinear function might look like (only a rough sketch):

First, we can add u[0, 0] > 0 (we could probably also just set it to zero and fix that with an additional parameter later…)

Then, we add another requirement, sum(x) = 1. This makes sure that our space is a bounded polyhedron.

We can build the matrix A:

ineqs = []
ineqs.extend(zip(idxs[:, 0], idxs[:, 1]))       # u[m, 0] < u[m, 1]
ineqs.extend(zip(idxs[:-1, 0], idxs[1:, 0]))    # u[m, 0] < u[m + 1, 0]
ineqs.extend(zip(idxs[:-1, 1], idxs[1:, 1]))    # u[m, 1] < u[m + 1, 1]

A = sparse.lil_array((len(ineqs) + 1, 2 * N), dtype=np.int8)
for i, (a_idx, b_idx) in enumerate(ineqs):
    A[i, a_idx] = -1
    A[i, b_idx] = 1
A[-1, 0] = 1  # the extra u[0, 0] > 0 constraint

Note that this leaves out the sum(x) = 1 property; A only contains the inequalities.

Based on this we can now compute the centroid:

import ppl

variables = [ppl.Variable(i) for i in range(2*N)]
cs = ppl.Constraint_System()

for row in A.toarray():
    cs.insert(sum(int(val) * variables[i] for i, val in enumerate(row)) >= 0)

cs.insert(sum(variables) == 1)

generators = ppl.C_Polyhedron(cs).minimized_generators()

for point in generators:
    assert point.is_point()

# All ~1300 vertices of the polyhedron
points = (
    np.array([point.coefficients() for point in generators]).astype("d")
    / np.array([point.divisor() for point in generators])[:, None].astype("d")
)

centroid = points.mean(0)

(We could probably do this somehow without having to enumerate all vertices)
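
One way to avoid the vertex enumeration (my sketch, not from ppl): any strictly interior point works as the center, and the Chebyshev center of the polyhedron can be found with a small linear program:

import numpy as np
from scipy import optimize

A_ = A.toarray().astype("d")
row_norms = np.linalg.norm(A_, axis=1)

# Variables (x, r): maximize r such that A x >= r * ||a_i|| and sum(x) == 1,
# i.e. find the center of the largest inscribed ball.
c = np.zeros(2 * N + 1)
c[-1] = -1.0                                 # linprog minimizes, so minimize -r
A_ub = np.hstack([-A_, row_norms[:, None]])  # -a_i @ x + r * ||a_i|| <= 0
b_ub = np.zeros(A_.shape[0])
A_eq = np.ones((1, 2 * N + 1))
A_eq[0, -1] = 0.0                            # the radius doesn't enter sum(x) == 1
b_eq = np.array([1.0])

res = optimize.linprog(
    c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
    bounds=[(None, None)] * (2 * N) + [(0, None)],
)
center = res.x[:-1]  # could stand in for the centroid below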

Now we map this polyhedron to X = \mathbb{R}^n_{\sum = 0} (the subspace where all elements sum to zero). We identify the centroid with 0. Given a point x \in X we use it to encode a direction in C from the centroid and a length, by splitting it into its norm and a unit-norm vector. Given the direction, we can quite easily compute at what distance we run into one of the sides of the polyhedron (the minimum distance, along the given direction, to any of the planes given by A and the centroid):

offset = np.random.randn(2*N)
offset = offset - offset.mean()  # This could be done using the zerosumnormal

offset_norm = np.linalg.norm(offset)
offset_direction = offset / offset_norm

A_ = A.toarray().astype("d")

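# Each row a_i of A defines a plane a_i @ x = 0. Starting at the centroid and
# walking along offset_direction, that plane is reached at
# t = -(a_i @ centroid) / (a_i @ offset_direction); negative values mean the
# plane lies behind us, so they get masked out with inf below.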
dists = -(A_ @ centroid) / (A_ @ offset_direction)
dists = np.where(dists < 0, np.inf, dists)

min_dist = dists.min()

We then map to C by walking tanh(length) * min_dist in the given direction, so that for length \to \infty we will reach the border of the polyhedron.

point = centroid + np.tanh(offset_norm) * min_dist * offset_direction

We can then scale the point by an arbitrary positive value to get rid of the sum(x) = 1 requirement again.
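
Pulled together into one helper (my sketch, same variables and assumptions as above):

import numpy as np

def to_polytope(offset, A_, centroid, scale=1.0):
    # Map an unconstrained vector into the interior of {x : A_ @ x >= 0}.
    offset = offset - offset.mean()       # project onto the sum-zero subspace
    norm = np.linalg.norm(offset)
    if norm == 0:
        return scale * centroid
    direction = offset / norm
    dists = -(A_ @ centroid) / (A_ @ direction)
    min_dist = np.where(dists < 0, np.inf, dists).min()
    point = centroid + np.tanh(norm) * min_dist * direction
    return scale * point                  # a positive scale undoes sum(x) == 1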

It's a bit of a mess currently, but this might also work. And surprisingly it scales only with the number of inequalities, not the number of generators (assuming we can avoid using them to compute the centroid, or we use any other valid point as the center).

The map unfortunately doesn't have a smooth derivative because of the min over the plane distances, but maybe that too can be fixed.

And sorry for the rough explanation, I might revisit this a bit later. It would be nice if we could generalize this for other convex constraints…