Feature proposal: poset_order transform

Not sure who to ask, so this seemed like as good a place as any.

For my own usage, I have built a version of the ordered transform that allows one to specify a partial order with an adjacency matrix and then forces the values to conform to it.

My question is: would this be something useful for the whole community? If so, I could clean the code up and open a PR.

On the one hand, it sounds rather niche. On the other hand, it is a very flexible tool, so it might have quite a few uses. My own use right now is inferring a Thurstonian rank-order model when the inputs have ties, and combining the answers from multiple maxdiff-style questions, but my guess is there would be other uses if the tool were added to the toolbox.

Sounds useful, can we see what it looks like? :)

Sure :slight_smile:

The code in its present form works with adjacency lists instead of an adjacency matrix and assumes the values are already topologically sorted.

So “cleaning up” the code would mean running a toposort plus a transitive reduction on the adjacency matrix and then converting it to the adjacency-list form used at present. Probably also a class method to generate initial values, since they need to conform to the order just like for the regular ordered transform, and that requires them to be in the same order as the toposort.
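Very roughly, the preprocessing could look something like the sketch below. This is not the actual cleanup code (it also skips the transitive-reduction step); it assumes a dense 0/1 matrix adj where adj[i, j] == 1 means node i must be smaller than node j, and the function name is just for illustration:

import numpy as np

def adjacency_matrix_to_parent_lists(adj):
    """Topologically sort the nodes and return (order, parent_lists), where parent_lists
    is an n x k array of parent indices (in the new order), padded with -1."""
    adj = np.asarray(adj, dtype=bool)
    n = adj.shape[0]

    # Kahn's algorithm for the topological sort
    indegree = adj.sum(axis=0).astype(int)
    order, ready = [], [i for i in range(n) if indegree[i] == 0]
    while ready:
        i = ready.pop()
        order.append(i)
        for j in np.flatnonzero(adj[i]):
            indegree[j] -= 1
            if indegree[j] == 0:
                ready.append(j)
    if len(order) != n:
        raise ValueError("adjacency matrix contains a cycle")

    # Relabel nodes so parents always precede children, then build padded parent lists
    pos = np.empty(n, dtype=int)
    pos[order] = np.arange(n)
    parents = [sorted(pos[p] for p in np.flatnonzero(adj[:, i])) for i in order]
    k = max(1, max(len(p) for p in parents))
    parent_lists = np.full((n, k), -1, dtype=int)
    for row, p in zip(parent_lists, parents):
        row[: len(p)] = p
    return np.array(order), parent_lists

The transform itself, in its current form: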

import numpy as np
import pytensor.tensor as pt

from pymc.logprob.transforms import Transform

def get_minval(dtype):
    # Most negative representable value for dtype (a stand-in for -inf)
    return np.iinfo(dtype).min if np.issubdtype(dtype, np.integer) else np.finfo(dtype).min

class DagOrdered(Transform):
    name = "dag_ordered"

    def __init__(self, dag):
        """Create a DagOrdered object.

        Assumes values are in topological sort order.

        Parameters
        ----------
        dag : array
            Adjacency lists for the dag as an n x k array of parent indices,
            with -1 standing in for None.
        """
        self.dag = dag
        # Root nodes have no parents, i.e. every parent slot is -1
        self.is_start = np.all(self.dag == -1, axis=-1)

    def backward(self, value, *inputs):
        # Extra slot at the end holds a sentinel that the -1 parent indices point to
        x = pt.zeros(self.dag.shape[:-2] + (self.dag.shape[-2] + 1,))
        # Functional infinity, as a real inf creates nan when multiplied with 0
        x = pt.set_subtensor(x[..., -1], get_minval(value.dtype))

        # Indices to allow broadcasting the max over the last dimension
        idx = np.ix_(*[np.arange(s) for s in self.dag.shape[:-2]])
        idx = tuple(np.tile(i[:, None], self.dag.shape[-1]) for i in idx)

        # Has to be done stepwise, as later steps depend on previously computed values
        for i in range(self.dag.shape[-2]):
            ist = self.is_start[..., i]
            # Maximum over this node's parents (roots only see the -inf sentinel)
            mval = pt.max(x[(Ellipsis,) + idx + (self.dag[..., i, :],)], axis=-1)
            x = pt.set_subtensor(
                x[..., i], ist * value[..., i] + (1 - ist) * (mval + pt.exp(value[..., i]))
            )
        return x[..., :-1]

    def forward(self, value, *inputs):
        # Pad the values with the "functional -inf" sentinel in the last slot
        vx = pt.zeros(self.dag.shape[:-2] + (self.dag.shape[-2] + 1,), dtype=value.dtype)
        vx = pt.set_subtensor(vx[..., :-1], value)
        vx = pt.set_subtensor(vx[..., -1], get_minval(value.dtype))

        # Indices to allow broadcasting the max over the last dimension
        idx = np.ix_(*[np.arange(s) for s in self.dag.shape[:-2]])
        idx = tuple(np.tile(i[:, None, None], self.dag.shape[-2:]) for i in idx)

        # Roots map to themselves; non-roots store the log of their margin over the parents' maximum
        y = self.is_start * value + (1 - self.is_start) * pt.log(
            value - pt.max(vx[(Ellipsis,) + idx + (self.dag,)], axis=-1)
        )
        return y

    def log_jac_det(self, value, *inputs):
        # The Jacobian is triangular with diagonal 1 for roots and exp(value) for non-roots
        return pt.sum(value * (1 - self.is_start), axis=-1)

And here is a current usage example:

# Basic test of the DagOrdered transform
import numpy as np
import pymc as pm

# Parent lists for two batches of 4 nodes: a diamond (0 < {1, 2} < 3) and a chain (0 < 1 < 2 < 3)
dag = np.array([[[-1, -1, -1], [0, -1, -1], [0, -1, -1], [1, 2, -1]],
                [[-1, -1, -1], [0, -1, -1], [1, -1, -1], [2, -1, -1]]])
dord = DagOrdered(dag)

with pm.Model() as model:
    x = pm.Normal('x', size=(2, 4), transform=dord,
                  initval=np.array([[0, 0.1, 0.2, 0.3], [0, 0.1, 0.2, 0.3]]))
    idata = pm.sample()
idata.posterior.x.mean(['chain', 'draw'])
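A quick (hypothetical) sanity check that every draw respects the encoded partial order, assuming the default dims so that idata.posterior.x has shape (chain, draw, 2, 4):

xs = idata.posterior.x.values
row0, row1 = xs[..., 0, :], xs[..., 1, :]
# Diamond: 0 < 1, 0 < 2, max(1, 2) < 3
assert (row0[..., 1] > row0[..., 0]).all() and (row0[..., 2] > row0[..., 0]).all()
assert (row0[..., 3] > row0[..., 1]).all() and (row0[..., 3] > row0[..., 2]).all()
# Chain: 0 < 1 < 2 < 3
assert (row1[..., 1:] > row1[..., :-1]).all()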

@ricardoV94 any verdict on whether this is interesting, and if so, whether it should be included in pymc or pymc_extras?

Still probing interest but we can add it to pymc-extras without much concern

fwiw the maxdiff application sounds interesting. A fleshed-out example might make it more directly compelling? @Velochy?

Thank you for showing interest.

The trouble is, maxdiff is a pretty complicated model, and I’m not working with a standalone example but in an existing toolkit framework that handles a lot of the complexity for me.

Creating a standalone example of this would be a lot of work, but more importantly, it would still likely be so long and complex as to not be very educational except to those very determined to figure it out.

I’ll need to think if I can simplify it down enough to be educational. Maybe I can. I definitely agree a good practical example to go with the code would likely be very useful :slight_smile:

Created a PR for the pymc_extras repo: Added a PartialOrder transform by velochy · Pull Request #444 · pymc-devs/pymc-extras · GitHub

That’s really interesting. You can just have lower and upper bounds for each of the items composed of the maxes of smaller elements as a lower bound and mins of larger elements as upper bounds. Have you seen what it does to posterior geometry? Sometimes these hard interval constraints can be nasty for Hamiltonian Monte Carlo sampling and different choices of parameterization can make a huge difference (we just found that with sum-to-zero constraints and simplex constraints, where there were order of magnitude performance differences among alternative transforms).

If the underlying values are continuous, how do you get ties? I’d have thought that’d be a measure-zero subset you didn’t have to worry about.

Out of curiosity, do you know how this Thurstonian model compares to Plackett-Luce, which, as far as I can tell, is trying to do something similar? The Wikipedia page points to Bradley-Terry, which is the 2D specialization of Plackett-Luce (or you can think of PL as the general ranking generalization of Bradley-Terry).

The Wikipedia page has some really bad advice—nobody should be estimating this model with Gibbs!

You could knock me over with a feather right now

Actually, I only do one-way bounding, as I create every non-root node as max(parents) + exp(rv).
This is mainly because I built this by analogy to the original ordered transform.
Also because the Jacobian is very simple in this case, which I believe it would not be for two-way bounding. There is a major downside of course: as it currently stands, the backward transform needs a for-loop, which is likely to be rather inefficient for long orders.
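Spelled out (this just restates what the backward and log_jac_det code above compute, with v the unconstrained values and x the constrained ones):

$$
x_i = \begin{cases} v_i & i \text{ a root} \\ \max_{j \in \mathrm{pa}(i)} x_j + \exp(v_i) & \text{otherwise} \end{cases}
\qquad
\log\left|\det \frac{\partial x}{\partial v}\right| = \sum_{i \notin \text{roots}} v_i
$$

The second equality holds because, in topological order, the Jacobian is lower triangular with diagonal entries 1 (roots) or exp(v_i).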

As for posterior geometry, I’m not sure how to properly explore that, but the examples I have run with it (which are mainly partial orders on around 10 elements) seem to sample just fine, so I’m guessing it’s not too bad? I would be very interested in any advice and pointers you can give here to verify this and/or to make it better.

As for ties, I think I might have phrased things in a slightly misleading way. The partial order itself does not allow for ties (you are right that for continuous RVs they form a measure-zero set). What I meant was that the inputs I get have ties, which I interpret not as proper ties but as measurement error. To clarify, I’m basically trying to extract an order out of a set of Likert scales, so a person might say +2 for a and c, +1 for b, 0 for e, and -2 for d and f, in which case I model this information as the partial order {a,c} > b > e > {d,f}, where a and c are incomparable to each other, as are d and f. This seems like a sensible assumption under the circumstances.
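To make that concrete, here is a hypothetical helper (names and layout are just for illustration, not the toolkit code) that turns one respondent's Likert answers into the parent-list format used by the transform above, linking each tier only to the tier directly below it:

import numpy as np

def likert_to_parent_lists(scores):
    """scores: dict of item -> Likert value. Returns (items, dag), where items are in
    topological (lowest tier first) order and dag is an n x k parent-list array padded with -1."""
    # Group items into tiers; items within a tier are incomparable
    tiers = {}
    for item, s in scores.items():
        tiers.setdefault(s, []).append(item)
    tier_items = [tiers[s] for s in sorted(tiers)]

    items = [it for tier in tier_items for it in tier]
    index = {it: i for i, it in enumerate(items)}

    # Each item's parents are all items of the previous (lower) tier
    parents, prev = [], []
    for tier in tier_items:
        for it in tier:
            parents.append([index[p] for p in prev])
        prev = tier

    k = max(1, max(len(p) for p in parents))
    dag = np.full((len(items), k), -1, dtype=int)
    for row, p in zip(dag, parents):
        row[: len(p)] = p
    return items, dag

items, dag = likert_to_parent_lists({"a": 2, "c": 2, "b": 1, "e": 0, "d": -2, "f": -2})
# items == ['d', 'f', 'e', 'b', 'a', 'c']; e has parents {d, f}, b has parent {e}, a and c have parent {b}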

If you wanted to actually model ties, I would look at Henderson, “Modelling and analysis of rank ordered data with ties via a generalized Plackett-Luce model” (2022), which replaces the continuous distribution with a discrete geometric distribution, giving ties a proper probability. Despite him also deriving a Gibbs sampler, the model looks very amenable to HMC, and it could make for a nice Summer of Code project for someone to implement.

As for how the model compares to Plackett-Luce: great question. I am working on a tool for questionnaire analysis, and so far we have mainly been using PL for our order-based questions (maxdiff, top 3, etc.), but because that model is probabilistic in its output, we ran into a wall when we tried to extract more complex order statistics from the posterior (such as the probability of a choice being in the top N for a respondent). The allure of the Thurstone model is that it returns an actual order per draw (a double np.argsort over the real values), so we can calculate any statistic we want simply by averaging over the posterior.
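As a rough illustration of what I mean (assuming the latent utilities sit in idata.posterior.x with the item axis last):

import numpy as np

x = idata.posterior.x.values               # (chain, draw, ..., item)
ranks = x.argsort(-1).argsort(-1)          # double argsort: rank 0 = least preferred
n_items, N = x.shape[-1], 3
p_top_n = (ranks >= n_items - N).mean(axis=(0, 1))   # P(item in top N), averaged over the posterior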

I first implemented an (allegedly common) Thurstone approximation where, for each comparable pair a > b, I add the log of the normal CDF of (mu[a] - mu[b]) / (sd[a] + sd[b]) to the potential. This gives very similar results to PL. I am now trying to implement a proper Thurstone model, and my first tests already tell me it is subtly but noticeably different from the approximation, but I will need to put a fair bit more work in before I can run it on our more complex datasets to give a proper comparison. So I guess ask me again in a couple of months (as this is a low-priority side project, unfortunately).
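A minimal sketch of that pairwise approximation (not the toolkit code; the pairs, shapes, and priors are made up for illustration):

import numpy as np
import pymc as pm

pairs = np.array([[0, 1], [0, 2], [1, 3]])   # each row (a, b) means a was preferred over b
n_items = 4

with pm.Model() as approx_model:
    mu = pm.Normal("mu", 0.0, 1.0, size=n_items)
    sd = pm.HalfNormal("sd", 1.0, size=n_items)
    z = (mu[pairs[:, 0]] - mu[pairs[:, 1]]) / (sd[pairs[:, 0]] + sd[pairs[:, 1]])
    # Add log Phi(z) for every comparable pair a > b
    pm.Potential("pairwise", pm.logcdf(pm.Normal.dist(0.0, 1.0), z).sum())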

That makes sense if the final range is unbounded above or below.

The Jacobian for two-way bounding is also simple if you lay them out in order. So if you have constant1 < A < B, C < D < constant2, then you can set it up so that A, B, C, and D get lower bounds, but only D gets an upper bound. It’s a bunch of independent transforms, so the Jacobian’s just the product of the individual ones, which is why the upper and lower bounds for D aren’t a problem.
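Concretely, with the standard lower-bound and interval transforms (just sketching one way to lay it out):

$$
x = l + e^{v} \;\Rightarrow\; \log\left|\frac{dx}{dv}\right| = v,
\qquad
x = l + (u - l)\,\sigma(v) \;\Rightarrow\; \log\left|\frac{dx}{dv}\right| = \log(u - l) + \log\sigma(v) + \log(1 - \sigma(v)).
$$

As long as the variables are processed in topological order the full Jacobian stays triangular, so the log-determinant is just the sum of these per-coordinate terms, even when l and u themselves depend on earlier variables.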

That makes sense. Any time you round a continuous measurement this is potentially a problem. We often get rounded measurements of continuous processes that have this issue.

With real people, I’d expect to see inconsistencies if you don’t have all of the items presented for ranking at once.

I’m acquainted with Plackett-Luce, but not the generalized form. I’ll take a look. Other than combinatorial discreteness or multimodality, most models are amenable to HMC.

This is straightforward to do. Each posterior draw has a ranking and you can combine them into an expectation with uncertainty in the usual way. For any individual, you can get a posterior distribution over their ranks, then look at things like the expectation that the rank is above a given level.

If you go back to the original Plackett paper, it starts with the motivation of extracting whether a horse would win (finish first) -or- place (finish second) -or- show (finish third)—he literally starts the paper with the horse race example.

Nice—those can be the most fun.

You do have a point about the Jacobians still remaining tractable then, yes. But like you said, it’s a combination of two independent transforms: in this case my partial_order + bounded_transform. So there is no point in me building the bounding into the partial-order transform itself, and it makes sense to keep mine unbounded on both ends :slight_smile:

Yep. And we do. My plan is to resolve them in favor of earlier responses, as these are more likely to have thought put into them, but yes, this is a genuine issue for all order models :slight_smile:

Thank you for pointing it out. I had a look at it and it’s a pretty interesting read. But he still does not seem to get around the factorial complexity of the computation as we move from the first 3 to maybe the first 7?

In any case, my point was not that it is impossible to calculate the order statistics, but rather that it is a lot more work to figure out for each order statistic, whereas for Thurstone the computation is very straightforward for pretty much anything you can imagine. So it is a lot more convenient in that sense :slight_smile:

They can. I only wish I did not have 10 of them going on in parallel :smiley:

@bob-carpenter I actually got to the point where I can run our analyses with the new model.
So here is a plot of average values from three different models:
The data is from a maxdiff survey where we asked each respondent to choose their most and least favorite options from randomly generated blocks of 5 on 8 different screens. In total there were 18 options to choose from.

Here are the results. Each boxplot is for the favorability of one item, ordered from most liked to least.

Light blue - the old model, i.e. the pairwise approximation to the Thurstone model, with separate potentials for each of the 8 screens.
Dark blue - still the pairwise approximation, but now the partial orders from the 8 screens are combined into a single partial order per respondent.
Red - the full Thurstone model using the PartialOrder transform on the combined order.

As you can see, the results broadly align, but there are some notable differences in opinion in the middle.

The PO-based model currently also shows wider error bars, although I believe this is a byproduct of an extra degree of freedom (as I use zerosum for the Gaussian approximation). I’ll try combining the transforms and see if it changes the picture.