Breaking symmetries in Bayesian MDS

I have an implementation of Bayesian multi-dimensional scaling (MDS), which suffers from the problem (similar to Unique solution for probabilistic PCA) that there are infinitely many equally probable coordinates for the items, because MDS solutions are invariant to translation, rotation, and reflection. My current approach is to put a prior on one item nudging it towards the origin (eliminating translation symmetry), a prior on a second item nudging it towards the first principal axis (eliminating rotation symmetry), and I’m not sure how to handle reflection.
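(For concreteness, a minimal sketch of that kind of soft nudging, assuming a PyMC model; the item indices, scales, and names are placeholders of my own, not part of the actual implementation.)

```python
import pymc as pm

n_items, n_dims = 20, 2

with pm.Model() as model:
    # broad prior on all latent coordinates
    X = pm.Normal("X", mu=0.0, sigma=5.0, shape=(n_items, n_dims))

    # nudge item 0 towards the origin (soft translation anchor)
    pm.Potential("near_origin", -0.5 * pm.math.sum(X[0] ** 2) / 0.1 ** 2)

    # nudge item 1 towards the first axis, i.e. its second coordinate towards 0
    # (soft rotation anchor); reflection is still unresolved here
    pm.Potential("near_first_axis", -0.5 * X[1, 1] ** 2 / 0.1 ** 2)
```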

If this is just a matter of different chains giving me different solutions, I can always align them after sampling. But I am concerned that different symmetric solutions might get explored within-chain, in which case my means and covariances will be meaningless.

Are there any general principles for dealing with this sort of thing?

The short answer is “constrain that second point to be 0 with regards to one axis, and strictly positive with regards to the other”.

Then the longer answer/commentary.

I know nothing about Bayesian MDS, other than that I googled MDS, read the Wikipedia definition, and saw some scatter plots of points that all seem to very much want to be removed, and vaguely equidistant, from the origin (a “donut” distribution). So there is a high risk of gaps in my understanding in the answer below. Certainly I wonder what baseline prior you’d use for these results?

The approach to fixing the four symmetries (you don’t mention identity symmetry, or label switching, which is the fourth; maybe that’s because you’re handling very few dimensions) will, I suspect, be guided by the intent of the model.

In the case of PCA, I would nudge towards sparsity in the resulting matrix via a single prior centered on the origin with high density around it, thus strongly reducing the number of candidate results and doing away, in principle, with rotation and translation. There will remain multiple valid, roughly equally sparse, and entirely asymmetrical solutions, which can be addressed by specifying the model as a Dirichlet mixture of such independent solutions. The benefit of this approach is that it leads to solutions that are elegant to interpret, as the factors are almost guaranteed to be orthogonal.

Reflection symmetry can be resolved by fixing one point to always be strictly positive (across all candidate solutions and all axes). However, that point must be well chosen so that it is guaranteed to be far removed from all axes, which goes against the principle of a well-balanced orthogonal result, so there might be a need to apply this constraint to a variety of points, one for a different axis each. This, in addition, is also required to fix label switching (if one is bothered by it at all, as it is less likely to occur within a chain). When one is looking for exploratory “factor” solutions to a data set, solutions that lend themselves to critical examination, together with a measure of the relative likelihood of the candidate solutions, this approach delivers that. The identification of points to make strictly positive so as to fix the identity and rotation of each factor is, however, tremendously tedious, as is the selection of the number of solutions, and dimensions, to allow.
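(A rough sketch of that sparsity-plus-positivity identification for a plain probabilistic PCA, assuming PyMC; the Dirichlet-mixture layer over candidate solutions is omitted, and the choice of which variable anchors which factor is arbitrary here.)

```python
import numpy as np
import pymc as pm

n_obs, n_vars, n_factors = 200, 8, 2
Y = np.random.randn(n_obs, n_vars)  # stand-in data

with pm.Model() as ppca:
    cols = []
    for k in range(n_factors):
        # one (arbitrarily chosen) variable per factor, here variable k, loads
        # strictly positively, which fixes the sign (reflection) of that factor
        anchor = pm.math.stack([pm.HalfNormal(f"anchor_{k}", sigma=1.0)])
        # the remaining loadings get a sparsity-inducing prior centred on the origin
        rest = pm.Laplace(f"rest_{k}", mu=0.0, b=0.5, shape=n_vars - 1)
        cols.append(pm.math.concatenate([rest[:k], anchor, rest[k:]]))
    W = pm.Deterministic("W", pm.math.stack(cols, axis=1))  # (n_vars, n_factors)

    Z = pm.Normal("Z", mu=0.0, sigma=1.0, shape=(n_obs, n_factors))
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("Y_obs", mu=pm.math.dot(Z, W.T), sigma=sigma, observed=Y)
```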

However, from my (close to non-existent) understanding of bMDS, you would appear to be seeking everything but sparsity in the result; really, you would want the opposite, in what I described as a donut prior, so it’s quite likely this wouldn’t work at all.

On the interventions you mention I can say this:

  • Don’t nudge your constraints (any zeroes) with priors: integrate them into the model definition (they’re ‘observed constraints’, in a way)

  • To solve rotation, you have to constrain as many points to each axis as there are dimensions - 1 (i.e. your solution works for 2 dimensions only)

  • Centering the prior as I would do for PCA, and forcing any one point to be the origin whilst making the effective center of the prior a random variable itself (is that what you’re doing?), seem like entirely equivalent approaches to me.

To summarise: your approach, with the addition of a positivity constraint, works for 2 dimensions. However, I’m unsettled by translation symmetry being a concern, as I would expect the center to have some constraints in any (if trivially) preferred solution. Perhaps only the distances between the points really matter, in which case I can see how fixing one point to the origin would reduce jitter compared to any softer constraints imposed on all points by means of a prior.
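(For what it’s worth, one standard way to write that constraint pattern for general D is a “lower-triangular with positive diagonal” coordinate block: point 0 pinned to the origin, and point k, for k = 1 … D−1, confined to the span of the first k axes with its k-th coordinate strictly positive. A sketch, assuming PyMC, with made-up names and scales:)

```python
import numpy as np
import pymc as pm

n_items, D = 12, 3

with pm.Model() as mds_identified:
    rows = [np.zeros(D)]                     # point 0: pinned to the origin (translation)
    for k in range(1, D):
        parts = []
        if k > 1:                            # coordinates 0 .. k-2 of point k are free
            parts.append(pm.Normal(f"free_{k}", mu=0.0, sigma=5.0, shape=k - 1))
        # coordinate k-1 is strictly positive, which fixes the sign (reflection) in that axis
        parts.append(pm.math.stack([pm.HalfNormal(f"pos_{k}", sigma=5.0)]))
        # coordinates k .. D-1 are pinned to zero, which pins down the rotations
        parts.append(np.zeros(D - k))
        rows.append(pm.math.concatenate(parts))
    # all remaining points are completely unconstrained
    rest = pm.Normal("rest", mu=0.0, sigma=5.0, shape=(n_items - D, D))
    X = pm.Deterministic("X", pm.math.concatenate(
        [pm.math.stack(rows, axis=0), rest], axis=0))
```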


Thank you @gBokiau for the detailed reply!

For background, the research question is “how can I embed objects in a low-D space, where all I know about the objects is how (pair-wise) dissimilar they are from each other”. In practice, the observations of pair-wise dissimilarity are Bernoulli trials where 1 is “subject thinks the two objects are the same” and 0 is “subject thinks the two objects are different”, and I have many trials for most pairs of objects. The model is basically obs[i,j,n] ~ Bernoulli(p[i,j]) for the nth (binary) observation of dissimilarity between objects i and j, where p[i,j] is some function (something that maps [0, Inf] to [0, 1]) of the distance between i and j in the low-D space.
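(To make that concrete, a minimal sketch of the likelihood, assuming PyMC and an exponential-decay mapping exp(−d) for the probability of a “same” judgement; the mapping, names, and stand-in data are my own assumptions, and the identification constraints discussed above are omitted.)

```python
import numpy as np
import pymc as pm

n_items, D, n_trials = 20, 2, 500

# stand-in data: trial n compared items i_idx[n] and j_idx[n];
# same[n] is 1 if the subject judged them the same, 0 otherwise
rng = np.random.default_rng(0)
i_idx = rng.integers(0, n_items, size=n_trials)
j_idx = (i_idx + rng.integers(1, n_items, size=n_trials)) % n_items  # ensure i != j
same = rng.integers(0, 2, size=n_trials)

with pm.Model() as bmds:
    X = pm.Normal("X", mu=0.0, sigma=5.0, shape=(n_items, D))    # latent coordinates
    diff = X[i_idx] - X[j_idx]
    d = pm.math.sqrt(pm.math.sum(diff ** 2, axis=1) + 1e-9)      # pairwise distances
    p_same = pm.math.exp(-d)                                     # maps [0, inf) -> (0, 1]
    pm.Bernoulli("obs", p=p_same, observed=same)
```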

Regular MDS tries to move all the objects around in the low-D space until the pair-wise estimated distances are all as self-consistent as possible (called minimizing the stress), and typically assumes deterministic, continuous estimates of dissimilarity. Bayesian MDS (which has no standard implementation) allows for improvements like handling really unbalanced observations (many more or fewer for some pairs i, j), priors on all distances or on specific distances based on other information, and of course accommodating noisy observations (of which a Bernoulli trial is certainly one example), and even a prior on dimensionality itself.

I am starting with 2 dimensions, so indeed only rotation symmetry needs to be handled for now. You said that I should use “observed constraints” in the model definition at one point, and then later said that centering the prior seemed entirely equivalent, so I’m not sure which you are recommending. Concretely, this would be either:

  • One specific point observed at (0, 0) for translation, another specific point at (0, whatever) for rotation.
    vs.
  • Prior for something (e.g. the center of mass) with mean (0,0) for translation, another prior for something else (e.g. the angle of the vector sum) with mean 0 for rotation.

Which do you think is better? And yes, only the distance between points (for all pairs of points) really matters in the end, and if needed I could always align the results from different chains before combining them.
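(For concreteness, a sketch of the first option, assuming PyMC; the item indices and priors are placeholders. Item 0 is “observed” at (0, 0) by simply not giving it free parameters, and item 1 gets a free y but a fixed x = 0. The second option would instead put soft priors or Potentials on summary statistics such as the centre of mass.)

```python
import numpy as np
import pymc as pm

n_items, D = 20, 2

with pm.Model() as bmds_anchored:
    y1 = pm.Normal("y1", mu=0.0, sigma=5.0)                  # free coordinate of item 1
    free = pm.Normal("free", mu=0.0, sigma=5.0, shape=(n_items - 2, D))
    X = pm.Deterministic("X", pm.math.concatenate([
        np.zeros((1, D)),                                     # item 0 pinned at (0, 0)
        pm.math.stack([0.0, y1]).reshape((1, D)),             # item 1 pinned to (0, y1)
        free,                                                 # everything else is free
    ], axis=0))
    # ... Bernoulli likelihood over pairwise distances as in the earlier sketch;
    # reflection is still unresolved at this point
```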

Lastly, do you have a recommendation for the positivity constraint that eliminates reflection symmetries? This is a hard constraint and I was worried about the infinite gradient that this would imply.

Glad to see I wasn’t absurdly far off then :slight_smile:

As regards your last question, halving whatever prior is set on the remaining points (half-normal, half-Cauchy, …) is the appropriate way to impose the constraint.
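(Concretely, a sketch in PyMC terms: the free coordinate of the on-axis anchor from the earlier sketch simply gets a half distribution.)

```python
import pymc as pm

with pm.Model() as bmds_halved:
    # item 1 sits at (0, y1) with y1 > 0: a HalfNormal replaces the Normal,
    # so the configuration can no longer be mirrored across the x-axis
    y1 = pm.HalfNormal("y1", sigma=5.0)
    # ... build X and the likelihood exactly as in the earlier sketches
```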

However! It has since occurred to me (indeed demonstrating how alien it feels to not care about the center) that you’ve only solved half of the reflective-symmetry problem if you’ve set one point at the origin, and another to (0, y). The solution is then still subject to mirroring of the x coordinates. Hence I came to the following solution, fully embracing the discomfort, which I now find more elegant overall.

You could set two (almost) random points at the arbitrary locations (1, 0) and (0, 1) respectively, make the scale of the solution an RV with a non-informative prior, and nudge the center of mass to be firmly below the line defined by y = 1 − x with some clever, and otherwise very uninformative, prior. In fact, I would specifically implement a Potential that is −inf if the center of mass has y > 1 − x, and 0 otherwise. You must then only worry that the center’s region of likelihood does not overlap with y = 1 − x, hence that it is sufficiently “triangular” with respect to the two points (hence the “almost” random). Extra caveat: if the distance between these two specific points happens to be very uncertain, your traces will make waves as the scale RV goes up and down.
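(A sketch of that construction, assuming PyMC; pm.Potential and pm.math.switch are the real mechanisms, while the names, priors, and the exact scale parameterisation are my guess at what was meant.)

```python
import numpy as np
import pymc as pm

n_items, D = 20, 2

with pm.Model() as bmds_triangular:
    # two (almost arbitrary) reference items pinned at (1, 0) and (0, 1);
    # the overall scale of the configuration becomes an explicit RV instead
    scale = pm.HalfNormal("scale", sigma=10.0)
    free = pm.Normal("free", mu=0.0, sigma=scale, shape=(n_items - 2, D))
    X = pm.Deterministic("X", pm.math.concatenate([
        np.array([[1.0, 0.0], [0.0, 1.0]]),
        free,
    ], axis=0))

    # hard constraint: the centre of mass must stay strictly below the line y = 1 - x
    center = pm.math.sum(X, axis=0) / n_items
    pm.Potential("below_line",
                 pm.math.switch(center[1] > 1.0 - center[0], -np.inf, 0.0))
    # ... Bernoulli likelihood over pairwise distances as before
```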

But the solution is otherwise truly unique.

However 2! If you intend to use batches, this doesn’t work at all, as you keep repeating two points in every batch, which can significantly bias the model. In that case the second approach you suggest seems eminently sensible. I only wonder how noisy something like the angle of the vector sum would be within-sample or from batch to batch. You’re entering experimental territory as far as I’m concerned, but I like it.

Try both?


Or rather, start with the first, whilst tracking some statistics that are likely sufficient to render the solution unique.
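(For example, such statistics could be computed from the posterior draws after sampling, assuming a tracked coordinate variable named "X" as in the sketches above; a chain that mixes across mirrored or rotated solutions will show jumpy, multimodal traces for them. The stand-in array below takes the place of the real posterior.)

```python
import numpy as np

# stand-in for posterior draws of the coordinate variable "X" from the sketches above,
# e.g. X_draws = idata.posterior["X"].values after pm.sample(); shape (chains, draws, n_items, 2)
X_draws = np.random.default_rng(0).normal(size=(4, 1000, 20, 2))

center = X_draws.mean(axis=2)                                 # per-draw centre of mass
angle = np.arctan2(X_draws[..., 1, 1], X_draws[..., 1, 0])    # per-draw angle of item 1

# a chain that flips between symmetric solutions shows up as large within-chain spread
for name, stat in {"center_x": center[..., 0],
                   "center_y": center[..., 1],
                   "angle": angle}.items():
    print(name, "within-chain std:", stat.std(axis=1))
```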