Constrain one variable to be greater than another

I have a model where there is this mixture-esque deterministic equation

(p * exp(-k1*d)) + ((1-p) * exp(-k2*d))

where p, k1, and k2 are stochastic variables and d is data. The samples being returned are seemingly sensible, it’s just there’s no chain convergence. For example, samples of chain 1 of k1 seem to be similar to samples of chain 2 of k2, where what you really want are chain 1 and chain 2 samples of k1 to be the same. Hope that makes sense?

I think the solution is to constrain k1<k2. Is this possible in PyMC3?

Or is it simpler to re-parameterise? Perhaps replacing k2 with k1+b where b is positive, to give

(p * exp(-k1*d)) + ((1-p) * exp(-(k1+b)*d))

Yes, you are seeing label switching in your trace. It happens a lot with mixture models.
One of the must-reads related to this topic is Michael Betancourt’s Stan case study Identifying Bayesian Mixture Models. I have a Gist here replicating all the experiment in PyMC3.

As an experiment, you can also switch the trace by hand. We have a discussion here in this post, with code here.

Something related is if you are sampling an Ordered and bounded vector (e.g., x1 < x2 ~ Uniform), current implementation seems to be biased (Sampling uniformly in a triangular support). But for continous RV it should be fine.

2 Likes

Awesome. Have implemented reparameterization and it seems to be working just fine :slight_smile:

1 Like

Lol, it was working fine on the basis of some posterior prediction. But then I saw this


The original question is still answered :slight_smile:

Just looks like it’s not going to be simple to make this model work

Note to future self
Yes, you can use a transformation (as pointed out by @junpenglao) with example here. Initially it looks a bit ARGH, but it’s pretty simple:

  • initially I had two variables k1 and k2 and wanted them to be ordered.
  • solution 1 was to reparameterize so that k2=k1+b where b>0. This seemed to work judging by some posterior predictive plots, but the chains were in fact all over the place.
  • so solution 2 was to use the ordered transform in the link above. It’s pretty simple… just add the Ordered class, then add something like , transform=Ordered(), testval=[0.1, 0.2] into the definition of the PyMC3 stochastic variable

More details at #2 on the repo

2 Likes

This is very interesting

It’s clear to me why solution #1 produces OK predictions but poor parameter estimates. Essentially there is too much prior density on low values of b (perhaps the prior was a half-normal or something). When b is close to 0, the value of k1 needs to be close to what k2 should be.

Could someone provide an intuitive explanation for why solution #2 works better?

1 Like

Yes, that makes sense in fact. Yep, it was a HalfNormal on b, and looking at the traces, if I had the sigma higher then it could have worked and perhaps masked the more fundamental problem?

If you ran it with a uniform prior on b, say b ~ N (0, 100), I imagine you would get better mixing. I would like to understand intuitively why solution #2 works, but I don’t currently.

Likewise. Can anyone else help us with the intuition?

IMO the proper explanation is in Betancourt’s case study http://mc-stan.org/users/documentation/case-studies/identifying_mixture_models.html
Especially these two paragraphs:

This identification also has a welcome geometric interpretation. The region of parameter space satisfying a given ordering constraint, such as α_1≤ … ≤ α_K, defines a square pyramid with the apex point at zero. The K-dimensional parameter space neatly decomposes into K! of these pyramids, each with a distinct ordering and hence association with a unique labeling.

When the priors are exchangeable the mixture posterior aliases across each of these pyramids in parameter space: if we were given the mixture posterior restricted to one of these pyramids then we could reconstruct the entire mixture distribution by simply rotating that restricted distribution into each of the other K!−1 pyramids. As we do this we also map the mode in the restricted distribution into each pyramid, creating exactly the expected K! multimodality. Moreover, those rotations are exactly given by permuting the parameter indices and reordering the corresponding parameter values.

You can loosely think of it as cutting out some of the regions in your model parameter space so the sampler dont wonders around and becomes more efficient.

1 Like

Thanks!

So is this generally true of any situation where you have two or more free parameters that must be ordered in a certain way?

Yes. In some occasion where you have strong information (in your prior or likelihood), you might be able to get away with not using a ordered vector. But in general it helps with model convergent.

Bler. I think I’m going to need to sketch that out in order to follow that explanation. Unless someone wants to write a blog post with animated explanatory diagrams :wink:

@drbenvincent did you code up both versions of the model? It might be illuminating to see side-by-side comparisons of them applied to the same data.

No, I just updated and moved on. My primary goal at the moment is to get things working, but maybe this is something I can come back to in order to get a bit more insight