Rejecting/thinning modes based on average density


#1

If I have two well-separated modes, they are essentially mutually exclusive and exhaustive. In this case, at least naively, I can approximate each mode's probability mass by averaging the density over the samples in that mode.

Is it a legitimate technique to reject chains stuck in a less probable mode based on the relative mass approximated as above, e.g. if the mass ratio exceeds the sample size?

Further, is it legitimate to thin out the chains based on the mode mass ratio approximated as above?
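To make the proposal concrete, here is a NumPy sketch of what I have in mind (toy 1-D mixture, illustrative numbers and helper names of my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 0.9/0.1 Gaussian mixture with modes at 0 and 10;
# "density" plays the role of exp(logp) evaluated at the samples.
def density(x):
    return (0.9 * np.exp(-0.5 * x**2)
            + 0.1 * np.exp(-0.5 * (x - 10.0)**2)) / np.sqrt(2 * np.pi)

chain_a = rng.normal(0.0, 1.0, size=1000)    # chain stuck in the big mode
chain_b = rng.normal(10.0, 1.0, size=1000)   # chain stuck in the small mode

# Proposed mode-mass proxy: average density over each chain's samples.
mass_a = density(chain_a).mean()
mass_b = density(chain_b).mean()

ratio = mass_a / mass_b          # ~9, matching the 0.9/0.1 weights here
# Rejection rule from the question: drop the minor chain only if the
# ratio exceeds the sample size.
reject_b = ratio > len(chain_b)  # False in this toy case
```

Note the two modes here have the same shape, which is exactly the case where averaging the density happens to work.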


#2

I don't think that is an appropriate way to do it. If different chains are completely stuck in different modes, how can you be sure the density is estimated correctly? The estimate would be biased to begin with.


#3

Just to make sure we are on the same page: by density I mean exp(logp()) evaluated at the sample points.


#4

So you are trying to scale the marginal posterior (e.g., a smoothed histogram) by exp(logp())? Beyond some simple cases, I don't think this approach is sound.
Just take a simple example:

import numpy as np
import pymc3 as pm

with pm.Model():
    # Symmetric 50/50 mixture of two unit-variance normals at 0 and 5
    pm.NormalMixture('m',
        w=np.array([.5, .5]),
        mu=np.array([0., 5.]),
        sd=np.array([1., 1.]))
    trace = pm.sample(1000)
pm.traceplot(trace);

As an example, take the yellow chain and the green chain: the logp(x) at each point would be the same even though the marginal densities have different shapes. And the situation only gets worse in higher dimensions.
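The failure mode is easiest to see when the modes have equal mass but different widths. A self-contained sketch (my own toy mixture, not the model above): the average-density proxy reports the height of a mode, not its mass:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Equal-mass mixture: a wide mode at 0 (sd=1), a narrow mode at 10 (sd=0.1).
def density(x):
    return (0.5 * stats.norm.pdf(x, 0.0, 1.0)
            + 0.5 * stats.norm.pdf(x, 10.0, 0.1))

wide = rng.normal(0.0, 1.0, size=5000)     # chain stuck in the wide mode
narrow = rng.normal(10.0, 0.1, size=5000)  # chain stuck in the narrow mode

ratio = density(narrow).mean() / density(wide).mean()
# True mass ratio is 1, but the average-density proxy says the narrow
# mode is roughly 10x "heavier" -- it measures peak height, not mass.
```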


#5

You are right.
Also, I just checked my maths: what I proposed is not the right way to do it.


#6

But it seems that I could get what I want if I could estimate the volume of each mode's support, he-he-he…
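For roughly Gaussian modes, one cheap volume estimate is a per-mode Laplace-style correction: peak density times the Gaussian volume implied by the within-chain spread. A 1-D sketch (toy mixture and helper names of my own, continuing the wide/narrow example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Equal-mass mixture: wide mode at 0 (sd=1), narrow mode at 10 (sd=0.1).
def density(x):
    return (0.5 * stats.norm.pdf(x, 0.0, 1.0)
            + 0.5 * stats.norm.pdf(x, 10.0, 0.1))

wide = rng.normal(0.0, 1.0, size=5000)
narrow = rng.normal(10.0, 0.1, size=5000)

def laplace_mass(chain):
    # Peak height times Gaussian volume, with location and scale
    # estimated from the chain itself.
    mu, sd = chain.mean(), chain.std()
    return density(mu) * np.sqrt(2 * np.pi) * sd

ratio = laplace_mass(narrow) / laplace_mass(wide)
# With the volume correction the ratio is close to the true value 1.
```

In d dimensions the volume factor becomes (2*pi)**(d/2) * sqrt(det(cov)), which of course degrades as the modes get less Gaussian.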


#7

I guess it depends on what you plan to do. If you only care about minimizing prediction error, focusing on a single posterior mode does not necessarily give you bad performance (think of the MLE or a Laplace approximation: you essentially ignore all the smaller modes and use what is hopefully the global maximum).


#8

First and foremost I want to compare mode masses. I have chains sampling in regions around two different solutions: do they have comparable probability, or is one of them just a small bump compared to the other that we can safely ignore?

BTW, I now use 'advi+adapt_diag' for init. What do you think: could disabling the adapt_diag part help with switching between modes?


#9

Maybe you can try SMC; it might give a better weighting with lots of chains.

ADVI tends to underestimate the variance, so if anything you should try the default jitter+adapt_diag. Ideally you want the energy proposal to be large enough to mix across the different modes. It might also be interesting to try adapt_diag_grad as the initialization.
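For reference, both suggestions are one-liners in PyMC3 (a sketch: assumes a recent PyMC3 version where pm.sample_smc is available, and a model context like the mixture example above):

```python
import pymc3 as pm

with pm.Model() as model:
    import numpy as np
    pm.NormalMixture('m', w=np.array([.5, .5]),
                     mu=np.array([0., 5.]), sd=np.array([1., 1.]))

    # SMC: many weighted particles, better at populating separated modes.
    trace_smc = pm.sample_smc(2000)

    # Plain NUTS with the default init, avoiding ADVI's variance shrinkage.
    trace_nuts = pm.sample(1000, init='jitter+adapt_diag')
```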