Trying to get a simple mixture model of MvNormal dists working. I have two components, with true mixture weights of 0.6 and 0.4 (in the model I use a symmetric Dirichlet as the prior on the weights).
Both components use MvNormal.dist() as the component distribution. Both use the same covariance matrix, which I fix at its true value (I'm not trying to learn the covariances). I model the Mu of each dist as a separate Gaussian process, both with zero mean and a squared-exponential covariance function with the same length scale (again, I plug in the actual covariance matrix).
I draw 10 Thetas from the mixture, and for each of the 10 I measure exhaustively over the input space (a noisy measurement at every input location). I give the likelihood of those measurements as a Normal dist, with mean equal to that Theta's value at the measured location and sd equal to the measurement noise (which is small in this toy problem).
I also draw an extra 3 Thetas that don't get any measurements, for a total of 13 Thetas (10 with exhaustive measurements, 3 without any).
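Roughly, the model structure is something like this (a simplified sketch, not the exact code from the notebook; the input locations, covariance values, noise level, and variable names here are made up for illustration):

```python
import numpy as np
import pymc3 as pm

X = np.array([0.0, 1.0, 2.0])                   # 3 input locations (1 spatial dimension)
d = X[:, None] - X[None, :]
K = np.exp(-0.5 * d ** 2) + 1e-6 * np.eye(3)    # squared-exp covariance, fixed at "true" values
sigma_noise = 0.05                              # small measurement noise
y = np.zeros((10, 3))                           # stand-in for the (10 x 3) exhaustive measurements

with pm.Model() as model:
    w = pm.Dirichlet("w", a=np.ones(2))                           # symmetric Dirichlet on the weights
    mu1 = pm.MvNormal("mu1", mu=np.zeros(3), cov=K, shape=3)      # GP prior on component-1 mean
    mu2 = pm.MvNormal("mu2", mu=np.zeros(3), cov=K, shape=3)      # GP prior on component-2 mean
    comp_dists = [pm.MvNormal.dist(mu=mu1, cov=K, shape=3),
                  pm.MvNormal.dist(mu=mu2, cov=K, shape=3)]

    thetas = []
    for i in range(13):                                           # one mixture draw per Theta
        thetas.append(pm.Mixture("theta_%d" % i, w=w,
                                 comp_dists=comp_dists, shape=3))
    for i in range(10):                                           # likelihood only for the measured Thetas
        pm.Normal("obs_%d" % i, mu=thetas[i], sigma=sigma_noise,
                  observed=y[i])
```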
The main issue is that for those extra Thetas (without measurements), it's almost as if the sampler has decided to assign each of them almost exclusively to one of the mixture components! Why would it do that? In the notebook linked above, I plot the last 1000 samples from the trace for all three of them, and each one almost exclusively shows samples from just one of the mixture components, even though the sampler does recover the correct mixture weights on average. So why isn't the sampler producing 60% of the samples like one component and 40% like the other? That is the correct behavior, in my understanding of a mixture model.
Is there any way I can improve the model code for more efficiency / more vectorization? I'm guessing it's not possible with the way I have my model set up, but I was curious whether there's more I could do to make it better.
(minor thing) Even though the average weight values from the sampler are correct, they vary quite a lot over the samples. I guess this is because there are “only” 10 total Thetas from the mixture, which seems like a decent amount to me, but maybe it needs hundreds to cut the variance down significantly?
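A quick back-of-the-envelope check on that (best case: pretend the component assignments of the 10 measured Thetas were known exactly, so the marginal posterior on the first weight under a flat Dirichlet(1, 1) prior is just a Beta):

```python
from scipy.stats import beta

# 6/4 split of 10 Thetas vs. a 60/40 split of 100 Thetas, flat prior
print(beta(1 + 6, 1 + 4).std())    # ~0.14 posterior sd on w1
print(beta(1 + 60, 1 + 40).std())  # ~0.05 posterior sd on w1
```

So even with the assignments known, 10 Thetas leaves an sd of roughly 0.14 on the first weight, and it does take on the order of a hundred Thetas to get that down to around 0.05.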
A brief answer (sorry, I don't have much time to dig deep into it):
Yes, this is normal when sampling from a mixture. There is no good solution in high dimensions, unfortunately. One possibility is to compute the posterior covariance matrix across all mixture components and use that as the mass matrix, because otherwise NUTS will just adapt to the local geometry and explore one local mode. But doing that you might see sampling problems like poor mixing, etc.
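Very roughly, passing the mass matrix would look something like this (untested, and the import path may differ between PyMC3 versions; `post_cov` is a stand-in for your estimate of the joint posterior covariance over all free parameters, e.g. from a pilot run):

```python
import numpy as np
import pymc3 as pm
from pymc3.step_methods.hmc.quadpotential import QuadPotentialFull

with pm.Model() as toy:                    # toy model, just to show the mechanics
    x = pm.Normal("x", 0.0, 1.0, shape=3)
    post_cov = np.eye(3)                   # stand-in for an estimated posterior covariance
    step = pm.NUTS(potential=QuadPotentialFull(post_cov))
    trace = pm.sample(1000, step=step, tune=1000)
```

I believe more recent PyMC3 versions can also adapt a dense mass matrix during tuning via pm.sample(init="adapt_full"), which is simpler if your version has it.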
Unfortunately you need to use a for loop, as we don't have good support for high-dimensional multivariate normals.
Thanks @junpenglao ! I understand you don't have time to dig in, but could you provide a bit more detail off the top of your head (without going deep into the math or the code)?
“Yes, this is normal when sampling from a mixture. There is no good solution in high dimensions, unfortunately.”
My input space in this toy example only has 3 locations (in one spatial dimension), so it's not really high-dimensional. Unless you mean the size of the model (the number of Thetas)? I don't understand why the sampler wouldn't have the trace for those “extra” Thetas (without measurements) be 60% one component and 40% the other, instead of what looks like 99.9% one component and 0.1% the other.
“NUTS will just adapt to the local geometry and explore one local mode”
This must be the core issue, but I don't understand what it means. Is there a different sampler in pymc3, other than NUTS, that can handle this? Or would you suggest I try a different package altogether (Stan?), although I'd really prefer to stay with pymc3 since I'm comfortable with it!
What I'm basically trying to show is that as you incorporate more and more measurements of the overall population of Thetas (i.e., as you gain more and more knowledge of the mixture weights and sub-population means), the uncertainty about the “next” (N+1) Theta with no measurements shrinks further and further, until it essentially reaches the limit of perfect knowledge of the mixture.