Ha, funny I am also working on a case study with similar setup.
rhat is high usually because of label switching. In 1D it is relatively easier to constrain the mu
to be ordered, however, in 2D+ this is impossible
For now, my suggestion is that run single chain, and check convergence for each of them, and then assign label by hand. In my experience usually NUTS will explore only one mode (i.e., usually no label switching in a single chain), but across chain the label could be different. Combine this with some relabelling technique it should give you something sensible.
But a proper treatment you need to reparameterized it. To the best of my knowledge, Gaussian mixture model parameterization is best explained in https://arxiv.org/pdf/1601.01178.pdf which I am currently working on You should check out the blog post by Christian Roberts on this https://xianblog.wordpress.com/2016/01/11/mixtures-are-slices-of-an-orange/. His other blog posts on mixtures are also great https://xianblog.wordpress.com/?s=mixture
FYI, you can use pm.Mixture
for multinormal distribution as well now.