Ok funny thing, sampling with MV works fine when I use pymc 5.0 (though some of the other suggestions in the links dont make much of a difference, though havent digged too deep). I am going to try to pinpoint what is the latest version that works fine and file a report. While I do that, any ideas why MvNormal with diag covariance and a set of independent normals (which are mathematically equivalent) would generate different results? So when I test your code in pymc 5.0
1- Using the same code as above both sampling with only pm.sample() still results in very multimodal results for both x and y components. So advi does play a role in getting those nice results and I think it was mentioned else where somewhere in this forum that advi might be more appropriate for mixture models.
2- sampling with pm.sample(init=‘advi+adapt_diag’) does result in the unimodal and better results that you have shown. So it was only natural to try pm.sample(init=‘advi+adapt_diag’) with a set of independent univariate normals however the result I get is again as before quite multimodal with a lot of divergences. Setting target_accept=0.99 gets somewhat better results similar to advi but still not as good. But still this suggests me that most of the improvement is coming from advi however I am really suprised that there is really any difference between a MV normal with diag correlation and a set of independent normals. I will investigate this more.
3- sampling with advi and univariate normals but with a single sigma parameter:
sigma = pm.InverseGamma('sigma', alpha=1, beta=2)
and target_accept=0.95 gives almost as good results for x components as MvNormal and the original code (y components still multimodal).