Mixture of Linear Models

Actually, VI does not avoid label switching (it does not have that concept to begin with); if you don't restrict the likelihood/posterior geometry, VI will try to approximate the same multimodal space (for a simple demonstration, see https://github.com/junpenglao/All-that-likelihood-with-PyMC3/blob/master/Notebooks/Normal_mixture_logp.ipynb). Our current VI (i.e., mean-field and full-rank ADVI) does not do what you intend (i.e., approximate only a single mode); even when it sometimes displays such behaviour, that is by chance. For examples of the current VI having difficulty, see this notebook (at the very end): https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/WIP/[WIP]%20Bayesian%20GMM.ipynb
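
To make this concrete, here is a minimal PyMC3 sketch (not from the notebooks above; the data and hyperparameters are made up for illustration) of a two-component normal mixture with unordered component means, where the posterior has two symmetric, label-switched modes that mean-field ADVI tries to cover with a single unimodal approximation:

```python
import numpy as np
import pymc3 as pm

# Simulated data from a two-component normal mixture (made-up example).
np.random.seed(42)
y = np.concatenate([np.random.normal(-2.0, 1.0, 100),
                    np.random.normal(2.0, 1.0, 100)])

with pm.Model() as model:
    w = pm.Dirichlet("w", a=np.ones(2))
    # Unordered means: the posterior has two symmetric (label-switched) modes.
    mu = pm.Normal("mu", mu=0.0, sigma=5.0, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=2.0, shape=2)
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=y)

    # Mean-field ADVI fits a single Gaussian in the unconstrained space,
    # so it either smears mass between the modes or lands on one by chance.
    approx = pm.fit(n=30000, method="advi")
    trace = approx.sample(1000)
```

Plotting `trace["mu"]` typically shows the approximation sitting on one labeling, or in between the two, depending on the random initialization.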

My suggestion is to use sampling with a single chain and study the fit first, then take it from there. For a similar mixture-regression example, have a look at the Gaussian mixture of regressions notebooks: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse.ipynb and https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse_order.ipynb (the second one also uses VI, and you can see it is not doing a good job).
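
As a sketch of that suggestion (assuming a PyMC3 model along the lines of those notebooks; the data generation here is made up), a two-component mixture of linear regressions sampled with a single chain looks something like:

```python
import numpy as np
import pymc3 as pm

# Made-up data: y follows one of two regression lines, chosen at random.
np.random.seed(0)
x = np.random.uniform(-3.0, 3.0, 200)
z = np.random.binomial(1, 0.5, 200)
y = np.where(z == 1, 1.0 + 2.0 * x, -1.0 - 0.5 * x) \
    + np.random.normal(0.0, 0.5, 200)

with pm.Model() as mix_reg:
    w = pm.Dirichlet("w", a=np.ones(2))
    intercept = pm.Normal("intercept", mu=0.0, sigma=5.0, shape=2)
    slope = pm.Normal("slope", mu=0.0, sigma=5.0, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    # Each column of mu is one regression line evaluated at x;
    # the mixture is taken over the last axis.
    mu = intercept + slope * x[:, None]
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=y)

    # One chain: avoids stitching together label-switched modes from
    # different chains, so the trace is easier to diagnose.
    trace = pm.sample(1000, tune=1000, chains=1)
```

Check the trace and the posterior-predictive fit first; once a single chain looks sensible, you can worry about multiple chains and relabeling.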

Last note: VI on mixture models classically uses coordinate-ascent updates that exploit conjugacy (i.e., CAVI, updating one block of parameters at a time while holding the rest fixed), and my intuition is that this forces the VI approximation to converge toward a single mode. Mean-field and full-rank ADVI, which update all parameters at the same time during inference, do not achieve that.
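
For intuition about the difference, here is a minimal NumPy sketch of CAVI for a toy 1-D Gaussian mixture with known unit variance and fixed uniform weights (roughly the setup in Blei et al.'s 2017 VI review; the function name and all constants are made up for illustration). Each iteration updates one block conditionally on the current value of the other:

```python
import numpy as np

def cavi_gmm(y, K=2, prior_var=10.0, n_iter=100, seed=0):
    """Coordinate-ascent VI for a 1-D GMM with unit variance, uniform weights."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = rng.normal(0.0, 1.0, K)       # variational means of the component means
    s2 = np.ones(K)                   # variational variances of the component means
    phi = np.full((n, K), 1.0 / K)    # responsibilities
    for _ in range(n_iter):
        # Block 1: update responsibilities q(c_i), holding q(mu) fixed.
        logits = y[:, None] * m[None, :] - 0.5 * (m**2 + s2)[None, :]
        logits -= logits.max(axis=1, keepdims=True)
        phi = np.exp(logits)
        phi /= phi.sum(axis=1, keepdims=True)
        # Block 2: update q(mu_k), holding the responsibilities fixed.
        precision = 1.0 / prior_var + phi.sum(axis=0)
        m = (phi * y[:, None]).sum(axis=0) / precision
        s2 = 1.0 / precision
    return m, s2, phi

np.random.seed(1)
y = np.concatenate([np.random.normal(-2.0, 1.0, 100),
                    np.random.normal(2.0, 1.0, 100)])
m, s2, phi = cavi_gmm(y)
print(m)  # the fitted means settle on a single labeling of the components
```

Because each block update conditions on the other block's current value, the responsibilities and means tend to lock onto one labeling, which is the single-mode behaviour described above; ADVI's joint gradient updates have no such mechanism.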