I’m attaching the data used, to clarify why I said the samples aren’t too bad:
The posterior weights indicate only two significant components, and the posterior means of those components would be in the right place if they weren’t swapped.
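For anyone who wants to check the same two things on their own draws, here is a minimal sketch, assuming the posterior samples are available as NumPy arrays of shape (draws, components). The function names, the 0.05 weight threshold, and the numbers are made up for illustration and are not the attached data. Sorting by mean is only one common way to undo swapped labels, and it assumes the component means are well separated.

```python
import numpy as np


def relabel_by_mean(weights, means):
    """Reorder components within each draw so the means are ascending.

    This is a simple ordering constraint applied after sampling; it is an
    assumption that it suits the model, not a general fix for swapped labels.
    """
    order = np.argsort(means, axis=1)
    return (
        np.take_along_axis(weights, order, axis=1),
        np.take_along_axis(means, order, axis=1),
    )


def significant_components(weights, threshold=0.05):
    """Indices of components whose average posterior weight exceeds threshold.

    The 0.05 default is an arbitrary illustrative cutoff.
    """
    return np.flatnonzero(weights.mean(axis=0) > threshold)


# Hypothetical draws: three components, two of which carry almost all the
# weight, with the two large components swapped between the draws.
weights = np.array([[0.60, 0.39, 0.01],
                    [0.38, 0.61, 0.01]])
means = np.array([[5.0, -2.0, 0.3],
                  [-2.1, 5.1, 0.1]])

relabeled_weights, relabeled_means = relabel_by_mean(weights, means)
print(significant_components(relabeled_weights))
print(relabeled_means.mean(axis=0))
```

After relabeling, averaging the means over draws is meaningful again, because each column refers to the same component in every draw.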