What is the mu that goes into NormalMixture? It does not seem to be defined in the model. Also, how many clusters are we talking about? Is it really 4000, or did you broadcast your weights into some shape compatible with the data, and that is why it is so big? I can't tell what is going on in your Dirichlet definition because I don't know the shape of logalpha, but for instance if you have 2000 data points and you want two clusters per point, with weights possibly depending on the data points, that should not be a major cause of slowdown. That would mean your weights have shape (2000, 2) (I think), which is what weights of shape (2,) would have been broadcast to anyway. Perhaps you can write down the shapes of your observed, mu and weights so we can get a better idea (or better yet, post complete code).
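For reference, here is a minimal sketch of the kind of shapes I mean, assuming a hypothetical 1D dataset of 2000 points and two components (all the names and sizes are just for illustration, not your actual model):

```python
import numpy as np
import pymc3 as pm

data = np.random.randn(2000)  # hypothetical 1D observations, shape (2000,)

with pm.Model() as model:
    # shared component weights, shape (2,)
    w = pm.Dirichlet("w", a=np.ones(2))
    # component means and scales, shape (2,)
    mu = pm.Normal("mu", mu=0.0, sigma=5.0, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1.0, shape=2)
    # the mixture broadcasts w and mu against the 2000 observations internally,
    # which is effectively the same as weights of shape (2000, 2)
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=data)
```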
In my experience, for instance, running about 200 data points with 9 clusters on a six-dimensional problem with ~3000 draws and tune can take up to an hour on a pretty decent computer. Is your problem 1D? I have not worked much with 1D problems, but maybe it is more tame there.
Also, normal mixtures are very prone to label switching and poor mixing, so you need to use tricks like ordering the component means (and, for higher-dimensional problems, possibly a choice of coordinates better suited to your problem); a rough sketch is below. See for instance:
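Roughly what I mean by ordering, again with a made-up two-component setup: the ordered transform plus a sorted testval is one common way to do it in pymc3.

```python
import numpy as np
import pymc3 as pm
import pymc3.distributions.transforms as tr

data = np.random.randn(2000)  # hypothetical observations again

with pm.Model() as ordered_model:
    w = pm.Dirichlet("w", a=np.ones(2))
    # the ordered transform forces mu[0] < mu[1], which breaks the
    # label-switching symmetry between the two components
    mu = pm.Normal("mu", mu=0.0, sigma=5.0, shape=2,
                   transform=tr.ordered, testval=np.array([-1.0, 1.0]))
    sigma = pm.HalfNormal("sigma", sigma=1.0, shape=2)
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=data)
```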
Also, to figure out whether the problem is your data not being compatible with the model, you can generate data using sample_prior_predictive, train the model on that generated data, and see if it runs faster. See:
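Roughly something like this, continuing the toy model from the first sketch above (shapes and names are just placeholders):

```python
with model:
    # draw synthetic datasets from the prior
    prior = pm.sample_prior_predictive(samples=50)

fake_data = prior["obs"][0]  # one dataset generated by the model itself

with pm.Model() as check_model:
    # same structure as before, but fit to the prior-generated data
    w = pm.Dirichlet("w", a=np.ones(2))
    mu = pm.Normal("mu", mu=0.0, sigma=5.0, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=1.0, shape=2)
    obs = pm.NormalMixture("obs", w=w, mu=mu, sigma=sigma, observed=fake_data)
    trace = pm.sample()
```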