Unsupervised clustering: estimating number & type of subgroups

I’ve used Eric’s notebook to simulate a mixture of 2 Poisson distributions.

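import numpy as np

# lams holds the two Poisson rates, sizes the two group sizes
grp1 = np.random.poisson(lam=lams[0], size=sizes[0])
grp2 = np.random.poisson(lam=lams[1], size=sizes[1])

# Pool both groups into a single unlabeled sample
mixture = np.concatenate([grp1, grp2])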

Then I built the model and got pretty good posteriors for the two clusters.

But in real life, the number of subgroups in the data or the distribution type of each subgroup is often not obvious.

Should I assume that the subgroups/clusters would follow the same distribution as the observed data?

What are the best ways to identify or estimate the number of subgroups/clusters present and the type of distributions they follow?

Usually, you can model this with Dirichlet process mixtures: http://docs.pymc.io/notebooks/dp_mix.html. But as with other mixture models, inference for these kinds of models is difficult and care must be taken.
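
To give a feel for why a Dirichlet process prior lets the data decide how many components matter, here is a minimal sketch of the truncated stick-breaking construction that such models typically rely on. It only draws weights from the prior; it is not a fitted model. The helper name `stick_breaking_weights`, the concentration `alpha=1.0` and the truncation at 20 components are my own illustrative choices.

```python
import numpy as np

def stick_breaking_weights(alpha, n_components, rng):
    """Draw mixture weights from a truncated stick-breaking prior."""
    # Each beta_k is the fraction broken off whatever stick is left.
    betas = rng.beta(1.0, alpha, size=n_components)
    # Stick length remaining before break k: prod_{j<k} (1 - beta_j).
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(alpha=1.0, n_components=20, rng=rng)

print(weights.round(3))
print("mass left out by truncation:", 1.0 - weights.sum())
```

With a small `alpha`, most of the mass tends to land on the first few components, so the remaining ones are effectively unused. A larger `alpha` spreads the mass over more components.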

As for the underlying distributions, I have no good answer for that either. Unless you have a strong theoretical motivation or lots of data, the results will probably be indistinguishable across different distributions as long as they have a similar shape (e.g., using a Student t instead of a Normal).

@junpenglao I am slightly confused about the sentence above. Could you please clarify?

OK, so determining the number of clusters in a data set is sometimes really hard. But let’s say I make one good guess at the number of components in the Dirichlet process and the model converges with low autocorrelation.

Does that ALWAYS mean that the model found an accurate number of clusters, or could the model converge while still being far from the true number of clusters?

What I meant is that if your data is a mixture of a Gaussian and a Gamma distribution, my hunch is that it would be quite difficult to distinguish it from a mixture of a Gaussian and a Lognormal distribution.
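
As a rough illustration of that hunch, the sketch below compares a Gamma distribution with a Lognormal matched to the same mean and variance and looks at how far apart their quantiles are. The Gamma parameters (`shape=20`, `scale=1`) are an arbitrary choice of mine; with a smaller shape (more skew) the two families differ more, so treat this as one example and not a general result.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Arbitrary illustrative Gamma parameters (an assumption, not from the thread).
shape, scale = 20.0, 1.0
mean, var = shape * scale, shape * scale**2

# Lognormal parameters that match the Gamma's mean and variance.
sigma2 = np.log(1.0 + var / mean**2)
mu = np.log(mean) - 0.5 * sigma2

gamma_draws = rng.gamma(shape, scale, size=n)
lognormal_draws = rng.lognormal(mu, np.sqrt(sigma2), size=n)

# Compare a few quantiles of the two samples.
probs = [0.05, 0.25, 0.5, 0.75, 0.95]
q_gamma = np.quantile(gamma_draws, probs)
q_lognormal = np.quantile(lognormal_draws, probs)
rel_diff = np.abs(q_gamma - q_lognormal) / q_gamma

for p, g, l, d in zip(probs, q_gamma, q_lognormal, rel_diff):
    print(f"q{p:.2f}: gamma={g:.2f} lognormal={l:.2f} rel. diff={d:.3f}")
```

When the quantiles are this close, a modest data set gives the model very little to go on when choosing between the two families, especially once a second mixture component is added on top.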

As for estimating the number of clusters, it depends on the inference you are using. If you are sampling, the mean number of components gives an estimate of the expected number of clusters; if you are doing MLE or MAP, it gives the most likely number of components (in both cases conditioned on your model and data). It might not be very useful to ask whether the model found the true number of clusters, since in practice you might never know.
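
For the sampling case, one possible way to turn posterior draws of the mixture weights into a number of components is to count, per draw, how many weights exceed a small threshold and then average. In the sketch below the array `w_draws` is a synthetic stand-in generated from a Dirichlet distribution, not output from a real model, and the `0.05` threshold is an arbitrary assumption that you would need to choose for your own problem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for posterior weight draws, shape (n_draws, n_components).
# In practice this would come from your sampler's trace.
w_draws = rng.dirichlet([50.0, 30.0, 0.5, 0.5, 0.5], size=2000)

# A component counts as "occupied" in a draw if its weight exceeds the threshold.
threshold = 0.05  # arbitrary cut-off (assumption)
k_per_draw = (w_draws > threshold).sum(axis=1)

# Posterior mean of the number of occupied components.
expected_k = k_per_draw.mean()

# Most frequent count across draws (the mode of the sampled counts).
values, counts = np.unique(k_per_draw, return_counts=True)
most_common_k = values[counts.argmax()]

print("expected number of components:", expected_k)
print("most common number of components:", most_common_k)
```

Both summaries are conditional on the model, the data and the threshold, so they describe what the posterior supports, not the true number of clusters.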
