Unsupervised clustering: estimating number & type of subgroups

adam · May 23, 2018, 3:52am

I’ve used Eric’s notebook to simulate a mixture of 2 Poisson distributions.

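import numpy as np

# lams (Poisson rates) and sizes (sample sizes) for the two groups are defined earlier in the notebook
grp1 = np.random.poisson(lam=lams[0], size=sizes[0])
grp2 = np.random.poisson(lam=lams[1], size=sizes[1])

# Pool both groups into one unlabeled sample
mixture = np.concatenate([grp1, grp2])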

Then I built the model and got pretty good posteriors for the two clusters.

But in real life, the number of subgroups in the data or the distribution type of each subgroup is often not obvious.

Should I assume that the subgroups/clusters would follow the same distribution as the observed data?

What are the best ways to identify or estimate the number of subgroups/clusters present and the type of distributions they follow?

junpenglao · May 23, 2018, 5:13am

Usually, you can model this with Dirichlet process mixtures: http://docs.pymc.io/notebooks/dp_mix.html. But, as with other mixture models, inference for these kinds of models is difficult and care must be taken.
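
To give a feel for what such a model does with the number of components, here is a minimal NumPy sketch of the truncated stick-breaking construction that is commonly used for Dirichlet process mixture weights. The truncation level `K`, the concentration `alpha` and the helper name `stick_breaking` are assumptions chosen for illustration, not values taken from the linked notebook.

```python
import numpy as np

def stick_breaking(betas):
    """Turn stick-breaking fractions into mixture weights.

    w_k = beta_k * prod_{j<k} (1 - beta_j)
    """
    betas = np.asarray(betas, dtype=float)
    # Length of stick still unbroken before each component takes its share
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(0)
K = 20        # truncation level (assumed upper bound on the number of components)
alpha = 1.0   # concentration (assumed; smaller values favour fewer large components)

betas = rng.beta(1.0, alpha, size=K)
weights = stick_breaking(betas)

# The weights sum to slightly less than 1; the shortfall is the mass
# left for components beyond the truncation level.
print(weights.round(3))
print("leftover mass:", 1.0 - weights.sum())
```

You only fix an upper bound `K`; components that the data do not support should end up with weights close to zero.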

As for the underlying distributions, I have no good answer either. Unless you have a strong theoretical motivation or lots of data, the results are probably indistinguishable across different distributions as long as they have a similar shape (e.g., using a Student t instead of a Normal).

adam · May 23, 2018, 12:45pm

@junpenglao I am slightly confused about the sentence above. Could you please clarify?

OK, so determining the number of clusters in a data set is sometimes really hard. But let’s say I make one good guess at the number of components in the Dirichlet process and the model converges with good autocorrelation.

Does that ALWAYS mean that the model found an accurate number of clusters, or could the model converge while still being far from the true number of clusters?

junpenglao · May 23, 2018, 1:25pm

What I meant is that, if you have some data that is a mixture of a Gaussian and a Gamma distribution, my hunch is that it would be quite difficult to distinguish it from a mixture of a Gaussian and a Lognormal distribution.
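
A rough way to see this for a single component: the sketch below draws from a Gamma distribution and from a Lognormal whose mean and variance are matched to it, then prints a few quantiles of each. The shape, scale and sample size are arbitrary assumptions; with a few hundred points the two sets of quantiles should typically look quite similar.

```python
import numpy as np

rng = np.random.default_rng(2)
shape, scale = 9.0, 1.0   # assumed Gamma parameters, for illustration only
n = 200                   # assumed sample size

gamma_sample = rng.gamma(shape, scale, size=n)

# Moment-match a Lognormal to the Gamma's mean and variance
mean = shape * scale
var = shape * scale**2
sigma2 = np.log1p(var / mean**2)      # variance of log(x)
mu = np.log(mean) - sigma2 / 2        # mean of log(x)
lognormal_sample = rng.lognormal(mu, np.sqrt(sigma2), size=n)

qs = [0.05, 0.25, 0.5, 0.75, 0.95]
print("gamma quantiles:    ", np.quantile(gamma_sample, qs).round(2))
print("lognormal quantiles:", np.quantile(lognormal_sample, qs).round(2))
```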

As for estimating the number of clusters, it depends on the inference you are using: if you are sampling, then the mean number of components gives an estimate of the expected number of clusters; if you are doing MLE or MAP, then it gives the most likely number of components (both cases conditioned on your model and data). It might not be very useful to ask whether the model found the true number of clusters, as in practice you might never know.
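
For the sampling case, one simple way to summarise the number of components is to count, in each posterior draw, how many mixture weights are non-negligible and then average over draws. The sketch below is only an illustration: the helper `n_active_components`, the 0.01 threshold and the Dirichlet draws standing in for a real trace are all assumptions.

```python
import numpy as np

def n_active_components(weight_draws, threshold=0.01):
    """Count, for each posterior draw, the components whose weight exceeds threshold."""
    weight_draws = np.asarray(weight_draws, dtype=float)
    return (weight_draws > threshold).sum(axis=1)

# Stand-in for posterior samples of the mixture weights, shape (n_draws, K).
# In practice these would come from the trace of the fitted model.
rng = np.random.default_rng(1)
fake_draws = rng.dirichlet([20.0, 15.0, 0.1, 0.1, 0.1], size=2000)

counts = n_active_components(fake_draws)
print("posterior mean number of components:", counts.mean())

values, freq = np.unique(counts, return_counts=True)
print("most frequent number of components:", values[freq.argmax()])
```

The answer depends on the threshold, so it is worth checking a few values before reading much into the result.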
