Why does MAP vastly outperform sample in Bayesian clustering?

Typically, if you are getting better results with a “simpler” method (MAP, Metropolis, etc.), your model has some kind of misspecification. One of the best and most frustrating things about NUTS is that it fails very loudly.

I’m a beginner when it comes to mixture models, but I spent some time playing with your model. As written, you end up with nonsensical results in the non-sorted dimension. After some consultation with @ricardoV94, I think it’s really important that you use a mixture of Multivariate Normals rather than a mixture of Normals. My intuition (and likely yours) was that these are equivalent, but in the context of a mixture model they’re not. A mixture of Normals is like a mixture of 10 exchangeable components, whereas a mixture of 5 Multivariate Normals encodes an unbreakable connection between each (x, y) pair.
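
To make that concrete, here’s a toy sketch (separate from the model below, using made-up cluster means and weights) of how draws from the two kinds of mixture differ:

import numpy as np
import pymc as pm

# Three toy clusters along the diagonal, with equal weights
mus = np.array([[-2., -2.], [0., 0.], [2., 2.]])
weights = np.ones(3) / 3

# Univariate components: the cluster indicator is effectively drawn per scalar
# entry, so a point's x and y can come from different clusters and the draws
# land on all 9 combinations of the coordinate means
uni = pm.Mixture.dist(w=weights, comp_dists=[pm.Normal.dist(mu=m, sigma=0.1) for m in mus])

# Multivariate components: one indicator per (x, y) pair, so the draws land
# only on the 3 true clusters
mv = pm.Mixture.dist(w=weights, comp_dists=[pm.MvNormal.dist(mu=m, cov=0.01 * np.eye(2)) for m in mus])

uni_draws = pm.draw(uni, draws=1000, random_seed=1)  # shape (1000, 2)
mv_draws = pm.draw(mv, draws=1000, random_seed=1)    # shape (1000, 2)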

After that change, things got much more sensible. To get it across the finish line I had to tinker with the priors and the sampling options. I needed quite tight sigma priors to make sure the distributions didn’t “bleed” into each other. I also got better results with the PyMC NUTS implementation than with the JAX samplers (a comparison call is sketched below the results), but I don’t have a good explanation for that. My hypothesis is that the uni-modal nature of the ADVI initialization (which the JAX samplers don’t use) pushed the mixture weights towards 1/n, which ends up being the right answer here. You could also experiment with setting the Dirichlet concentration parameter higher. Without the ADVI initialization, the sampler often found sparse solutions, which we really don’t want (in this problem, anyway).

Here’s the data generation:

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

n_clusters = 5
data, labels = make_blobs(n_samples=1000, centers=n_clusters, random_state=10)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
plt.scatter(*scaled_data.T, c=labels)

[Scatter plot of the scaled data, colored by the true cluster labels]

And the model:

coords={"cluster": np.arange(n_clusters),
        "obs_id": np.arange(data.shape[0]),
        "coord":['x', 'y']}

# When you use the ordered transform, the initial values need to be 
# monotonically increasing
sorted_initvals = np.linspace(-2, 2, 5)

with pm.Model(coords=coords) as model:
    # Use alpha > 1 to prevent the model from finding sparse solutions -- we know
    # all 5 clusters should be represented in the posterior
    w = pm.Dirichlet("w", np.full(n_clusters, 10), dims=['cluster'])
    
    # Mean component
    x_coord = pm.Normal("x_coord", sigma=1, dims=["cluster"], 
                        transform=pm.distributions.transforms.ordered,
                        initval=sorted_initvals)
    y_coord = pm.Normal('y_coord', sigma=1, dims=['cluster'])
    centroids = pm.Deterministic('centroids', pt.concatenate([x_coord[None], y_coord[None]]).T,
                                dims=['cluster', 'coord'])
    
    # Diagonal covariances. Could also model the full covariance matrix, but I
    # didn't try (an LKJ-based sketch is shown further down).
    sigma = pm.HalfNormal('sigma', sigma=1, dims=['cluster', 'coord'])
    covs = [pt.diag(sigma[i]) for i in range(n_clusters)]
    
    # Define the mixture
    components = [pm.MvNormal.dist(mu=centroids[i], cov=covs[i]) for i in range(n_clusters)]
    y_hat = pm.Mixture("y_hat",
                       w,
                       components,
                       observed=scaled_data,
                       dims=["obs_id", 'coord'])

    idata = pm.sample(init='advi+adapt_diag')

Results:

[Result plots]
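
Two variations you could also try, shown here only as hedged sketches (I didn’t tune either, and names like idata_jax, full_cov_model, and chols are just illustrative): swapping in the JAX-based NUTS for comparison, and giving each cluster a full covariance matrix via an LKJ prior.

# JAX-based NUTS for comparison. Assumes numpyro is installed; recent PyMC
# versions expose this through the `nuts_sampler` argument. Note that the
# external samplers don't go through the ADVI initialization.
with model:
    idata_jax = pm.sample(nuts_sampler="numpyro")

# Same model as above, but with a full covariance per cluster
with pm.Model(coords=coords) as full_cov_model:
    w = pm.Dirichlet("w", np.full(n_clusters, 10), dims=['cluster'])

    x_coord = pm.Normal("x_coord", sigma=1, dims=["cluster"],
                        transform=pm.distributions.transforms.ordered,
                        initval=sorted_initvals)
    y_coord = pm.Normal('y_coord', sigma=1, dims=['cluster'])
    centroids = pm.Deterministic('centroids', pt.concatenate([x_coord[None], y_coord[None]]).T,
                                 dims=['cluster', 'coord'])

    # One LKJ prior per cluster; `chol` is the Cholesky factor of that cluster's covariance
    chols = []
    for i in range(n_clusters):
        chol, corr, stds = pm.LKJCholeskyCov(f"chol_{i}", n=2, eta=2.0,
                                             sd_dist=pm.HalfNormal.dist(1.0))
        chols.append(chol)

    components = [pm.MvNormal.dist(mu=centroids[i], chol=chols[i]) for i in range(n_clusters)]
    y_hat = pm.Mixture("y_hat", w, components, observed=scaled_data, dims=["obs_id", 'coord'])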

If you’re interested in the probability of each point, you can use a predictive model to go back and compute that:

with pm.Model(coords=coords) as likelihood_model:
    # Flat "dummy" priors: these only need to match the names and dims of the
    # variables in the fitted model, so sample_posterior_predictive pulls their
    # values from idata instead of sampling them
    w = pm.Flat("w", dims=['cluster'])
    
    x_coord = pm.Flat("x_coord", dims=["cluster"])
    y_coord = pm.Flat('y_coord', dims=['cluster'])
    centroids = pm.Deterministic('centroids', pt.concatenate([x_coord[None], y_coord[None]]).T,
                                dims=['cluster', 'coord'])
    
    sigma = pm.Flat('sigma', dims=['cluster', 'coord'])
    covs = [pt.diag(sigma[i]) for i in range(n_clusters)]
    
    # Evaluate each component's log-density at the observed data; the argmax
    # over clusters (below) picks the most likely component for each point
    components = [pm.MvNormal(f'component_{i}', mu=centroids[i], cov=covs[i]) for i in range(n_clusters)]
    p_component = pm.Deterministic('p_component',
                                   pt.concatenate([pm.logp(d, scaled_data)[:, None] for d in components], axis=-1),
                                   dims=['obs_id', 'cluster'])
    label_hat = pm.Deterministic('label_hat', pt.argmax(p_component, axis=-1), dims=['obs_id'])
    idata_pred = pm.sample_posterior_predictive(idata, var_names=['p_component', 'label_hat'])
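
And a small follow-up sketch for collapsing that into one hard label per point, via a majority vote across posterior draws (the summary choice and the variable names are just illustrative). One caveat: p_component is the unweighted component log-density, so if the mixture weights were far from uniform you’d want to add pt.log(w) before taking the argmax.

from scipy.stats import mode

# Majority vote over chains and draws for each observation
label_draws = (idata_pred.posterior_predictive["label_hat"]
               .stack(sample=("chain", "draw"))
               .transpose("obs_id", "sample")
               .values)
hard_labels = mode(label_draws, axis=1).mode.squeeze()

# The cluster indices are only identified through the ordering on x_coord, so
# they won't match the make_blobs labels directly -- compare visually instead
plt.scatter(*scaled_data.T, c=hard_labels)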