Continuing on from here:
where I asked about mixtures of normals, thanks to @jessegrabowski and @ricardoV94 I have come to realize that, due to how mixtures are sampled, an MvNormal with a diagonal covariance is not identical to a bunch of univariate normals inside a mixture model and can actually sample more efficiently (somewhat, if not completely, remedying label switching).
However, as it turns out, MvNormal is not applicable in the setting where I want to use the mixture model. Following on from @ricardoV94’s question, I wanted to open a separate topic for this. A summary of how the data looks in my case is as follows:
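For reference, the two kinds of mixture components I am contrasting are roughly these (just a sketch of the component construction with placeholder centers, not a full model):

```python
import numpy as np
import pymc as pm

ndims, nclusters = 7, 5
mus = np.zeros((nclusters, ndims))  # placeholder cluster centers
sigma = 1.0
w = np.ones(nclusters) / nclusters

# one MvNormal per cluster with a diagonal covariance: each observation (row)
# is assigned to a single component as a whole
mv_components = [pm.MvNormal.dist(mu=mus[i], cov=sigma**2 * np.eye(ndims))
                 for i in range(nclusters)]
mv_mix = pm.Mixture.dist(w=w, comp_dists=mv_components)

# "a bunch of univariate normals" per cluster: same density for independent
# coordinates, but (as I understand it) the mixture can then mix element-wise
uv_components = [pm.Normal.dist(mu=mus[i], sigma=sigma, shape=ndims)
                 for i in range(nclusters)]
uv_mix = pm.Mixture.dist(w=w, comp_dists=uv_components)
```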
1- Assume we are given a cluster’s center coordinates, an n-dimensional vector with values between 0 and 10.
2- Any data point i around this center is given as center + bias_i + (iid noise on each coordinate)_i. Moreover, there is thresholding: any value below 1 is set to 1.
Here is how I would generate it:
```python
import numpy as np
import pymc as pm
import pytensor.tensor as ptt
import matplotlib.pyplot as plt


def plot_clusters(centers, data, max_val):
    nclusters = centers.shape[0]
    fig, ax = plt.subplots(1, nclusters, figsize=(5 * nclusters, 5))
    for i in range(nclusters):
        ax[i].plot(centers[i, :], color="red", zorder=1)
        ax[i].plot(data[i].T, color="black", alpha=0.3, zorder=0)
        ax[i].set_ylim(0, max_val + 3)
        ax[i].set_yticks(range(0, max_val + 3))
        ax[i].grid("on", alpha=0.3)


ndims = 7
nclusters = 5
max_val = 8
lower = 1
bias_sd = 1
noise_sd = 0.5
upper = np.inf
seed = 0

rng = np.random.default_rng(seed)
ndata_per_cluster = rng.integers(20, 50, size=nclusters)
ndata = int(np.sum(ndata_per_cluster))
cluster_centers = rng.random((nclusters, ndims)) * max_val

data = []
for i in range(nclusters):
    # per-observation bias shared across coordinates, plus iid noise per coordinate
    biases = rng.normal(0, bias_sd, (ndata_per_cluster[i],))[:, None]
    noise = rng.normal(0, noise_sd, (ndata_per_cluster[i], ndims))
    cluster_data = cluster_centers[i:i+1, :] + biases + noise
    # thresholding: anything below `lower` is set to `lower`
    cluster_data[cluster_data < lower] = lower
    data.append(cluster_data)

plot_clusters(cluster_centers, data, max_val)
data = np.concatenate(data)
```
So the model that I initially tried was as follows (assuming I get init and center_mus from k-means or something similar, but for now I will use the true centers):
```python
# normally init would come from spectral or k-means clustering with some
# bias correction on the data
cluster_centers = cluster_centers[np.argsort(cluster_centers[:, 0]), :]
cluster_centers[cluster_centers < lower] = lower
init = [cluster_centers[:, 0], cluster_centers[:, 1:]]
center_mus = [cluster_centers[:, 0], cluster_centers[:, 1:]]

mu_sigma = 1
nclusters_fit = nclusters
bias_sd = 2
conc = 10

with pm.Model() as model:
    # first coordinate of the centers is ordered to break label symmetry
    μ0 = pm.Normal("μ0", mu=center_mus[0], sigma=mu_sigma, shape=nclusters_fit,
                   transform=pm.distributions.transforms.ordered, initval=init[0])
    μ1 = pm.Normal("μ1", mu=center_mus[1], sigma=mu_sigma,
                   shape=(nclusters_fit, ndims - 1), initval=init[1])
    μ = pm.Deterministic("μ", ptt.concatenate([μ0[:, None], μ1], axis=1))
    σ = pm.InverseGamma("σ", 2, 1)
    biases = pm.Normal("biases", 0, bias_sd, shape=(ndata,))
    weights = pm.Dirichlet("w", conc * np.ones(nclusters_fit))
    components = [pm.Normal.dist(μ[i, :][None, :] + biases[:, None], σ)
                  for i in range(nclusters_fit)]
    mix = pm.Mixture.dist(w=weights, comp_dists=components)
    pm.Censored("likelihood", mix, lower=lower, upper=upper, observed=data)
    trace = pm.sample(draws=2000, tune=1000, chains=6, target_accept=0.95)
```
There is quite a bit of cheating here, I realize. But I did this as a proof of concept, and even though I supplied censored centers as initial values, the model correctly recovers center components that are below 1. So censoring is indeed doing what it is supposed to do. Multi-modality of the posteriors is not horrible, just some here and there. But I know I am again cheating by supplying the true centers, though in principle centers from k-means or spectral clustering work just as well (after some preprocessing for the bias).
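For reference, what I mean by getting the centers from k-means is roughly the following (a sketch assuming sklearn; the bias preprocessing I mentioned is omitted):

```python
from sklearn.cluster import KMeans

# rough sketch: estimate centers from the pooled data instead of using the
# true ones; the bias correction mentioned above is omitted here
km = KMeans(n_clusters=nclusters_fit, n_init=10, random_state=seed).fit(data)
est_centers = km.cluster_centers_

# same ordering / clipping as above, so the ordered transform on the first
# coordinate and the censored initial values stay consistent
est_centers = est_centers[np.argsort(est_centers[:, 0]), :]
est_centers[est_centers < lower] = lower
init = [est_centers[:, 0], est_centers[:, 1:]]
center_mus = [est_centers[:, 0], est_centers[:, 1:]]
```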
The question is: how do I make this model less prone to center switching without cheating? I can still supply initial values from k-means etc., but it seems like the non-mixing property of MvNormal would also be quite a boon in this direction.
Note that I am fine with using an MvNormal with a diagonal covariance in which the components are independent (here they are actually iid). In principle, I guess one could define a custom censored distribution for such an MvNormal: whenever one component is at or below lower, the likelihood would be the CDF of that component at the bound multiplied by the PDF of the rest. But:
1- I don’t really know where to start; I don’t know the inner workings of PyMC well enough to understand how to construct this (my very rough attempt is sketched below),
2- If I define a custom distribution like this, would I again be defeating the non-mixing property of MvNormal?
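To make (1) a bit more concrete, the closest I could come up with myself is to sidestep a custom distribution entirely and write the censored mixture likelihood by hand with pm.Potential. This is very much a sketch of the “CDF for censored coordinates, PDF for the rest” idea, reusing the priors from the model above; I am not sure it is correct or idiomatic (and it presumably gives up prior/posterior predictive sampling):

```python
def censored_normal_logp(value, mu, sigma, lower):
    # per coordinate: log CDF at the bound where the value is censored,
    # log PDF otherwise
    dist = pm.Normal.dist(mu=mu, sigma=sigma)
    return ptt.where(ptt.le(value, lower),
                     pm.logcdf(dist, lower),
                     pm.logp(dist, value))

with pm.Model() as custom_model:
    μ0 = pm.Normal("μ0", mu=center_mus[0], sigma=mu_sigma, shape=nclusters_fit,
                   transform=pm.distributions.transforms.ordered, initval=init[0])
    μ1 = pm.Normal("μ1", mu=center_mus[1], sigma=mu_sigma,
                   shape=(nclusters_fit, ndims - 1), initval=init[1])
    μ = pm.Deterministic("μ", ptt.concatenate([μ0[:, None], μ1], axis=1))
    σ = pm.InverseGamma("σ", 2, 1)
    biases = pm.Normal("biases", 0, bias_sd, shape=(ndata,))
    weights = pm.Dirichlet("w", conc * np.ones(nclusters_fit))

    # joint logp of every observation under every cluster: summing over the
    # coordinates is the diagonal-MvNormal part, with per-coordinate censoring
    comp_logps = ptt.stack([
        censored_normal_logp(data, μ[i, :][None, :] + biases[:, None], σ, lower).sum(axis=1)
        for i in range(nclusters_fit)
    ])  # shape (nclusters_fit, ndata)

    # mixture over clusters: logsumexp of log-weights plus component logps
    mix_logp = pm.math.logsumexp(ptt.log(weights)[:, None] + comp_logps, axis=0)
    pm.Potential("likelihood", mix_logp.sum())
```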
PS: As a side note, if I deal with a smaller number of clusters such as 3, just supplying the initial values as centers (say from k-means) but keeping the center_mus vague (i.e. all 4) with a large mu_sigma like 5 also works fine when using init='advi+adapt_diag'. I haven't tested without ADVI yet, but I feel like it makes a positive impact based on previous experience.