Continuing on from here:
where I asked about mixtures of normals, thanks to @jessegrabowski and @ricardoV94 I have come to realize that, due to how sampling works, an MvNormal with a diagonal covariance is not identical to a bunch of univariate normals in a mixture model and can actually sample more efficiently (somewhat remedying label mixing, if not completely).
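To spell out my understanding of the distinction (a rough sketch with made-up shapes, not code from that thread): with univariate components each coordinate of an observation gets its own mixture index, whereas with MvNormal components the whole row is assigned to a single cluster.

import numpy as np
import pymc as pm

with pm.Model():
    w = pm.Dirichlet("w", np.ones(3))
    mus = pm.Normal("mus", 0, 5, shape=(3, 7))
    # (a) univariate Normal components: the mixture index is drawn per coordinate,
    #     so different coordinates of one observation can come from different clusters
    comps_uni = [pm.Normal.dist(mu=mus[i], sigma=1.0) for i in range(3)]
    mix_uni = pm.Mixture.dist(w=w, comp_dists=comps_uni)
    # (b) MvNormal with diagonal covariance: the whole 7-d row is one event,
    #     so all coordinates share the same cluster assignment
    comps_mv = [pm.MvNormal.dist(mu=mus[i], cov=np.eye(7)) for i in range(3)]
    mix_mv = pm.Mixture.dist(w=w, comp_dists=comps_mv)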
However, as it stands, MvNormal is not applicable where I want to use the mixture model. Following up on @ricardoV94's question, I wanted to open a separate topic for this. A summary of how the data looks in my case is as follows:
1- Assume we are given a cluster's center coordinates, an n-dimensional vector with values between 0 and 10.
2- Any data point i around this center is given as center + bias_i + (iid noise on each coordinate)_i. Moreover, there is thresholding: any value below 1 is set to 1.
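In other words (my own notation), coordinate j of data point i belonging to cluster k would be

x_{ij} = \max\bigl(1,\; c_{kj} + b_i + \varepsilon_{ij}\bigr), \qquad b_i \sim \mathcal{N}(0, \sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)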
Here is how I would generate it:
import numpy as np
import pymc as pm
import pytensor.tensor as ptt
import matplotlib.pyplot as plt


def plot_clusters(centers, data, max_val):
    nclusters = centers.shape[0]
    fig, ax = plt.subplots(1, nclusters, figsize=(5*nclusters, 5))
    for i in range(nclusters):
        ax[i].plot(centers[i, :], color="red", zorder=1)
        ax[i].plot(data[i].T, color="black", alpha=0.3, zorder=0)
        ax[i].set_ylim(0, max_val+3)
        ax[i].set_yticks(range(0, max_val+3))
        ax[i].grid(True, alpha=0.3)


ndims = 7
nclusters = 5
max_val = 8
lower = 1
bias_sd = 1
noise_sd = 0.5
upper = np.inf
seed = 0
rng = np.random.default_rng(seed)

ndata_per_cluster = rng.integers(20, 50, size=nclusters)
ndata = int(np.sum(ndata_per_cluster))
cluster_centers = rng.random((nclusters, ndims))*max_val

data = []
for i in range(nclusters):
    biases = rng.normal(0, bias_sd, (ndata_per_cluster[i],))[:, None]
    noise = rng.normal(0, noise_sd, (ndata_per_cluster[i], ndims))
    cluster_data = cluster_centers[i:i+1, :] + biases + noise
    cluster_data[cluster_data < lower] = lower
    data.append(cluster_data)

plot_clusters(cluster_centers, data, max_val)
data = np.concatenate(data)
So the model that I tried initially was as follows (normally init and center_mus would come from k-means or something similar, but for now I will use the true centers):
# normally init would come from spectral or k-means clustering with some
# bias correction on the data
cluster_centers = cluster_centers[np.argsort(cluster_centers[:, 0]), :]
cluster_centers[cluster_centers < lower] = lower
init = [cluster_centers[:, 0], cluster_centers[:, 1:]]
center_mus = [cluster_centers[:, 0], cluster_centers[:, 1:]]
mu_sigma = 1
nclusters_fit = nclusters
bias_sd = 2
conc = 10

with pm.Model() as model:
    μ0 = pm.Normal("μ0",
                   mu=center_mus[0],
                   sigma=mu_sigma,
                   shape=nclusters_fit,
                   transform=pm.distributions.transforms.ordered,
                   initval=init[0])
    μ1 = pm.Normal("μ1",
                   mu=center_mus[1],
                   sigma=mu_sigma,
                   shape=(nclusters_fit, ndims-1),
                   initval=init[1])
    μ = pm.Deterministic("μ", ptt.concatenate([μ0[:, None], μ1], axis=1))
    σ = pm.InverseGamma("σ", 2, 1)
    biases = pm.Normal("biases", 0, bias_sd, shape=ndata)
    weights = pm.Dirichlet("w", conc*np.ones(nclusters_fit))
    components = [pm.Normal.dist(μ[i, :][None, :] + biases[:, None], σ)
                  for i in range(nclusters_fit)]
    mix = pm.Mixture.dist(w=weights, comp_dists=components)
    pm.Censored("likelihood", mix, lower=lower, upper=upper, observed=data)

    trace = pm.sample(draws=2000, tune=1000, chains=6, target_accept=0.95)
There is quite a bit of cheating here, I realize, but I did this as a proof of concept. Even though I supplied censored centres as initial values, the model correctly recovers centre components that are below 1, so censoring is indeed doing what it is supposed to do. Multi-modality of the posteriors is not horrible, just some here and there. But again, I know I am cheating by supplying the true centres, though in principle centres from k-means or spectral clustering work just as well (after some preprocessing for the bias).
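(By "correctly recovers" I mean comparing the posterior of μ against the true cluster_centers, roughly like this:)

import arviz as az

# posterior summary of the recovered cluster centres vs the true ones
print(az.summary(trace, var_names=["μ"], round_to=2))
print(cluster_centers)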
The question is: how do I make this model less prone to centre switching without cheating? I can still supply initial values from k-means etc., but it seems like the non-mixing property of MvNormal would also be quite a boon in this direction.
Note that I am fine with using an MvNormal with a diagonal covariance in which the components are independent (here they are actually iid). In principle, I guess one could define a custom censored version of such an MvNormal: whenever one component is below lower, the likelihood would be the CDF of that component (evaluated at lower) multiplied by the PDF of the rest (see the rough sketch after this list). But
1- I don't really know where to start; I don't know the inner workings of PyMC well enough to understand how to construct this,
2- if I define a custom distribution like this, would I again be losing the non-mixing benefit of MvNormal?
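To make 1 concrete, the furthest I got is a plain sketch of the per-observation log-likelihood I have in mind (the function name is just a placeholder of mine; I have not figured out how to wrap this in pm.CustomDist or plug it into pm.Mixture):

import numpy as np
import pymc as pm
import pytensor.tensor as ptt

def censored_diag_normal_logp(value, mu, sigma, lower):
    # independent normals on each coordinate
    base = pm.Normal.dist(mu=mu, sigma=sigma)
    logpdf = pm.logp(base, value)      # elementwise log-PDF
    logcdf = pm.logcdf(base, value)    # elementwise log-CDF
    # coordinates sitting at the bound contribute the CDF, the rest the PDF
    elem = ptt.switch(ptt.le(value, lower), logcdf, logpdf)
    return elem.sum(axis=-1)           # joint logp per observation

# quick numerical check on a single 3-d observation whose first coordinate is censored
val = np.array([[1.0, 2.5, 4.0]])
print(censored_diag_normal_logp(val, mu=np.array([2.0, 2.0, 4.0]),
                                sigma=0.5, lower=1.0).eval())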
PS: As a side note, if I deal with a smaller number of clusters, such as 3, just supplying the initial values as the centres (say from k-means) but keeping center_mus vague (i.e. all 4) with a large mu_sigma like 5 also works fine when using init='advi+adapt_diag'. I haven't tested without ADVI yet, but I feel it makes a positive impact based on previous experience.
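For reference, this is roughly what I mean by supplying k-means centres only as initial values while keeping the prior means vague (sklearn's KMeans assumed; the bias preprocessing I mentioned is omitted):

from sklearn.cluster import KMeans

# k-means centres used only as initial values, sorted by the first coordinate
# so they respect the ordered transform on μ0
km = KMeans(n_clusters=nclusters_fit, n_init=10, random_state=0).fit(data)
km_centers = km.cluster_centers_[np.argsort(km.cluster_centers_[:, 0]), :]
init = [km_centers[:, 0], km_centers[:, 1:]]

# vague prior means (all 4) and a larger mu_sigma instead of the true centres
center_mus = [np.full(nclusters_fit, 4.0), np.full((nclusters_fit, ndims - 1), 4.0)]
mu_sigma = 5

# then rebuild the model block above with these and sample with the ADVI init:
# trace = pm.sample(draws=2000, tune=1000, chains=6,
#                   init="advi+adapt_diag", target_accept=0.95)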