In my dataset, the observations are grouped under a few categories (sample_type in the code below), and I want to find clusters within each category. I also believe that some clusters may appear in more than one category. I assumed a Dirichlet distribution for within-cluster variation. With that in mind, I started building my model:
import numpy as np
import pymc3 as pm

n, k, c, t = 30, 6, 2, 5
# Category label (0..t-1) for each of the n observations
sample_type = np.random.choice(range(t), n)
with pm.Model() as model:
    cluster_profiles = pm.Exponential("Cluster profile ratio", 1, shape=(k, c))
    cluster_weights = pm.Dirichlet("Cluster weights", np.ones((t, c)) / 2, shape=(t, c))
    components = pm.Dirichlet.dist(a=cluster_profiles, shape=(k, c))
Where k is the dimensionality of my observations, n is how many observations I have, c is how many clusters I am looking for, and t is the number of possible categories. So far so good. Then, I added the mixture part:
with model:
    pm.MixtureSameFamily("Tumor-based prior", w=cluster_weights[sample_type], comp_dists=components, shape=(n, k))
But it throws a ValueError: Input dimension mis-match. (input[0].shape[0] = 30, input[1].shape[0] = 6). I believe it is because cluster_weights is supposed to have shape c, not (n, c). If that is the case, how can I implement a mixture with observation-specific weights? Am I using the Mixture module correctly?
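To show the shapes I am talking about, here is a small NumPy-only sketch. It does not use PyMC3 at all: the random draws are just stand-ins for the model variables, and the names weights_per_type, weights_per_obs and profiles are made up for this illustration. My guess (only a guess) is that the Theano error comes from the same kind of clash between a leading dimension of 30 and a leading dimension of 6.

```python
import numpy as np

n, k, c, t = 30, 6, 2, 5
rng = np.random.default_rng(0)
sample_type = rng.integers(0, t, size=n)

# Stand-in for the (t, c) weights: one weight vector per category.
weights_per_type = rng.dirichlet(np.ones(c) / 2, size=t)

# Indexing by category gives one weight vector per observation.
weights_per_obs = weights_per_type[sample_type]
print(weights_per_obs.shape)  # (30, 2)

# Stand-in for the component parameters, shaped (k, c) as in my model.
profiles = rng.exponential(1.0, size=(k, c))

# Any elementwise operation between the two fails, because the
# leading dimensions (30 and 6) cannot be broadcast against each other.
try:
    weights_per_obs + profiles
    broadcast_ok = True
except ValueError as err:
    broadcast_ok = False
    print("broadcast failed:", err)
```

So the per-observation weights have shape (n, c) while the components carry k in their first dimension, which matches the 30 and 6 reported in the error message.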
I am using PyMC3 v3.11.2 on a Google Colab Linux instance. I have already tried replacing pm.MixtureSameFamily with pm.Mixture, but it did not help.
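In case it helps to see what I am trying to express, below is a rough NumPy simulation of the generative process I have in mind. It is only a sketch under my own assumptions: each observation picks a cluster using the weights of its category, and is then drawn from a k-dimensional Dirichlet whose concentration is that cluster's column of the profile matrix. The names cluster_of_obs and observations are invented for this example and do not appear in my model.

```python
import numpy as np

n, k, c, t = 30, 6, 2, 5
rng = np.random.default_rng(42)
sample_type = rng.integers(0, t, size=n)

# Stand-ins for the priors in the model (plain draws, no inference).
cluster_profiles = rng.exponential(1.0, size=(k, c))      # one column per cluster
cluster_weights = rng.dirichlet(np.ones(c) / 2, size=t)   # one row per category

# Each observation picks a cluster using the weights of its own category.
cluster_of_obs = np.array(
    [rng.choice(c, p=cluster_weights[s]) for s in sample_type]
)

# Each observation is a k-dimensional Dirichlet draw around its cluster profile.
observations = np.stack(
    [rng.dirichlet(cluster_profiles[:, z]) for z in cluster_of_obs]
)
print(observations.shape)  # (30, 6)
```

Here the weights are effectively observation-specific (through sample_type), which is the part I do not know how to pass to the mixture.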
Thanks!