Issues when using MarginalModel on clustering model

Hi, I have a set of observations which are individuals. Each individual is represented by a vector of features which should follow Poisson distributions. My goal is to cluster these individuals.
Each cluster then defines the Poisson parameters of each feature of each individual in the cluster.

This is for example 5 individuals, each of them having 8 features

simu_ind = array([[151, 176, 181, 146, 135, 134, 136, 180],
       [ 62,  80, 132, 105, 138,  88,  70, 108],
       [141, 192, 178, 123, 120, 156, 155, 154],
       [ 17,  41,  16,  18,  23,  21,  13,  11],
       [ 18,  33,  14,  16,  16,  19,  15,  21]])

Let’s assume that we know the Poisson mean of each feature for each cluster (we have here 3 clusters and 5 possible Poisson distributions)

exp_mean = array([[ 18.,  25.,  20.],
        [ 36.,  49.,  40.],
        [ 54.,  74.,  60.],
        [ 72.,  98.,  79.],
        [ 90., 123.,  99.]]))

I came up with this MarginalModel to model my problem

with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))

    ind2cluster = [pm.Categorical('ind'+str(k), p=cluster_weights) for k in range(nb_ind)] 

    poisson_component_weights = [pm.Dirichlet("poisson_component_weights"+str(k), a=pt.tensor.ones(5)) for k in range(nb_clusters)]
    feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx],shape=nb_amp) for idx in range(nb_clusters)] 

    observation = []
    for k in range(nb_clusters):
        observation += [pm.Poisson('obs'+str(k),mu=expect_mean[:,ind2cluster[k]][pt.tensor.as_tensor(feat2component)[ind2cluster[k]]],observed=simu_ind[k,:])]
    
    
    for j in pt.tensor.unique(ind2cluster).eval():
        full_model_marg_test.marginalize([feat2component[j]])

    for k in range(nb_ind):
        full_model_marg_test.marginalize([ind2cluster[k]])

I wanted to use the automatic way to marginalised the discrete latent variables of the model i.e the assignment of individual to cluster ind2cluster and the assignment of feature to their appropriate Poisson distribution given the cluster assignment feat2component.

I tried to do it in the most explicit and splitted way but got this error:

NotImplementedError: Marginalization through operation Join(0, ExpandDims{axis=0}.0, ExpandDims{axis=0}.0, ExpandDims{axis=0}.0) not supported.

NotImplementedError: The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.

I then tried to marginalize only ind2cluster given deterministic feat2component and reciprocally.

Given

feat2component = array([[0, 3, 0],
        [0, 3, 1],
        [1, 3, 0],
        [2, 3, 1],
        [2, 3, 1],
        [0, 3, 1],
        [0, 4, 4],
        [1, 0, 1]])

Here it means that for cluster 0, features 0 for every individual of this cluster should follow a poisson distribution with mean exp_mean[:,0][0] = 18

with MarginalModel() as try_marg_ind:
    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))
    ind2cluster = [pm.Categorical('ind'+str(k), p=cluster_weights) for k in range(nb_ind)] 
    observation = pm.Poisson('obs', mu=[pt.shared(expect_mean)[:,ind2cluster[k]][pt.shared(feat2component)[:,ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)
    
    for k in range(nb_ind):
        try_marg_ind.marginalize([ind2cluster[k]])

This bit work if I split ind2cluster.

Now, then given,

ind2cluster = array([0, 2, 0, 0, 0])
with MarginalModel() as try_marg:    
    
    poisson_component_weights = [pm.Dirichlet("poisson_component_weights"+str(k), a=pt.tensor.ones(max_cns)) for k in range(nb_clusters)]
    feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx],shape=nb_amp) for idx in range(nb_clusters)] 
    
    observation = []
    for k in range(nb_clusters):
        observation += [pm.Poisson('obs'+str(k),mu=expect_mean[:,ind2cluster[k]][feat2component[ind2cluster[k]]],observed=simu_ind[k,:])]
        
        
    for j in np.unique(ind2cluster):

        try_marg.marginalize([feat2component[j]])

It work eventhough I got the following warning:

UserWarning: There are multiple dependent variables in a FiniteDiscreteMarginalRV. Their joint logp terms will be assigned to the first RV: obs0
  warnings.warn

It does not work if I have one Poisson for all observations as I did in try_marg_ind

Can someone help me understand why I can’t marginalize automatically the full model or what I am doing wrong please. I wonder if it’s related to this bit, where I convert a list to a tensor leading to a Join operation ?

pt.tensor.as_tensor(feat2component)

Instead of writing the model with loops, try to define single variables with batch dimensions. Then you won’t need the as_tensor call which is implicitly doing a join. MarginalModel does not allow Join operations at the moment.

The warning in your second case can be safely ignored. But you don’t want to do it that way because it will require many more logp evaluations than in a model witch batch dimensions.

See Distribution Dimensionality — PyMC 5.17.1 documentation

Thanks for your help, I tried this

with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))

    ind2cluster = pm.Categorical('ind', p=cluster_weights,shape=nb_ind) 

    poisson_component_weights = pm.Dirichlet("poisson_component_weights", a=pt.tensor.ones(shape=(nb_clusters,5))) 
    feat2component = pm.Categorical('obscomp',p=poisson_component_weights,shape=(nb_features,nb_clusters)) 
    
    observation = pm.Poisson('obs', mu=[pt.shared(exp_mean)[:,ind2cluster[k]][feat2component[:,ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)

    full_model_marg_test.marginalize([ind2cluster]) 

    full_model_marg_test.marginalize([obs2component])

However now I have this error

ValueError: Partial slicing or indexing of known dimensions not supported.
The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.

Which might be because of the two indexing bits here :

[pt.shared(exp_mean)[:,ind2cluster[k]][obs2component[:,ind2cluster[k]]] for k in range(nb_ind)]

I tried versions without part slicing such as

feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx,:],shape=nb_features) for idx in range(nb_clusters)]
observation = pm.Poisson('obs', mu=[pm.math.stack(exp_RC_per_cluster)[ind2cluster[k]][pm.math.stack(feat2component)[ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)

And still does not work. However because of the pm.math.stack I thought I would get the Join error but did not get it…

You still have a loop in the observation, can you get rid of it with advanced indexing? Also can you provide a fully reproducible snippet? Can’t test it otherwise

Sure, would this work ?

import numpy as np
import pymc as pm
import pytensor as pt
from pymc_experimental import MarginalModel

nb_clusters = 3
nb_ind = 5
nb_features = 8

simu_ind = np.array([[151, 176, 181, 146, 135, 134, 136, 180],
       [ 62,  80, 132, 105, 138,  88,  70, 108],
       [141, 192, 178, 123, 120, 156, 155, 154],
       [ 17,  41,  16,  18,  23,  21,  13,  11],
       [ 18,  33,  14,  16,  16,  19,  15,  21]])

exp_mean = np.array([[ 18.,  25.,  20.],
        [ 36.,  49.,  40.],
        [ 54.,  74.,  60.],
        [ 72.,  98.,  79.],
        [ 90., 123.,  99.]])


with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=pt.tensor.ones(nb_clusters))

    ind2cluster = pm.Categorical('ind', p=cluster_weights,shape=nb_ind) 

    poisson_component_weights = pm.Dirichlet("poisson_component_weights", a=pt.tensor.ones(shape=(nb_clusters,5))) 
    feat2component = pm.Categorical('obscomp',p=poisson_component_weights,shape=(nb_features,nb_clusters)) 

    columns = ind2cluster*pt.tensor.ones(shape=(nb_features,nb_ind))
    row_feat = (pt.tensor.ones(shape=(nb_ind,nb_features))*pt.tensor.arange(nb_features)).T
    rows = feat2component[row_feat.astype(dtype=np.intp),columns.astype(dtype=np.intp)]
    all_clusters_parameters = pt.tensor.as_tensor(exp_mean)[rows.astype(dtype=np.intp),columns.astype(dtype=np.intp)] 

    observation = pm.Poisson('obs',mu=all_clusters_parameters, observed=simu_ind)
    
    full_model_marg_test.marginalize([ind2cluster]) 

    full_model_marg_test.marginalize([feat2component])

When running it I now have this error:

ValueError: Partial slicing or advanced integer indexing of known dimensions not supported.
NotImplementedError: The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.