Issues when using MarginalModel on clustering model

Sopchi · November 14, 2024, 12:27pm

Hi, I have a set of observations which are individuals. Each individual is represented by a vector of features which should follow Poisson distributions. My goal is to cluster these individuals.
Each cluster then defines the Poisson parameters of each feature of each individual in the cluster.

This is for example 5 individuals, each of them having 8 features

simu_ind = array([[151, 176, 181, 146, 135, 134, 136, 180],
       [ 62,  80, 132, 105, 138,  88,  70, 108],
       [141, 192, 178, 123, 120, 156, 155, 154],
       [ 17,  41,  16,  18,  23,  21,  13,  11],
       [ 18,  33,  14,  16,  16,  19,  15,  21]])

Let’s assume that we know the Poisson mean of each feature for each cluster (we have here 3 clusters and 5 possible Poisson distributions)

exp_mean = array([[ 18.,  25.,  20.],
        [ 36.,  49.,  40.],
        [ 54.,  74.,  60.],
        [ 72.,  98.,  79.],
        [ 90., 123.,  99.]]))

I came up with this MarginalModel to model my problem

with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))

    ind2cluster = [pm.Categorical('ind'+str(k), p=cluster_weights) for k in range(nb_ind)] 

    poisson_component_weights = [pm.Dirichlet("poisson_component_weights"+str(k), a=pt.tensor.ones(5)) for k in range(nb_clusters)]
    feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx],shape=nb_amp) for idx in range(nb_clusters)] 

    observation = []
    for k in range(nb_clusters):
        observation += [pm.Poisson('obs'+str(k),mu=expect_mean[:,ind2cluster[k]][pt.tensor.as_tensor(feat2component)[ind2cluster[k]]],observed=simu_ind[k,:])]
    
    
    for j in pt.tensor.unique(ind2cluster).eval():
        full_model_marg_test.marginalize([feat2component[j]])

    for k in range(nb_ind):
        full_model_marg_test.marginalize([ind2cluster[k]])

I wanted to use the automatic way to marginalised the discrete latent variables of the model i.e the assignment of individual to cluster ind2cluster and the assignment of feature to their appropriate Poisson distribution given the cluster assignment feat2component.

I tried to do it in the most explicit and splitted way but got this error:

NotImplementedError: Marginalization through operation Join(0, ExpandDims{axis=0}.0, ExpandDims{axis=0}.0, ExpandDims{axis=0}.0) not supported.

NotImplementedError: The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.

I then tried to marginalize only ind2cluster given deterministic feat2component and reciprocally.

Given

feat2component = array([[0, 3, 0],
        [0, 3, 1],
        [1, 3, 0],
        [2, 3, 1],
        [2, 3, 1],
        [0, 3, 1],
        [0, 4, 4],
        [1, 0, 1]])

Here it means that for cluster 0, features 0 for every individual of this cluster should follow a poisson distribution with mean exp_mean[:,0][0] = 18

with MarginalModel() as try_marg_ind:
    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))
    ind2cluster = [pm.Categorical('ind'+str(k), p=cluster_weights) for k in range(nb_ind)] 
    observation = pm.Poisson('obs', mu=[pt.shared(expect_mean)[:,ind2cluster[k]][pt.shared(feat2component)[:,ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)
    
    for k in range(nb_ind):
        try_marg_ind.marginalize([ind2cluster[k]])

This bit work if I split ind2cluster.

Now, then given,

ind2cluster = array([0, 2, 0, 0, 0])

with MarginalModel() as try_marg:    
    
    poisson_component_weights = [pm.Dirichlet("poisson_component_weights"+str(k), a=pt.tensor.ones(max_cns)) for k in range(nb_clusters)]
    feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx],shape=nb_amp) for idx in range(nb_clusters)] 
    
    observation = []
    for k in range(nb_clusters):
        observation += [pm.Poisson('obs'+str(k),mu=expect_mean[:,ind2cluster[k]][feat2component[ind2cluster[k]]],observed=simu_ind[k,:])]
        
        
    for j in np.unique(ind2cluster):

        try_marg.marginalize([feat2component[j]])

It work eventhough I got the following warning:

UserWarning: There are multiple dependent variables in a FiniteDiscreteMarginalRV. Their joint logp terms will be assigned to the first RV: obs0
  warnings.warn

It does not work if I have one Poisson for all observations as I did in try_marg_ind

Can someone help me understand why I can’t marginalize automatically the full model or what I am doing wrong please. I wonder if it’s related to this bit, where I convert a list to a tensor leading to a Join operation ?

pt.tensor.as_tensor(feat2component)

ricardoV94 · November 14, 2024, 12:40pm

Instead of writing the model with loops, try to define single variables with batch dimensions. Then you won’t need the as_tensor call which is implicitly doing a join. MarginalModel does not allow Join operations at the moment.

The warning in your second case can be safely ignored. But you don’t want to do it that way because it will require many more logp evaluations than in a model witch batch dimensions.

See Distribution Dimensionality — PyMC 5.17.1 documentation

Sopchi · November 14, 2024, 3:49pm

Thanks for your help, I tried this

with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=np.ones(nb_clusters))

    ind2cluster = pm.Categorical('ind', p=cluster_weights,shape=nb_ind) 

    poisson_component_weights = pm.Dirichlet("poisson_component_weights", a=pt.tensor.ones(shape=(nb_clusters,5))) 
    feat2component = pm.Categorical('obscomp',p=poisson_component_weights,shape=(nb_features,nb_clusters)) 
    
    observation = pm.Poisson('obs', mu=[pt.shared(exp_mean)[:,ind2cluster[k]][feat2component[:,ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)

    full_model_marg_test.marginalize([ind2cluster]) 

    full_model_marg_test.marginalize([obs2component])

However now I have this error

ValueError: Partial slicing or indexing of known dimensions not supported.
The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.

Which might be because of the two indexing bits here :

[pt.shared(exp_mean)[:,ind2cluster[k]][obs2component[:,ind2cluster[k]]] for k in range(nb_ind)]

I tried versions without part slicing such as

feat2component = [pm.Categorical('obscomp'+str(idx),p=poisson_component_weights[idx,:],shape=nb_features) for idx in range(nb_clusters)]
observation = pm.Poisson('obs', mu=[pm.math.stack(exp_RC_per_cluster)[ind2cluster[k]][pm.math.stack(feat2component)[ind2cluster[k]]] for k in range(nb_ind)],observed=simu_ind)

And still does not work. However because of the pm.math.stack I thought I would get the Join error but did not get it…

ricardoV94 · November 19, 2024, 10:24am

You still have a loop in the observation, can you get rid of it with advanced indexing? Also can you provide a fully reproducible snippet? Can’t test it otherwise

Sopchi · November 20, 2024, 10:05am

Sure, would this work ?

import numpy as np
import pymc as pm
import pytensor as pt
from pymc_experimental import MarginalModel

nb_clusters = 3
nb_ind = 5
nb_features = 8

simu_ind = np.array([[151, 176, 181, 146, 135, 134, 136, 180],
       [ 62,  80, 132, 105, 138,  88,  70, 108],
       [141, 192, 178, 123, 120, 156, 155, 154],
       [ 17,  41,  16,  18,  23,  21,  13,  11],
       [ 18,  33,  14,  16,  16,  19,  15,  21]])

exp_mean = np.array([[ 18.,  25.,  20.],
        [ 36.,  49.,  40.],
        [ 54.,  74.,  60.],
        [ 72.,  98.,  79.],
        [ 90., 123.,  99.]])


with MarginalModel() as full_model_marg_test:

    cluster_weights = pm.Dirichlet("cluster_weights", a=pt.tensor.ones(nb_clusters))

    ind2cluster = pm.Categorical('ind', p=cluster_weights,shape=nb_ind) 

    poisson_component_weights = pm.Dirichlet("poisson_component_weights", a=pt.tensor.ones(shape=(nb_clusters,5))) 
    feat2component = pm.Categorical('obscomp',p=poisson_component_weights,shape=(nb_features,nb_clusters)) 

    columns = ind2cluster*pt.tensor.ones(shape=(nb_features,nb_ind))
    row_feat = (pt.tensor.ones(shape=(nb_ind,nb_features))*pt.tensor.arange(nb_features)).T
    rows = feat2component[row_feat.astype(dtype=np.intp),columns.astype(dtype=np.intp)]
    all_clusters_parameters = pt.tensor.as_tensor(exp_mean)[rows.astype(dtype=np.intp),columns.astype(dtype=np.intp)] 

    observation = pm.Poisson('obs',mu=all_clusters_parameters, observed=simu_ind)
    
    full_model_marg_test.marginalize([ind2cluster]) 

    full_model_marg_test.marginalize([feat2component])

When running it I now have this error:

ValueError: Partial slicing or advanced integer indexing of known dimensions not supported.

NotImplementedError: The graph between the marginalized and dependent RVs cannot be marginalized efficiently. You can try splitting the marginalized RV into separate components and marginalizing them separately.

Topic		Replies	Views
Mixture model with sample specific weights Questions bug	1	660	April 12, 2023
Poisson Mixture Model & Label Switching v5	7	1481	August 3, 2022
Learning Poisson State Space Model parameters for large number of samples version agnostic modeling	14	674	May 10, 2023
Marginals of joint distributed parameters Questions	0	297	October 13, 2021
Properly sampling mixture models Questions	9	4453	March 26, 2018

Issues when using MarginalModel on clustering model

Related topics