Bayesian Non-negative Matrix Factorization using a Gamma-Poisson model

Hi all,

I aim trying to develop a Bayesian NMF model to decompose genetic data (namely so called SNP marker data). The data is special in that case that it just contains 0, 1 or 2 values. I want to employ a Gamma-Poisson model with automatic relevance determination (ARD).
My current version is as follows:

# X has dimension nxfeatures
mu = np.mean(X)
alpha_prior = mu**(1/2) # based on the assumption that  mean_gamma_U * mean_gamma_V -> mean_poisson
with pm.Model() as pmf:
    alpha_U = pm.Gamma('alpha_U', alpha=1, beta=1, shape=n_components) # n_components represents the targeted latent space e.g. 5. 
    alpha_V = pm.Gamma('alpha_V', alpha=1, beta=1, shape=n_components)
    U = pm.Gamma('U', alpha=alpha_prior, beta=alpha_U, shape=(X.shape[0], n_components))
    V = pm.Gamma('V', alpha=alpha_prior, beta=alpha_V, shape=(X.shape[1], n_components))
    R = pm.Poisson('R',, V.T), observed=self.X)

alpha_U and alpha_V should be regulate the ARD process. However, I observe very similar values for each posterior alpha_U and alpha_V respectively, independent of the data. I created some synthetic data based on gamma distributions with latent space 3 but I still I obtain similar values for each alpha_U/_V even if I use latent space 5 for the model fit. On the other hand, I obtain with scikit-learns NMF reasonable decompositions. I tried as well to use alpha_U for both U and V without any improvement.
Can anyone help me with this or made similar experiences??
Thank you all in advance!



Well factorization problems can suffer from non-identifiability sometimes:

I don’t really know your specific problem but one could run into such a problem for U and V perhaps (scaling one up and the other down would still give the same result for instance)? Although signs of non-identifiability might be different than what you are observing its worth a check as is done above.
Are the posterior HDIs for alpha_U and alpha_V very large and containing the true values? That could also be a sign of non-identifiability. Looking at posterior plots and trace plots using arviz to see if you get multimodal behaviour is another way to check. It might also help others here if you either put some posterior plots or a full code with simulated data.

Hi @iavicenna,

thanks for your help!
I had a look on the non-identifiability issue. Indeed, putting a constraint can help (Kucukelbir et al. used an ordering constraint on one of the gamma matrices To simplify the issue, I decided to remove the priors on the latent space first and focus on obtaining the ground truth.
First about the data generation and the model, I use the following code:

import numpy as np
import pandas as pd
from pytensor import tensor as tt
import pymc as pm

def print_callback(approx, losses, i):
    if i % 100 == 0:
        print(f"Iteration: {i}, Loss: {losses[-1]}")

# Define the parameters for the gamma distributions
shape = [(1, 0.2, 0.2),
         (0.2, 1, 0.2),
         (0.2, 0.2, 1)]
scale = 0.5

# Initialize an empty matrix
U = np.zeros((100, 3))

# Fill the matrix
for row in range(100):
    if row < 40:
        for col in range(3):
            U[row, col] = np.random.gamma(shape[0][col], scale)
    elif row < 80:
        for col in range(3):
            U[row, col] = np.random.gamma(shape[1][col], scale)
        for col in range(3):
            U[row, col] = np.random.gamma(shape[2][col], scale)

# Define parameters for the gamma distribution
shape = 2
scale = 0.5

# Generate the matrix
V = np.random.gamma(shape, scale, (3, 10000))
lambdas = U @ V

y = np.random.poisson(lambdas)
df = pd.DataFrame(y)

value_counts = df.stack().value_counts().head()

n_components = 3
alpha_prior = lambdas.mean()

with pm.Model() as pmf:
    U_model = pm.Gamma('U', alpha=alpha_prior, beta=0.89, shape=(df.shape[0], n_components))
    V_raw_model = pm.Gamma('V_raw', alpha=1, beta=1, shape=(df.shape[1], n_components))

    increased_gamma_matrix = tt.cumsum(V_raw_model, axis=1)
    V_model = pm.Deterministic('V', increased_gamma_matrix)

    R = pm.Poisson('R',, V_model.T), observed=df)

with pmf:
    inference = pm.ADVI()
    callback = pm.callbacks.CheckParametersConvergence(tolerance=1e-2)
    approx =,
                    callbacks=[callback, print_callback])
    trace = approx.sample(10_000 // 2)

U_post = trace.posterior["U"].mean(axis=1)[0].values
V_post = trace.posterior["V"].mean(axis=1)[0].values

reconstruction = U_post @ V_post.T

So, I expect three latent spaces with higher expressions for “3” classes (namely with 40, 40 and 20 obs). The V matrix is unstructured.

My follow-up research will focus on clustering the U_post matrix. Thus, I decided to put the constraint on V to left U untouched. The results are however still not close to the ground truth.
The following shows a snippet of U_post:

One general point on workflow: I strongly suggest you ditch generating test data with numpy, and instead use your PyMC model directly to do it. It’s cleaner and helps avoid errors that can crop up whenever one has to do the same thing twice (to say nothing of subtle differences in parameterization/broadcasting between pymc and numpy).

Here’s how things could look in your case:

import pymc as pm
import arviz as az
import numpy as np

seed = sum(map(ord, 'Gamma Poisson Matrix'))
rng = np.random.default_rng(seed)

n_users = 100
n_items = 10
n_components = 3
n_obs = 10
coords = {'component': np.arange(n_components, dtype='int'),
          'user': np.arange(n_users, dtype='int'),
          'item': np.arange(n_items, dtype='int'),
          'obs_idx': np.arange(n_obs, dtype='int')}
with pm.Model(coords=coords) as pmf:
    alpha_U = pm.Gamma('alpha_U', alpha=2, beta=1, dims=['component']) 
    alpha_V = pm.Gamma('alpha_V', alpha=2, beta=1, dims=['component'])
    U = pm.Gamma('U', alpha=2, beta=alpha_U, dims=['user', 'component'])
    V = pm.Gamma('V', alpha=2, beta=alpha_V, 
                 dims=['item', 'component'])
    R = pm.Poisson('R', mu = U @ V.cumsum(axis=-1).T, dims=['obs_idx', 'user', 'item'])
    prior = pm.sample_prior_predictive(random_seed=rng)

# Choose a prior draw at random to serve as the ground truth data
random_idx = rng.choice(prior.prior.draw)
test_sample = prior.prior.sel(chain=0, draw=random_idx)

# Use pm.observe to clone your model, inserting the chosen draw as observed data, and attempt to recover the parameters
with pm.observe(pmf, {'R':test_sample.R.values}) as m:
    idata = pm.sample(nuts_sampler='nutpie')

# Make a bunch of diagnostic plots, for example:
az.plot_trace(idata, var_names=[ for x in m.free_RVs]);
az.plot_posterior(idata, var_names=['alpha_U'], ref_val=test_sample.alpha_U.values.tolist())
az.plot_posterior(idata, var_names=['alpha_V'], ref_val=test_sample.alpha_V.values.tolist())
az.plot_posterior(idata, var_names=['U'], 

As for the actual modeling, the transformation suggested in the paper (and mentioned in the old issue you linked) is pm.distributions.transforms.ordered. If you use it, you also have to set inital values for the distribution using the initvals keyword of pm.sample, or else you can end up with np.nan at the sampler starting point.

I also wasn’t super clear on the structure of your data, if it’s just one giant matrix or if you have a stack of matrices (like over time/experiments), so I added that obs_idx as the batch dim. That could be removed, obviously. I took the dim names from the example in the linked paper’s appendix (they’re obviously not relevant to your model). I recommend you use named dims to help communicate your model to others.