Beta process (or truncated Indian Buffet process) factor analysis in the vein of the Dirichlet Process example

I am trying to set up a truncated Beta Process factor analysis by using a stick breaking construction as here: Beta Process Factor Analysis. The analysis is in the same vein as the pymc3 Dirichlet process example, just applied to a Beta process.

In short, the Beta process places a prior over binary matrices Z of dimensions (num_observations X num_factors). Each observation is assumed to be produced from a linear combination of the factors, chosen according to the entries of Z - specifically, if Z_{ij}=1, then factor j contributes to observation i. Thus, observation i can be written as Obs_{i} = \sum_{j}Z_{ij}\times factor_{j}.

There’s a Beta prior on the factor probabilities \pi_{k} (or the probabilities underlying the 0/1 entries of Z) with parameters a and b (this is taken from the paper linked above). If there are K factors in the data, the factor probabilities can be pulled as: \pi_{k} \sim Beta(\frac{a}{K}, \frac{b(K-1)}{K}).

Here’s some code that generates dummy data - I chose the factors themselves iid from \mathcal{N}(5, 2). I have 100 observations and 10 factors in this dummy dataset.

import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Data generation
# 100 observations, each produced from a linear combination of 10 latent features
# Binary indicator array (100x10) indicating which latent features are "on" for every observation
# Beta process prior over the binary indicator array
a = 5
b = 10
K = 10
data_p = np.random.beta(a/K, b*(K-1)/K, size = 10)
data_p = np.tile(data_p[None, :], (100, 1))
# Bernoulli indicator array of latent features for all observations
data_Z = np.random.binomial(n = 1, p = data_p)
# The latent features, pulled from a Normal distribution
latent_features_data = np.random.normal(loc = 5, scale = 2, size = 10)
# Multiply the latent features and indicator array
# Observations are linear combinations of latent features, so sum along the features axis
data = np.sum(data_Z * latent_features_data[None, :], axis = -1) + np.random.normal(loc = 0, scale = 1, size = 100)

And here’s my pymc3 model. This is the most naive implementation, and necessitates a BinaryGibbsMetropolis sampling of the Bernoulli Z. The performance isn’t very good, even after a few thousand steps of tuning, NUTS has a lot of divergences and prompts me to increase “target_accept”. Of course, the sampling was nowhere close to converging.

# pymc3 model
with pm.Model() as model:
	# Params for Beta distribution over factor probabilities
	a = pm.Gamma("a", 1, 1)
	b = pm.Gamma("b", 1, 1)
	p = pm.Beta("p", a/K, b*(K-1)/K, shape = 10)
	# Bernoulli indicator matrix of which factors are "on" for every observation
	Z = pm.Bernoulli("Z", p = tt.tile(p[None, :], (100, 1)), shape = (100, 10))
	# Latent factors - 10 of them
	factors = pm.Normal("factors", mu = 0, sd = 1, shape = 10)
	# Observations are linear combinations of the factors
	equation = tt.sum(Z * tt.tile(factors[None, :], (100, 1)), axis = -1)
	sd = pm.HalfCauchy("sd", 0.5)
	obs = pm.Normal("obs", mu = equation, sd = sd, observed = data)

with model:
	tr = pm.sample(tune = 4000, draws = 2000, njobs = 3)

Any tips/advice are very appreciated :slight_smile: !

This is usually the source of problem, as a high dimensional Z makes the sampling difficult, especially mixing NUTS with other samplers.

The main strategy for these kinds of problem is to marginalize out the discrete variable Z, by expressing the likelihood as a mixture distribution.
Another approach I sometimes take is to express the indexing matrix Z as a weight matrix between 0 and 1 (and summed to 1). It should not be confused with the marginalization although the result is quite similar and it is often a bit easier to rewrite your model. See eg this post: Indexing using a free random variable

Really cool idea @junpenglao, thanks for bringing it up. :smile:

I have another (possibly silly) question - the probability of my indicator variable depends on the Beta draws of p. I want the the proportion of 0/1 draws of Z to change as p changes. So, following your suit in the other discussion, this works perfectly (in theory):
p \sim Beta(\frac{a}{K}, \frac{b(K-1)}{K}) - the form of p is not super important in this discussion, just that Z below needs to be drawn according to this p
Then, Z \sim Beta(\frac{p}{n}, \frac{1-p}{n}) where n is sufficiently large, say n=1000.
This produces 0/1 draws of Z according to p, but the sampler has divergences (as you noted in the last discussion).

You handled this in the last discussion by sampling the Dirichlet through Gamma random variables, for me that would be pulling 2 Gamma random variables. So to get Z \sim Beta(\frac{p}{n}, \frac{1-p}{n}), I pull X \sim Gamma(\frac{p}{n}, 1) and Y \sim Gamma(\frac{1-p}{n}, 1) and set Z = \frac{X}{X+Y}. However, since p itself is between 0 and 1, most of the Gamma draws are extremely close to 0 and that gives NaN while normalizing. I know I am missing something simple here, but is there some way I can scale/draw differently from the Gammas to still get a sparse Beta distribution for Z.

Thanks so much again for your help, your idea is great!

Just to chime in with some updates, I tried calculating the Z from two Gamma distributed random variables as I mentioned above. Having n=10 prevents the Gamma variables from running too small, and that seems usable. However, the sampling still gives lots of divergences. Here’s my model snippet (running on the same data as above):

# Use Gamma random variables to pull the sparse Beta draws
n = 10
K = 10
with pm.Model() as model:
	# Params for Beta distribution over factor probabilities
	a = pm.Gamma("a", 1, 1)
	b = pm.Gamma("b", 1, 1)
	p = pm.Beta("p", a/K, b*(K-1)/K, shape = 10)
	p = tt.tile(p[None, :], (100, 1))
	# Bernoulli indicator matrix of which factors are "on" for every observation
	a1 = pm.Gamma('a1', alpha=p/n, beta=1, shape=(100, 10))
	a2 = pm.Gamma("a2", alpha = (1-p)/n, beta = 1, shape = (100, 10))
	Z = pm.Deterministic("Z", a1/(a1+a2))
	# Latent factors - 10 of them
	factors = pm.Normal("factors", mu = 0, sd = 1, shape = 10)
	# Observations are linear combinations of the factors
	equation = tt.sum(Z * tt.tile(factors[None, :], (100, 1)), axis = -1)
	sd = pm.HalfCauchy("sd", 0.5)
	obs = pm.Normal("obs", mu = equation, sd = sd, observed = data)

with model:
	tr = pm.sample(tune = 4000, draws = 2000, njobs = 3)

NUTS samples slowly at about 3 it/s, and has a bunch of divergences after tuning.

I tried your code - it does not quite work as the model converge to basically identify one common mean (with one non-zero factors) and correspondent p that close to 1, thus making the model unidentifiable.

A question: Looking at each row of the latent matrix Z_{ij}, some rows is all zero - is it intentional?

No, no row of Z should be all-zero. That was unintentional. That’s likely because pulling p \sim Beta(\frac{a}{K}, \frac{b(K-1)}{K}) during dummy data generation makes p really small, and so I get a bunch of zeros in Z. I have since then pulled p as Beta(5, 5) or something like that, then I don’t get the problem of all zeros in Z, but that doesn’t improve the sampling though.

I have also tried a different sort of stick-breaking during sampling - this scheme is very similar to Dirichlet process stick breaking. For the DP, we break off our new stick-lengths from the the portioning of the stick remaining, so that \pi_{k} = p_{k} \Pi_{i=1}^{k-1} (1-p_{i}). For the Beta process, we discard the “portion remaining” of the stick, and instead break off stick lengths from the already broken portion, so that \pi_{k} = \Pi_{i=1}^{k} p_{i}. The model now looks like this:

def stick_breaking(beta):
    portion_remaining = tt.concatenate([[1], tt.extra_ops.cumprod(beta)[:-1]])
    return beta * portion_remaining

# Use Gamma random variables to pull the sparse Beta draws
n = 10
K = 10
with pm.Model() as model:
	# Params for Beta distribution over factor probabilities
	#a = pm.Gamma("a", 10, 1)
	a = pm.Gamma("a", 1, 1)
	beta = pm.Beta("beta", a, 1, shape = 10)
	p = stick_breaking(beta)
	#b = pm.Gamma("b", 10, 1)
	#p = pm.Beta("p", a/K, b*(K-1)/K, shape = 10)
	#p = pm.Beta("p", a, b, shape = 10)
	#p = tt.tile(p[None, :], (100, 1))
	# Bernoulli indicator matrix of which factors are "on" for every observation
	a1 = pm.Gamma('a1', alpha=tt.tile(p[None, :], (100, 1))/n, beta=1, shape=(100, 10))
	a2 = pm.Gamma("a2", alpha = (1-tt.tile(p[None, :], (100, 1)))/n, beta = 1, shape = (100, 10))
	Z = pm.Deterministic("Z", a1/(a1+a2))
	# Latent factors - 10 of them
	factors = pm.Normal("factors", mu = 0, sd = 1, shape = 10)
	# Observations are linear combinations of the factors
	equation = tt.sum(Z * tt.tile(factors[None, :], (100, 1)), axis = -1)
	sd = pm.HalfCauchy("sd", 0.5)
	obs = pm.Normal("obs", mu = equation, sd = sd, observed = data)

with model:
	tr = pm.sample(tune = 4000, draws = 2000, njobs = 3)

This model doesn’t work as well, and I get a huge number of divergences after tuning NUTS.

I really appreciate the time you are taking to help me :slight_smile:

But multiple entries in one row of Z could be 1 right?

I kind of get the original model (with Z explicitly model as Bernoulli) to work, but the model is still unidentifiable: when you have eg two factor on the same row, with one is positive and the other is negative, since the data is sparse (it might only appear once) there is not enough information to estimate them.

@narendramukherjee , were you ever able to successfully perform inference in the Beta process factor analysis model? If so, do you have the code that worked and could you share it?

@Rylan_Schaeffer Unfortunately, it’s been a while since I was working on this and I don’t think I eventually finished off the work on the Beta process factor analysis model. Do let me know if you try it and find something that works for you :slight_smile:

1 Like