Large Scale Factor Analysis with minibatch ADVI

Hi, I’m new to PyMC3 and am using it in my master dissertation. I have a dataset of roughly 1.3 million datapoints, each of 20 dimensions, and 40% of the data matrix are missing.

My goal is to reduce its dimensionality for further analysis. The way I deal with missing values is simple - assign a 20 dimensional Bernoulli mask, that either returns the “true” value (when entry is observed) or a placeholder (when the entry is missing). To make it run on ADVI, I have integrated away the Bernoulli latent and got a mixture of a Dirac distribution centered on missing placeholder, and a Gaussian component centered on the “true” value.

My goal is to recover the full factor loading matrix (NxR) for further analysis; however, given my datasize, I couldn’t specify my V matrix to have a shape of (N, R). Is there any simple fix for this problem? Or should I infer my DxR factor matrix with minibatches and algorithmically obtain the NxR loading matrix?

On a slightly unrelated note, I am also looking to expand my current implementation to use the Indian Buffet Process, using the following generative approach:
pi_r ~ Beta(alpha / R, 1)
z_nr ~ Bernoulli(pi_r)
Factor loading matrix becomes V x Z

Which I believe is not too difficult to implement, since I can again, integrate out the Beroulli latents Z to obtain another mixture. Any thoughts?

Below is my code:

df_data = df_data.fillna(MISSING_PLACEHOLDER)

D = 20
N = len(df_data)
R = 3
batch_size = 100

minibatch = pm.Minibatch(df_data, batch_size=batch_size)

class NormalMissingValMixture(pm.Continuous):
	def __init__(self, beta, mu, precision, missing_mu, D, *args, **kwargs):
		super(NormalMissingValMixture, self).__init__(*args, **kwargs)
		self.beta = beta = mu
		self.missing_mu = missing_mu
		self.precision = precision
		self.D = D
		self.mean = pm.floatX(np.zeros(D))

	def logp(self, x):
		missing_likelihood = tt.log(self.beta) + pm.Constant.dist(self.missing_mu).logp(x)
		observed_likelihood = tt.log(1 - self.beta) + pm.Normal.dist(, tau=self.precision).logp(x)
		logps = [missing_likelihood, observed_likelihood]
		return tt.sum(logsumexp(tt.stacklists(logps)[:, :x.shape[0]], axis=0))

with pm.Model() as model:
		ard_prior = pm.InverseGamma('ard', 1., 1.)
		U = pm.Normal('factors', 0, ard_prior, shape=(D, R))
		V = pm.Normal('loadings', 0, ard_prior, shape=(batch_size, R))
		tau = pm.Gamma('tau', 1.0, 1.0, shape=batch_size * D)
		lambda_ = pm.Uniform('lambda', 0, 5, shape=batch_size * D)

		# True Distribution
		mu =, U.T)
		precision = tt.reshape(tau * lambda_, (batch_size, D))

		beta = pm.Beta('mask', 1, 1, shape=D)
		Y = NormalMissingValMixture('observations', beta, mu, precision, MISSING_PLACEHOLDER, D, observed=minibatch)

		advifit =, method='advi', callbacks=[pm.callbacks.CheckParametersConvergence()])


Hi Hugo,

I have been thinking about this but do not have a good insight, maybe one thing worth to try is used the masked array to code the data matrix (as in our examples), but passing a local RV to Variational Inference as coded for the latent variable. In the local RV you can have a small neural network that specifically dealt with the missing variables.

Sure, I’ll try and see how it goes, thanks. Although I am more worried about how to retrieve that NxR matrix V.

That said, even though I couldn’t assess how good this approch to missing value is without the full matrix V, ELBO does decrease by a few orders of magnitude and eventually converged, which at leasts suggest this model is numerically stable.

On the issue with Indian Buffet Process, I have the stick breaking approach as follows:

def stick_breaking_BBP(beta):
	return tt.extra_ops.cumprod(beta)

def create_BetaBernoulliProcess(alpha, K):
	beta = pm.Beta('beta', alpha, 1, shape=K)
	return pm.Deterministic('v', stick_breaking_BBP(beta))

class BetaBernoulliLatents(pm.Continuous):
	def __init__(self, pis, normal_prior, variance_prior, *args, **kwargs):
		super(BetaBernoulliLatents, self).__init__(*args, **kwargs)
		self.pis = pis
		self.normal_prior = normal_prior
		self.variance_prior = variance_prior
		self.mean = pm.floatX(np.zeros([batch_size, R]))

	def logp(self, x):
		chosen_likelihood = tt.log(self.pis) + pm.Normal.dist(self.normal_prior, self.variance_prior).logp(x)
		ignore_likelihood = tt.log(1 - self.pis) + pm.Constant.dist(0.).logp(x)
		logps = [chosen_likelihood, ignore_likelihood]
		return tt.sum(logsumexp(tt.stacklists(logps)[:, :x.shape[0]], axis=0))

def fit_IBP_FA():
	with pm.Model() as model:
		ard_prior = pm.InverseGamma('ard', 1., 1.)
		uv_prior = pm.Normal('uv_prior', 0., 20.)
		U = pm.Normal('factors', uv_prior, ard_prior, shape=(D, R))
		alpha = pm.Gamma('alpha', 1.0, 1.0)
		pis = create_BetaBernoulliProcess(alpha, R)
		V = BetaBernoulliLatents('bb_latents', pis, uv_prior, ard_prior, shape=(batch_size, R))
		tau = pm.Gamma('tau', 1.0, 1.0, shape=batch_size * D)
		lambda_ = pm.Uniform('lambda', 0, 5, shape=batch_size * D)

		# True Distribution
		mu = * choices, U.T)
		precision = tt.reshape(tau * lambda_, (batch_size, D))

		beta_mask = pm.Beta('mask', 1, 1, shape=D)
		Y = NormalMissingValMixture('observations', beta_mask, mu, precision, MISSING_PLACEHOLDER, D, observed=df_multi_strength[:batch_size])

However, the Bernoulli mask matrix will have to be integrated away for ADVI to work - this again will create a binary mixture of a Dirac centered at zero, and a Normal centered at wherever the latents are supposed to be - which results in a spike and slab latent that meanfield variational approximation is not designed to work with.

Sampling should in theory work though.


Yep, it is a very good approach to try on a small model and sample from it, then compare with the ADVI result. But be aware that even if it works in small model, in larger model ADVI with mean-field the variances of the parameters are likely underestimated. You should also try full-Rank ADVI.

Hi Hugo,

This is a great discussion.

Can you briefly explain quickly why you need to a Dirac distribution on the missing placeholder to use ADVI? I don’t quite follow the logic.


Hi alphamaximus,

So my dataset is 20 dimensional, with roughly 1.3 million data points. Unfortunately it’s got a lot of missing values. Of the 20 dimensions, the best is missing at 11%, the worst is missing at 85%. But even with 85% missingness, there are still roughly 195000 entries in that column - which means they may still be valuable for inference.

So my approach to handling missingness is pretty simple - maybe a bit naive, too. I simply placed a 20-dimensional Bernoulli layer on top of the unobserved “true” value of Y - so for each dimension of Y, if the Bernoulli variable is evaluated to 1, we observe the true value of Y - if it’s zero we consider it missing.

Due to my dataset having so much missingness, I thought using PyMC3’s built-in missing value mechanism may be unwise. Therefore I replaced all missing values with a -9999 placeholder.

Also, ADVI does not support inference on discrete latents, so I had to explicitly marginalize away the Bernoulli switches in the code. This results in a binary mixture with mixing proportions determined by p, where p parameterizes the Bernoulli switch.

So, with probability p we have a dirac on -9999 (that is, we don’t observe the “true” Y), and with probability 1 - p we observe it.

Hope this clarifies.

Actually looking back I should have maybe used a supervised learning approach instead.

1 Like

Thanks, I understand now.

I’ve been playing with missing values myself, and I recently discovered this paper:

Bayesian PCA gets good results. It would be great if someone implemented this for the PyMC3 documentation.

You might be also interested in this library:

1 Like

Also, instead of doing the Bernoulli layer, have you thought about just doing a simple mean value imputation before doing factor analysis, similar to what was done here:

# Perform mean value imputation
nan_mask = np.isnan([nan_mask] =[~nan_mask].mean()
1 Like

Not sure about the effects of replacing 85% of my missingness with imputed mean - a) that’s just way too much imputation, and b) single imputation is soooo un-Bayesian…

However, I’ve been looking into Multiple Imputation recently, and this book seems pretty good:

FancyImpute supports most of the single imputation strategies, but it seems its MICE algorithm only supports ordinal data.

For Bayesian PCA in PyMC3, I am a bit worried we are going to run into exactly the same problem with FA - if the dataset is large enough, we simply can’t specify a matrix U = Normal(‘factor_loadings’, mu=mu_prior, tau=tau_prior, shape=(N, K)) with large N.

Another thing I wanted to mention is the mixture of a dirac and a normal would result in a “spike and slab” distribution. This is particularly problematic in the Indian Buffet Process (IBP), where we also have a Bernoulli “mask” matrix that needs to have the Bernoulli latents marginalized away. Unlike my missing value mixture, where it is observed (thus no inference needed), this Beroulli mask for IBP is completely hidden - which means its posterior needs to be inferred. However, it is quite unlikely that ADVI could come up with a reasonable approximation for an “uglily shaped” spike-and-slab posterior.


1 Like

Here is a Bayesian PCA implementation in PyMC3: It’s based on I don’t think it would scale particularly well, as expressed previously.

1 Like