How to speed up Pymc model compilation of a LDA topic model

Yixiang_Xu · August 15, 2022, 4:35pm

I am trying to implement the LDA model in Pymc with 1000 documents, each document has word length around 150 - 200 and the vocabulary size is 500. I wrote the following codes and it worked on a smaller sample (e.g., 10 document). However, the codes takes a very long time to compile with the current sample size (1000 documents, word length 150 - 200, 500 vocabulary). The most time consuming part is the last line of code which is to generate omega (the word document), with an estimated time cost 3 - 5 hours to compile (not even reaching the sampling stage). Is there any suggestions on how to speed up the code compilation?
FYI, the code is being implemented on Colab Pro +.

Codes -

def LDA_GLM(omeg, Ni, K, M, N_V, alpha, gamma):
# omega is the document
# Ni is the word count per document
# K is number of topics, M is number of documents, N_V is vocabulary size
	with pm.Model() as model:
		phi = pm.distributions.Dirichlet('phi', a=gamma, shape=(K, N_V))  # topic-vocabulary matrix
		theta = pm.distributions.Dirichlet('theta', a=alpha, shape=(M, K))  # topic-document matrix
		DATA_d = [pm.Categorical("z_{}".format(d), p=theta[d], shape=Ni[d]) for d in tqdm(range(M))]  # assign topic to each word token 
		omega = [pm.Categorical("w_%i_%i" % (d,i),
								p = phi[DATA_d[d][i]],
								#value=omeg[d][i], 
								observed=omeg[d][i])
								for d in tqdm(range(M)) for i in range(Ni[d])] # assign word to each word token based on the chosen topic 
    return model


M = 1000
K = 3
N_V = 500

alpha = np.array([10.0, 5.0, 8.0])
gamma = np.random.choice([10, 20, 30], N_V)

ldaglm = LDA_GLM(omega, Ni, K, M, N_V, alpha, gamma)

cluhmann · August 15, 2022, 5:01pm

The loops you are using are going to be very slow. Using parameter arrays will allow things to be vectorized and thus much more efficient. I would suggest taking a look the LDA example found here for some suggestions.

junpenglao · August 15, 2022, 7:02pm

+1 to @cluhmann. Also, you should use mixture distribution as the current implementation is quite inefficient. I suggest you take a look at: Naive Bayes model with PyMC3 - #10 by junpenglao

Yixiang_Xu · August 16, 2022, 6:18am

@cluhmann @junpenglao Thanks so much for your responses. Help a lot. Follow up on this - I am building on the LDA model to turn it into a supervised LDA (Mcauliffe, Blei 2007):

I am having difficulty writing efficient codes to get the average topic frequencies i.e., z_bar in the above algorithm using Pymc so the code is still super slow. Below is my code. Do you have any insights to turn it into more efficient? Thanks so much !!!

def LDA_GLM(omega, y, K, M, N_V, Ni, alpha, gamma):
	with pm.Model() as model:
		eta = pm.Normal('b', mu=0, sigma=1, shape=K) # coefficient of linear regression on y
		sigma2 = pm.InverseGamma('sigma2',alpha =1.2, beta= 1.5) # variance of y 
		phi = pm.distributions.Dirichlet('phi', a=gamma, shape=(K, N_V)) # topic word matrix
		theta = pm.distributions.Dirichlet('theta', a=alpha, shape=(M, K)) # topic document matrix 
		omega = pm.DensityDist("doc", logp_lda_doc(phi, theta), observed=doc_t) # word document
        Z = [pm.Categorical("z_{}".format(d), p=theta[d], shape=Ni[d]) for d in tqdm(range(M))] # topic assignment 
		Z_bar = [[pm.math.sum([Z[i] == k]) / Ni[i] for k in range(K)] for i in tqdm(range(len(Z)))] # average topic 
		Z_bar = pm.math.stack(Z_bar, axis = 0) # turn Z into design matrix
		Y = pm.Normal('y', mu= pm.math.dot(Z_bar, eta), sigma = sigma2, shape =M, observed =y) # outcome variable 
	return model

junpenglao · August 17, 2022, 8:18am

The model you are trying to implemented is almost exactly the one mentioned above: Naive Bayes model with PyMC3 - #10 by junpenglao. Did you take a look of that?

Yixiang_Xu · August 17, 2022, 6:09pm

@junpenglao Thanks for pointing out! Now I am able to figure out. Appreciate it!

Yixiang_Xu · August 22, 2022, 11:39pm

@junpenglao After checking your post and other post, now I am able to develop the efficient codes which works perfect for MCMC (thank you so much!!), now I want to use the same code for Variational Inference, but run into the following error message: Discrete variables are not supported by VI: count ~ Multinomial. Do you have any suggestions on what i shall do in this case? Just to remind you - I am attempting to implement supervised LDA using Variational Inference. Thanks so much!!!

def LDA_GLM(omega, y, K, M, N_V, n, alpha, gamma):
	with pm.Model() as model:
		eta = pm.Normal('b', mu=0, sigma=1, shape=K) # coefficient of linear regression on y
		sigma2 = pm.InverseGamma('sigma2',alpha =1.2, beta= 1.5) # variance of y 
		phi = pm.distributions.Dirichlet('phi', a=gamma, shape=(K, N_V)) # topic word matrix
		theta = pm.distributions.Dirichlet('theta', a=alpha, shape=(M, K)) # topic document matrix 
		omega = pm.DensityDist("doc", logp_lda_doc(phi, theta), observed=doc_t) # word document
        X_count = pm.Multinomial("count", n, p=theta, shape=(M, K)) # this line of code did not work for VI 
		X = pm.Deterministic("X", X_count/n)
		Y = pm.Normal('y', mu= pm.math.dot(X, eta), sigma = sigma2, shape =M, observed =y) # outcome variable 
	return model


ldaglm = LDA_GLM(omega2, Y, W, K, M, N_V,  n, alpha, gamma_null)
with ldaglm:
    ldaglm_vi = pm.fit(method="advi")

junpenglao · August 23, 2022, 4:50am

Since X is latent, the easiest way is to write to pretend it is continuous:

Y = pm.Normal('y', mu= pm.math.dot(theta, eta), sigma = sigma2, shape =M, observed =y) # outcome variable

There is also an old example of LDA with VI in PyMC3 you can get some inspiration from:
https://docs.pymc.io/en/v3/pymc-examples/examples/variational_inference/lda-advi-aevb.html

Yixiang_Xu · August 23, 2022, 6:26am

@junpenglao Thanks! it would be nice to replace X with theta, but they are not exactly the same model, still want to stick to the original model specification. I have also checked the link post - it is not however on supervised LDA. Do you happen to know what is the reason that VI in PyMC3 could not handle discrete variables? There is really few resource available on this issue. I barely could find one. It does not seem an issue for VI in theory to handle discrete variable, not sure why PyMC3 could not. Thanks so much!

junpenglao · August 23, 2022, 6:39am

It is however very much an issue for VI to handle discrete variable in practice as you most likely need gradient to train your model. Usually you need reparameterization trick like Gumbel-Softmax (see e.g., [1611.01144] Categorical Reparameterization with Gumbel-Softmax)

In your model, you sample a Multinomial with total count n, and then normalized it again by dividing n, which again gives you back the expectation of theta - modeling it this way is just extra stochastic-ness that get marginalized out, so I would advise to use theta directly and generate X_count after model inference.

Topic		Replies	Views
LDA implementation with pymc3 Questions	21	5802	August 22, 2020
Questions from replicating Latent Dirichlet Allocation work Questions	1	425	June 14, 2021
Mixture of unigrams model - slow when large number of observations Questions	1	431	January 14, 2021
How to return documents assigned to topics? Questions	0	348	February 2, 2021
Advice for speeding up large multivariate models Questions	17	2141	April 12, 2018

How to speed up Pymc model compilation of a LDA topic model

Related topics