How to speed up PyMC model compilation of an LDA topic model

I am trying to implement the LDA model in PyMC with 1000 documents; each document has a word length of around 150 - 200 and the vocabulary size is 500. I wrote the following code and it worked on a smaller sample (e.g., 10 documents). However, the code takes a very long time to compile with the current sample size (1000 documents, word length 150 - 200, vocabulary size 500). The most time-consuming part is the last line of code, which generates omega (the observed words of each document), with an estimated compile time of 3 - 5 hours (not even reaching the sampling stage). Are there any suggestions on how to speed up the code compilation?
FYI, the code is being implemented on Colab Pro +.

Code:

import numpy as np
import pymc3 as pm
from tqdm import tqdm

def LDA_GLM(omeg, Ni, K, M, N_V, alpha, gamma):
    # omeg: the documents (one array of word ids per document)
    # Ni: word count per document
    # K: number of topics, M: number of documents, N_V: vocabulary size
    with pm.Model() as model:
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))      # topic-vocabulary matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))    # document-topic matrix
        # assign a topic to each word token
        DATA_d = [pm.Categorical("z_{}".format(d), p=theta[d], shape=Ni[d])
                  for d in tqdm(range(M))]
        # assign a word to each word token based on its chosen topic
        omega = [pm.Categorical("w_%i_%i" % (d, i),
                                p=phi[DATA_d[d][i]],
                                observed=omeg[d][i])
                 for d in tqdm(range(M)) for i in range(Ni[d])]
    return model


M = 1000
K = 3
N_V = 500

alpha = np.array([10.0, 5.0, 8.0])
gamma = np.random.choice([10, 20, 30], N_V)

ldaglm = LDA_GLM(omega, Ni, K, M, N_V, alpha, gamma) 

The loops you are using are going to be very slow. Using parameter arrays will allow things to be vectorized and thus much more efficient. I would suggest taking a look at the LDA example found here for some suggestions.
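For illustration only, here is a minimal sketch (not the linked example itself) of what a vectorized version could look like, assuming the ragged documents have been flattened into two parallel NumPy arrays, a hypothetical doc_idx (the document id of every token) and omega_flat (the vocabulary id of every token), with M, K, N_V, Ni, alpha, and gamma as in the original post:

import numpy as np
import pymc3 as pm

# assumed preprocessing: flatten the ragged documents into parallel 1-D arrays
doc_idx = np.repeat(np.arange(M), Ni)      # document id of every token
omega_flat = np.concatenate(omega)         # vocabulary id of every token

with pm.Model() as lda:
    phi = pm.Dirichlet("phi", a=gamma, shape=(K, N_V))    # topic-word matrix
    theta = pm.Dirichlet("theta", a=alpha, shape=(M, K))  # document-topic matrix
    # one batched Categorical over all tokens instead of a Python loop per document and word
    z = pm.Categorical("z", p=theta[doc_idx], shape=len(doc_idx))
    w = pm.Categorical("w", p=phi[z], observed=omega_flat)

This builds a handful of random variables instead of hundreds of thousands, which is what makes the graph cheap to compile. Note that z is still discrete, so PyMC will fall back to a non-gradient sampler for it; the mixture suggestion below removes the discrete variables entirely.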


+1 to @cluhmann. Also, you should use a mixture distribution, as the current implementation is quite inefficient. I suggest you take a look at: Naive Bayes model with PyMC3 - #10 by junpenglao
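As a hedged sketch of that marginalization idea (not the exact model from the linked post): if each document is summarized as a row of word counts in an assumed (M, N_V) matrix doc_counts, the per-token topic assignments can be summed out analytically, leaving only the continuous phi and theta:

import pymc3 as pm
import theano.tensor as tt

def logp_docs(phi, theta):
    def logp(counts):
        # marginal word probability: p(word = v | document d) = sum_k theta[d, k] * phi[k, v]
        word_probs = tt.dot(theta, phi)              # (M, N_V)
        return tt.sum(counts * tt.log(word_probs))   # multinomial log-likelihood up to a constant
    return logp

with pm.Model() as lda_marginal:
    phi = pm.Dirichlet("phi", a=gamma, shape=(K, N_V))
    theta = pm.Dirichlet("theta", a=alpha, shape=(M, K))
    w = pm.DensityDist("w", logp_docs(phi, theta), observed=doc_counts)

With the discrete assignments marginalized out, gradient-based samplers (and later ADVI) can work on phi and theta directly.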


@cluhmann @junpenglao Thanks so much for your responses. They help a lot. To follow up on this - I am building on the LDA model to turn it into a supervised LDA (Blei & McAuliffe, 2007).

I am having difficulty writing efficient code to get the average topic frequencies, i.e., z_bar in the sLDA model, using PyMC, so the code is still super slow. Below is my code. Do you have any insights on how to make it more efficient? Thanks so much!!!

def LDA_GLM(omega, y, K, M, N_V, Ni, alpha, gamma):
    with pm.Model() as model:
        eta = pm.Normal('b', mu=0, sigma=1, shape=K)             # regression coefficients on y
        sigma2 = pm.InverseGamma('sigma2', alpha=1.2, beta=1.5)  # variance of y
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))       # topic-word matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))     # document-topic matrix
        # marginal word-document likelihood (logp_lda_doc and doc_t defined elsewhere)
        doc = pm.DensityDist("doc", logp_lda_doc(phi, theta), observed=doc_t)
        Z = [pm.Categorical("z_{}".format(d), p=theta[d], shape=Ni[d])
             for d in tqdm(range(M))]                            # topic assignment per word token
        Z_bar = [[pm.math.sum(pm.math.eq(Z[i], k)) / Ni[i] for k in range(K)]
                 for i in tqdm(range(len(Z)))]                   # average topic frequencies
        Z_bar = pm.math.stack(Z_bar, axis=0)                     # stack into an (M, K) design matrix
        Y = pm.Normal('y', mu=pm.math.dot(Z_bar, eta), sigma=sigma2,
                      shape=M, observed=y)                       # outcome variable
    return model
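For what it is worth, the two Python loops over documents and topics can be removed while keeping the discrete Z as written. A rough sketch, using hypothetical names doc_idx, z_flat, z_onehot and topic_counts, and assuming Ni is a NumPy int array:

import numpy as np
import theano.tensor as tt

doc_idx = np.repeat(np.arange(M), Ni)          # document id of every token
z_flat = tt.concatenate(Z)                     # all topic assignments, flattened
z_onehot = tt.eye(K)[z_flat]                   # (total_tokens, K) one-hot topics

# scatter-add the one-hot rows into their documents, then normalize by document length
topic_counts = tt.inc_subtensor(tt.zeros((M, K))[doc_idx], z_onehot)
Z_bar = topic_counts / Ni[:, None]             # (M, K) average topic frequencies

That said, the marginalized formulation in the linked Naive Bayes post (or simply regressing on theta, as discussed further down) avoids the discrete Z altogether, which is usually the faster route.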

The model you are trying to implement is almost exactly the one mentioned above: Naive Bayes model with PyMC3 - #10 by junpenglao. Did you take a look at that?

@junpenglao Thanks for pointing that out! I am now able to figure it out. Appreciate it!


@junpenglao After checking your post and the other post, I am now able to write efficient code which works perfectly for MCMC (thank you so much!!). Now I want to use the same code for variational inference, but I run into the following error message: Discrete variables are not supported by VI: count ~ Multinomial. Do you have any suggestions on what I should do in this case? Just to remind you - I am attempting to implement supervised LDA using variational inference. Thanks so much!!!

def LDA_GLM(omega, y, K, M, N_V, n, alpha, gamma):
    with pm.Model() as model:
        eta = pm.Normal('b', mu=0, sigma=1, shape=K)             # regression coefficients on y
        sigma2 = pm.InverseGamma('sigma2', alpha=1.2, beta=1.5)  # variance of y
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))       # topic-word matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))     # document-topic matrix
        # marginal word-document likelihood (logp_lda_doc and doc_t defined elsewhere)
        doc = pm.DensityDist("doc", logp_lda_doc(phi, theta), observed=doc_t)
        X_count = pm.Multinomial("count", n, p=theta, shape=(M, K))  # this line does not work for VI
        X = pm.Deterministic("X", X_count / n)                   # average topic frequencies
        Y = pm.Normal('y', mu=pm.math.dot(X, eta), sigma=sigma2,
                      shape=M, observed=y)                       # outcome variable
    return model


ldaglm = LDA_GLM(omega2, Y, K, M, N_V, n, alpha, gamma_null)
with ldaglm:
    ldaglm_vi = pm.fit(method="advi")

Since X is latent, the easiest way is to pretend it is continuous and plug theta in directly:

Y = pm.Normal('y', mu=pm.math.dot(theta, eta), sigma=sigma2, shape=M, observed=y)  # outcome variable

There is also an old example of LDA with VI in PyMC3 that you can get some inspiration from:
https://docs.pymc.io/en/v3/pymc-examples/examples/variational_inference/lda-advi-aevb.html

@junpenglao Thanks! It would be nice to replace X with theta, but they are not exactly the same model, and I still want to stick to the original model specification. I have also checked the linked post - it is, however, not about supervised LDA. Do you happen to know the reason why VI in PyMC3 cannot handle discrete variables? There are really few resources available on this issue; I could barely find one. It does not seem to be an issue in theory for VI to handle discrete variables, so I am not sure why PyMC3 cannot. Thanks so much!

It is, however, very much an issue in practice for VI to handle discrete variables, as you most likely need gradients to train your model. Usually you need a reparameterization trick like Gumbel-Softmax (see e.g., [1611.01144] Categorical Reparameterization with Gumbel-Softmax).
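For intuition only, here is a small NumPy sketch of the Gumbel-Softmax idea from that paper (this is not something PyMC3's ADVI does for you automatically):

import numpy as np

def gumbel_softmax_sample(logits, temperature=0.5, rng=None):
    # perturb the class logits with Gumbel(0, 1) noise and apply a tempered softmax,
    # giving a differentiable approximation of a one-hot categorical sample
    if rng is None:
        rng = np.random.default_rng()
    gumbel_noise = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    y = (logits + gumbel_noise) / temperature
    y = y - y.max()                            # numerical stability
    return np.exp(y) / np.exp(y).sum()

As the temperature goes to 0 the samples approach true one-hot draws from softmax(logits); larger temperatures give smoother samples whose gradients are easier to train with.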

In your model, you sample a Multinomial with total count n and then normalize it by dividing by n, which gives you back theta in expectation - modeling it this way just adds extra stochasticity that gets marginalized out, so I would advise using theta directly and generating X_count after model inference.
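A minimal sketch of that last suggestion, assuming ldaglm_vi is the approximation returned by pm.fit above and n is the per-document total word count used in the Multinomial:

import numpy as np

trace = ldaglm_vi.sample(1000)                    # draws from the fitted ADVI approximation
theta_post = trace["theta"].mean(axis=0)          # (M, K) posterior mean topic proportions
# simulate the counts outside the model instead of declaring a discrete Multinomial inside it
X_count = np.stack([np.random.multinomial(n, p / p.sum()) for p in theta_post])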