How to implement a supervised LDA model more efficiently in PyMC

I am building a supervised LDA (sLDA) model in PyMC based on the following algorithm (McAuliffe & Blei, 2007):
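In brief, the generative process for document $d$ with $N_d$ words (writing $\phi$ for the topic-word distributions, to match the code below) is:

$$
\theta_d \sim \operatorname{Dir}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \operatorname{Mult}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \operatorname{Mult}(\phi_{z_{d,n}}),
$$

$$
\bar z_d = \frac{1}{N_d} \sum_{n=1}^{N_d} z_{d,n}, \qquad
y_d \mid \bar z_d \sim \mathcal N\!\left(\eta^\top \bar z_d,\; \sigma^2\right),
$$

where each $z_{d,n}$ is a one-hot indicator vector, so $\bar z_d$ is the vector of empirical topic frequencies in document $d$.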


I am having difficulty writing efficient PyMC code to compute the average topic frequencies (z_bar in the algorithm above), and the attached code runs very slowly. What is the best way to compute the average topic frequencies and implement supervised LDA in PyMC?

My code (modified from the PyMC LDA example linked here):

import pymc3 as pm   # the model follows the PyMC3-era API of the linked LDA example
from tqdm import tqdm

def LDA_GLM(omega, y, K, M, N_V, Ni, alpha, gamma):
    with pm.Model() as model:
        eta = pm.Normal('eta', mu=0, sigma=1, shape=K)  # regression coefficients for y
        sigma2 = pm.InverseGamma('sigma2', alpha=1.2, beta=1.5)  # variance of y
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))  # topic-word matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))  # document-topic matrix
        # word likelihood; logp_lda_doc is taken from the linked LDA example
        doc = pm.DensityDist('doc', logp_lda_doc(phi, theta), observed=omega)
        # per-word topic assignments, one Categorical per document
        Z = [pm.Categorical('z_{}'.format(d), p=theta[d], shape=Ni[d]) for d in tqdm(range(M))]
        # average topic frequencies per document
        Z_bar = [[pm.math.sum(pm.math.eq(Z[d], k)) / Ni[d] for k in range(K)] for d in tqdm(range(M))]
        Z_bar = pm.math.stack([pm.math.stack(row) for row in Z_bar], axis=0)  # (M, K) design matrix
        # response; sigma2 is a variance, so its square root is passed as the scale
        Y = pm.Normal('y', mu=pm.math.dot(Z_bar, eta), sigma=pm.math.sqrt(sigma2), shape=M, observed=y)
    return model
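
One thing I have been considering is building Z_bar with tensor ops instead of the nested Python loops over documents and topics, so the graph has far fewer nodes, but I am not sure it is the right approach. A minimal sketch of that part, assuming the Theano backend used by the linked PyMC3 LDA example:

```python
import theano.tensor as tt  # backend assumed from the PyMC3-era LDA example

# Inside the model block: compare each document's topic assignments against
# all K topic ids at once, then average over the words of that document.
topic_ids = tt.arange(K)                                   # shape (K,)
Z_bar = tt.stack(
    [tt.eq(Z[d][:, None], topic_ids).mean(axis=0)          # shape (K,) per document
     for d in range(M)],
    axis=0,
)                                                          # shape (M, K)
```

This removes the double loop, but it still creates one Categorical variable per document, which I suspect is the main cost. The linked LDA example avoids that for the words by marginalising z out inside logp_lda_doc, so I wonder whether something similar is the recommended route here, e.g. regressing y on theta directly, which would replace z_bar by its conditional expectation given theta and so only approximate the sLDA response model.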

Thanks!!

Hi @Yixiang_Xu, please follow up on your original post (How to speed up Pymc model compilation of a LDA topic model - #4 by Yixiang_Xu) instead of opening a new post :slight_smile:
I am closing this post for now.
