I am building a supervised LDA model in PyMC based on the following algorithm (McAuliffe & Blei, 2007):
I am having difficulty writing efficient PyMC code to compute the average topic frequencies (z_bar in the algorithm above), and the code below runs very slowly. What is the best way to compute the average topic frequencies and implement supervised LDA in PyMC?
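For reference, this is how I understand z_bar and its role in the response model. The indexing by document d and the indicator notation are my own shorthand, chosen to match the variable names in my code (Ni[d] is the number of words in document d, eta the regression coefficients, sigma2 the variance of y):

```latex
\bar{z}_{d,k} = \frac{1}{N_d} \sum_{n=1}^{N_d} \mathbb{1}[z_{d,n} = k],
\qquad
y_d \mid \bar{z}_d, \eta, \sigma^2 \sim \mathcal{N}\!\left(\eta^\top \bar{z}_d,\; \sigma^2\right)
```

So each row of the design matrix is a length-K vector of topic proportions that sums to 1.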
My code (modified from the PyMC LDA code here):
import pymc as pm  # or pymc3, depending on the installed version
from tqdm import tqdm

def LDA_GLM(omega, y, K, M, N_V, Ni, alpha, gamma):
    with pm.Model() as model:
        eta = pm.Normal('b', mu=0, sigma=1, shape=K)  # coefficients of the linear regression on y
        sigma2 = pm.InverseGamma('sigma2', alpha=1.2, beta=1.5)  # variance of y
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))  # topic-word matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))  # document-topic matrix
        # logp_lda_doc and doc_t are assumed to be defined as in the linked LDA code
        omega = pm.DensityDist("doc", logp_lda_doc(phi, theta), observed=doc_t)  # word-document likelihood
        Z = [pm.Categorical("z_{}".format(d), p=theta[d], shape=Ni[d]) for d in tqdm(range(M))]  # topic assignments
        # pm.math.eq builds the elementwise comparison; == on a tensor variable is a plain Python comparison
        Z_bar = [[pm.math.sum(pm.math.eq(Z[i], k)) / Ni[i] for k in range(K)] for i in tqdm(range(len(Z)))]  # average topic frequencies
        Z_bar = pm.math.stack(Z_bar, axis=0)  # turn Z_bar into an (M, K) design matrix
        # sigma2 is a variance, so the standard deviation passed to Normal is its square root
        Y = pm.Normal('y', mu=pm.math.dot(Z_bar, eta), sigma=pm.math.sqrt(sigma2), shape=M, observed=y)  # outcome variable
    return model
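To make the target concrete, here is a minimal NumPy sketch (not PyMC) of the quantity I am trying to compute, written without the nested Python loops. It assumes the topic assignments are stored in one padded (M, N_max) integer array with a mask for the real word positions. The function name `average_topic_frequencies` and the toy data are hypothetical, and I have not verified that the same broadcasting approach is efficient on symbolic tensors inside a PyMC model.

```python
import numpy as np

def average_topic_frequencies(z_padded, lengths, K):
    """Return an (M, K) matrix whose row d is z_bar for document d.

    z_padded: (M, N_max) integer topic assignments; positions beyond a
              document's length are padding and are ignored.
    lengths:  (M,) number of words in each document (Ni in my code).
    K:        number of topics.
    """
    z_padded = np.asarray(z_padded)
    lengths = np.asarray(lengths)
    n_max = z_padded.shape[1]
    # mask[d, n] is True only for real (non-padding) word positions
    mask = np.arange(n_max)[None, :] < lengths[:, None]
    # one_hot[d, n, k] is True when word n of document d is assigned topic k
    one_hot = z_padded[:, :, None] == np.arange(K)[None, None, :]
    # count assignments per topic over real words only, then normalise by length
    counts = (one_hot & mask[:, :, None]).sum(axis=1)
    return counts / lengths[:, None]

# Hypothetical toy data: 2 documents, 3 topics.
z = [[0, 2, 2, 1],
     [1, 1, 0, 0]]   # the second document has only 2 real words
Ni = [4, 2]
z_bar = average_topic_frequencies(z, Ni, K=3)
print(z_bar)
```

Each row of `z_bar` sums to 1, so `z_bar @ eta` gives the mean of y for every document in a single matrix product.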