@cluhmann @junpenglao Thanks so much for your responses; they help a lot. Following up on this, I am building on the LDA model to turn it into a supervised LDA (McAuliffe & Blei, 2007):
I am having difficulty writing efficient code in PyMC to get the average topic frequencies, i.e., z_bar in the algorithm above, so the model is still very slow. My code is below. Do you have any suggestions for making it more efficient? Thanks so much!
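To make the quantity concrete: z_bar for a document is just the fraction of its words assigned to each topic. A minimal NumPy illustration (the toy sizes K = 3 and the assignment vector z are my own, not from the model):

```python
import numpy as np

K = 3                         # number of topics
z = np.array([0, 2, 2, 1])    # topic assignment of each word in one document

# z_bar[k] = fraction of this document's words assigned to topic k
z_bar = np.bincount(z, minlength=K) / len(z)
# z_bar == [0.25, 0.25, 0.5]
```

This is the per-document vector that, stacked over all M documents, becomes the design matrix for the regression on y.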
```python
import pymc3 as pm
from tqdm import tqdm

def LDA_GLM(omega, y, K, M, N_V, Ni, alpha, gamma):
    with pm.Model() as model:
        eta = pm.Normal('eta', mu=0, sigma=1, shape=K)  # regression coefficients on y
        sigma2 = pm.InverseGamma('sigma2', alpha=1.2, beta=1.5)  # variance of y
        phi = pm.Dirichlet('phi', a=gamma, shape=(K, N_V))  # topic-word matrix
        theta = pm.Dirichlet('theta', a=alpha, shape=(M, K))  # document-topic matrix
        pm.DensityDist('doc', logp_lda_doc(phi, theta), observed=omega)  # word-document likelihood
        # topic assignment of every word: one Categorical per document (this loop is the bottleneck)
        Z = [pm.Categorical('z_{}'.format(d), p=theta[d], shape=Ni[d]) for d in tqdm(range(M))]
        # average topic frequencies: fraction of words in doc d assigned to topic k
        Z_bar = [[pm.math.sum(pm.math.eq(Z[d], k)) / Ni[d] for k in range(K)]
                 for d in tqdm(range(M))]
        Z_bar = pm.math.stack(Z_bar, axis=0)  # (M, K) design matrix
        # sigma2 is a variance, so pass its square root as the standard deviation
        pm.Normal('y', mu=pm.math.dot(Z_bar, eta), sigma=pm.math.sqrt(sigma2),
                  shape=M, observed=y)
    return model
```
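One thing I considered is collapsing the double loop over d and k into a single one-hot comparison per document, since comparing z against all K topics at once avoids the inner Python loop. Here is the idea in plain NumPy (same toy sizes as before, my own illustration; the equivalent symbolic ops would replace the `pm.math.eq` loop):

```python
import numpy as np

K = 3
z = np.array([0, 2, 2, 1])  # topic assignment of each word in one document

# Broadcast z against all K topic labels at once: shape (Ni, K)
one_hot = (z[:, None] == np.arange(K)).astype(float)

# Column means give the average topic frequencies in one vectorized step
z_bar = one_hot.mean(axis=0)
# z_bar == [0.25, 0.25, 0.5]
```

I am not sure this buys much inside the PyMC graph, since the per-document loop over M Categorical variables remains; any pointers on marginalizing Z out entirely would also be welcome.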