Hey there folks,
Trying out pymc3 to cluster some search queries. As they're search queries, my belief is that each token in a query is sampled from a single topic/distribution, rather than a mixture as per LDA.
I’ve pinched some code from here: https://mathformachines.com/posts/bayesian-topic-modeling/#mixture-of-unigrams-naive-bayes
import numpy as np
import pymc3 as pm

# Number of topics
K = 3

data_index = docs['doc_id'].to_numpy()   # document id for each token
data = docs['token_id'].to_numpy()       # integer-encoded token ids
vocab = list(encoder.classes_)           # the vocabulary
V = len(vocab)
D = max(data_index) + 1

# Pseudo-counts for topics and words
alpha = np.ones(K) * 0.8
beta = np.ones(V) * 0.8

with pm.Model() as naive_model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)
    # Word distributions for the K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    # Topic of each document
    z = pm.Categorical("z", p=theta, shape=D)
    # Words in documents: each token uses its document's topic's word distribution
    p = phi[z][data_index]
    w = pm.Categorical("w", p=p, observed=data)

with naive_model:
    draws = 1000
    # chains/cores seems to have some bugs on mac
    naive_trace = pm.sample(draws, tune=1000, chains=2, progressbar=True)
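For reference, here's a tiny numpy-only sketch of the input format I'm assuming (the `docs` DataFrame and `encoder` above are mine, so the toy arrays here are just stand-ins): the corpus is flattened token-by-token, and `phi[z][data_index]` works because `z[data_index]` broadcasts one topic per document out to one topic per token.

```python
import numpy as np

# Toy stand-in for docs['doc_id'] / docs['token_id']: one row per token.
doc_ids = np.array([0, 0, 1, 1, 1, 2])    # data_index: which query each token belongs to
token_ids = np.array([3, 1, 0, 3, 2, 1])  # data: integer-encoded tokens

D = doc_ids.max() + 1    # number of documents (queries)
V = token_ids.max() + 1  # vocab size (normally len(encoder.classes_))

# Mirrors the phi[z][data_index] lookup in the model:
# z holds one topic per document; indexing by doc_ids spreads it to tokens.
z = np.array([2, 0, 1])        # pretend sampled topics for the 3 docs
per_token_topic = z[doc_ids]   # shape (N,): one topic per observed token
print(per_token_topic)         # [2 2 0 0 0 1]
```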
Sampling is speedy and the output is fine when I have a small number of observed docs (e.g. 100), but with 1000 docs the sampling takes roughly 100 times longer.
Any thoughts?