Hey there folks,
Trying out PyMC3 to cluster some search queries. As they're search queries, my belief is that each token in a query is sampled from a single topic/distribution, rather than from a mixture as in LDA.
I’ve pinched some code from here: https://mathformachines.com/posts/bayesian-topic-modeling/#mixture-of-unigrams-naive-bayes
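For context, the snippet below assumes a `docs` DataFrame and a fitted `encoder` that aren't shown. A minimal sketch of that setup (the corpus here is made up, and I'm assuming the encoder is a scikit-learn `LabelEncoder`, as in the linked post) might look like:

```python
# Hypothetical setup matching the variable names used in the model code:
# a toy corpus of queries, tokenized, with tokens integer-encoded.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

corpus = [
    "cheap flights paris",
    "hotel deals paris",
    "python list comprehension",
]
rows = [(doc_id, tok) for doc_id, text in enumerate(corpus) for tok in text.split()]
docs = pd.DataFrame(rows, columns=["doc_id", "token"])

# Map each token string to an integer id; encoder.classes_ is the vocabulary.
encoder = LabelEncoder()
docs["token_id"] = encoder.fit_transform(docs["token"])
```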
```python
import numpy as np
import pymc3 as pm

K = 3  # number of topics

data_index = docs['doc_id'].to_numpy()
data = docs['token_id'].to_numpy()
vocab = list(encoder.classes_)  # the vocabulary
V = len(vocab)
D = max(data_index) + 1

# Pseudo-counts for topics and words.
alpha = np.ones(K) * 0.8
beta = np.ones(V) * 0.8

with pm.Model() as naive_model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)
    # Word distributions for the K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    # Topic of each document
    z = pm.Categorical("z", p=theta, shape=D)
    # Words in documents
    p = phi[z][data_index]
    w = pm.Categorical("w", p=p, observed=data)

with naive_model:
    draw = 1000
    # chains/cores has some bugs on mac, it seems
    naive_trace = pm.sample(draw, tune=1000, chains=2, progressbar=True)
```
Sampling is speedy and the output is fine when I have a small number of observed docs (e.g. 100), but with 1,000 docs, sampling takes about 100 times longer.