Mixture of unigrams model - slow when large number of observations

Hey there folks,

I'm trying out PyMC3 to cluster some search queries. Since they're search queries, my belief is that each token in a query is sampled from a single topic/distribution, rather than a mixture as in LDA.

I’ve pinched some code from here: https://mathformachines.com/posts/bayesian-topic-modeling/#mixture-of-unigrams-naive-bayes

# Number of topics
K = 3
data_index = docs['doc_id'].to_numpy()  # document id for each token
data = docs['token_id'].to_numpy()      # token ids
vocab = list(encoder.classes_)          # the vocabulary
V = len(vocab)
D = max(data_index) + 1                 # number of documents

# Pseudo-counts for topics and words.
alpha = np.ones(K)*0.8
beta = np.ones(V)*0.8

with pm.Model() as naive_model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)

    # Word distributions for K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))

    # Topic of documents
    z = pm.Categorical("z", p=theta, shape=D)

    # Words in documents: each token is drawn from its document's topic's word distribution
    p = phi[z][data_index]
    w = pm.Categorical("w", p=p, observed=data)

with naive_model:
    draw = 1000
    # Chains/cores seems to have some bugs on Mac
    naive_trace = pm.sample(draw, tune=1000, chains=2, progressbar=True)

Sampling is speedy and the output is fine when I have a small number of observed docs (e.g. 100), but with 1000 docs the sampling takes about 100 times longer.

Any thoughts?

So it turns out it's much more efficient to marginalise out the categorical topic parameter, as per this repo:


There's a very good explanation in the Stan docs: https://mc-stan.org/docs/2_18/stan-users-guide/latent-dirichlet-allocation.html
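For what it's worth, the marginalisation that the Stan doc describes can be checked in plain NumPy/SciPy. The idea: instead of sampling a discrete topic `z_d` per document, compute log p(doc_d) = logsumexp over topics k of [log theta_k + sum of log phi[k, w] over the document's tokens]. The sketch below uses made-up toy data (`K`, `V`, `D`, `N`, and fixed point values for `theta`/`phi` are all assumptions for illustration, not the posterior):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy sizes and data (made up, standing in for the real corpus)
K, V, D, N = 3, 10, 4, 20                # topics, vocab size, docs, total tokens
theta = np.full(K, 1.0 / K)              # global topic weights (a point value, not a draw)
phi = rng.dirichlet(np.ones(V), size=K)  # per-topic word distributions, shape (K, V)
data = rng.integers(0, V, size=N)        # token id of each word
data_index = rng.integers(0, D, size=N)  # document each word belongs to

# Per-token log-prob under each topic, shape (N, K)
token_logp = np.log(phi).T[data]

# Sum token log-probs within each document, shape (D, K)
doc_topic_logp = np.zeros((D, K))
np.add.at(doc_topic_logp, data_index, token_logp)

# Marginalise the topic: log p(doc_d) = logsumexp_k [log theta_k + sum_{w in d} log phi_{k,w}]
doc_logp = logsumexp(np.log(theta) + doc_topic_logp, axis=1)
total_loglike = doc_logp.sum()
```

With the discrete `z` gone, everything in the model is continuous, so NUTS can sample `theta` and `phi` directly; in PyMC3 the same per-document sum can be added to a model via `pm.Potential` on the marginal log-likelihood.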

I think MCMC is still very slow for LDA with lots of dimensions compared to VI methods, so I'm planning to investigate those further. It feels like overkill to approximate the whole posterior given what LDA is usually for (i.e. we don't really care about uncertainty, we just want 'nice' topics).

Still at the start of my Bayesian journey, so don’t quote me on that!
