Hey there folks,
Trying out pymc3 to cluster some search queries. As they're search queries, my belief is that each token in a query is sampled from a single topic/distribution, rather than a mixture as per LDA.
I’ve pinched some code from here: https://mathformachines.com/posts/bayesian-topic-modeling/#mixture-of-unigrams-naive-bayes
import numpy as np
import pymc3 as pm

# Number of topics
K = 3

data_index = docs['doc_id'].to_numpy()   # document id for each token
data = docs['token_id'].to_numpy()       # integer-encoded token ids
vocab = list(encoder.classes_)           # the vocabulary
V = len(vocab)
D = max(data_index) + 1

# Pseudo-counts for topics and words
alpha = np.ones(K) * 0.8
beta = np.ones(V) * 0.8

with pm.Model() as naive_model:
    # Global topic distribution
    theta = pm.Dirichlet("theta", a=alpha)
    # Word distributions for the K topics
    phi = pm.Dirichlet("phi", a=beta, shape=(K, V))
    # Topic of each document
    z = pm.Categorical("z", p=theta, shape=D)
    # Words in documents: each token uses its document's topic's word distribution
    p = phi[z][data_index]
    w = pm.Categorical("w", p=p, observed=data)

with naive_model:
    draws = 1000
    # chains/cores seems to have some bugs on mac
    naive_trace = pm.sample(draws, tune=1000, chains=2, progressbar=True)
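For reference, here's a tiny numpy-only sketch of the input format I'm assuming (the `docs` DataFrame and `encoder` above are mine, so the toy arrays here are just stand-ins): the corpus is flattened token-by-token, and `phi[z][data_index]` works because `z[data_index]` broadcasts one topic per document out to one topic per token.

```python
import numpy as np

# Toy stand-in for docs['doc_id'] / docs['token_id']: one row per token.
doc_ids = np.array([0, 0, 1, 1, 1, 2])    # data_index: which query each token belongs to
token_ids = np.array([3, 1, 0, 3, 2, 1])  # data: integer-encoded tokens

D = doc_ids.max() + 1    # number of documents (queries)
V = token_ids.max() + 1  # vocab size (normally len(encoder.classes_))

# Mirrors the phi[z][data_index] lookup in the model:
# z holds one topic per document; indexing by doc_ids spreads it to tokens.
z = np.array([2, 0, 1])        # pretend sampled topics for the 3 docs
per_token_topic = z[doc_ids]   # shape (N,): one topic per observed token
print(per_token_topic)         # [2 2 0 0 0 1]
```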
Sampling is speedy and the output is fine when I have a small number of observed docs (e.g. 100), but with 1000 docs the sampling takes roughly 100 times longer.
Any thoughts?