LDA implementation with pymc3

Hi,

I am implementing LDA with pymc3, working from the following pymc (version 2) code:

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V+1)

theta = pm.Container([pm.Dirichlet("theta_%s" % i, theta=alpha) for i in range(D)])
phi = pm.Container([pm.Dirichlet("phi_%s" % k, theta=beta) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                             p = theta[d], 
                             size=Wd[d],
                             value=np.random.randint(K, size=Wd[d]),
                             verbose=1)
              for d in range(D)])


w = pm.Container([pm.Categorical("w_%i" % d,
                             p = pm.Lambda('phi_z_%i' % d, lambda z=z, phi=phi: [phi[z[d][i]] for i in range(Wd[d])]),
                             value=data[d], 
                             observed=True, 
                             verbose=1)
              for d in range(D)])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100, burn=10)
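After sampling, the posterior draws can be read off the trace; a minimal example, using the variable names above:

print(mcmc.trace('theta_0')[:])  # posterior draws for the first document's topic weights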

I am trying to port it to pymc3, but I am facing problems while defining w:

import numpy as np
import pymc3 as pm, theano, theano.tensor as t

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V)
model = pm.Model()
Wd = [len(doc) for doc in data]
(D, W) = data.shape
@theano.compile.ops.as_op(itypes=[t.lscalar, t.dscalar],otypes=[t.dvector])
def p(z=z, phi=phi):
    return [phi[z[i,j]] for i in range(D) for j in range(Wd)]

with model: 
    theta = [pm.Dirichlet("pthetax_%s" % i, a=alpha, shape=K) for i in range(D)]
    phi =[pm.Dirichlet("pphix_%s" % k, a=beta,shape=V) for k in range(K)]
   
    
    z = [pm.Categorical('zx_%i' % d, 
                         p = theta[d], 
                         shape=W)
                      for d in range(D)]
     
    w = [pm.Categorical("wx_%i_%i" % (d,i),
                        p = p(z,phi), observed = data[d][i])
                      for d in range(D) for i in range(W)]
#model = pm.Model([theta, phi, z, w])
with model:
    step1 = pm.Metropolis(vars=[theta, phi, z, w])
    tr = pm.sample(1000, step=step1)

pm.plots.traceplot(tr, ['theta', 'phi', 'z', 'w']);

But every time I am getting the error:

AttributeError: 'list' object has no attribute 'type'

Can someone please help me with this? Or is there another way of implementing LDA using pymc3?


There is an LDA example in the docs you should check out: http://docs.pymc.io/notebooks/lda-advi-aevb.html

Thanks @junpenglao! But that is far too complicated for me at this stage; I want to implement it on a small scale first. I am still facing problems defining w in my case, and the linked example uses DensityDist, which I want to avoid.

Categorical is not implemented to handle a high-dimensional p, so you need to carefully validate the shapes and reshape the input parameters:

alpha = np.ones((1, K))
beta = np.ones((1, V))
with pm.Model() as model:
    thetas = pm.Dirichlet("thetas", a=alpha, shape=(D, K))  # per-document topic weights
    phis = pm.Dirichlet("phis", a=beta, shape=(K, V))       # per-topic word weights
    z = pm.Categorical("zx", p=thetas, shape=(W, D))        # topic assignment for each word slot
    w = pm.Categorical("wx",
                       p=t.reshape(phis[z], (D*W, V)),      # flatten so p is 2-dimensional
                       observed=data.reshape(D*W))

Something like this seems to work, but you should double-check to make sure the shape and reshape are behaving as intended.
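One quick way to verify is to inspect the test values theano attaches to each variable (pymc3 sets these up by default); a small sketch, using the names above:

print(thetas.tag.test_value.shape)  # expect (D, K)
print(phis.tag.test_value.shape)    # expect (K, V)
print(z.tag.test_value.shape)       # expect (W, D)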


Thanks @junpenglao, it worked well. But when I try to sample it using

with model:
    step1 = pm.Metropolis()
    tr = pm.sample(1000, step=step1, chains=1)

I am getting the error below. The shaping and reshaping seem to work fine in this case.

IndexError: index 3 is out of bounds for size 2
Apply node that caused the error: AdvancedSubtensor1(phis, Subtensor{int64:int64:}.0)

But using tr = pm.sample(1000, chains=1) works. I am not able to understand why one works and the other doesn't.

Metropolis does not generate valid proposals for Categorical random variables; see https://github.com/pymc-devs/pymc3/issues/3023#issuecomment-396288202
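For context: when you call pm.sample without a step argument, pymc3 assigns a suitable sampler to each variable automatically (a CategoricalGibbsMetropolis for the categorical ones), which is why the second call works. A rough sketch of making that assignment explicit, using the variable names from the model above:

with model:
    step_disc = pm.CategoricalGibbsMetropolis(vars=[z])  # valid proposals for the categorical z
    step_cont = pm.NUTS(vars=[thetas, phis])             # gradient-based sampler for the Dirichlets
    tr = pm.sample(1000, step=[step_disc, step_cont], chains=1)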

Thanks a lot!

I was going through the example, but I am having trouble understanding how the function below works.

def ll_docs_f(docs):
    dixs, vixs = docs.nonzero()
    vfreqs = docs[dixs, vixs]
    ll_docs = vfreqs * pmmath.logsumexp(
        tt.log(theta[dixs]) + tt.log(beta.T[vixs]), axis=1).ravel()

    # Per-word log-likelihood times num of tokens in the whole dataset
    return tt.sum(ll_docs) / (tt.sum(vfreqs) + 1e-9) * n_tokens

The dimensions of theta and beta.T are not the same, so I would expect a dimension mismatch error. I am not using the same dataset as the example; I am creating dummy data for my implementation of ADVI on the model:

def log_lda(theta, phi):
    ll = pm.math.logsumexp(t.log(theta) + t.log(phi.T), axis=1)
    return ll

with model:
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    doc = pm.DensityDist('doc', log_lda(theta, phi), observed=data)

I get an error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 2)

Any insight would be helpful. Thanks.

It's throwing an error because in t.log(theta) + t.log(phi.T), theta and phi.T have different shapes. You need to broadcast them both to the same shape first. In the example, there is indexing that does that; similarly, you need to write your function to take the observed value as input as well (for the indexing).

Any suggestions on how I can do that? What I understand from the likelihood equation is to sum over all topics and over all the words in the documents, so θ_{d,k} would be of dimension (D, 1) and β_{k,w} of dimension (V, 1). Am I correct? Should I then pad β_{k,w} with zeros to make it the same dimension as θ_{d,k}, or is there another way?

You should understand it as a sum over all the (latent) topics, since the word is what is actually observed. The general idea is that, for each observed word, you use the observed value (a tokenized index) to index across all topics and sum over the topics. You do that for all the observed words, then sum over the resulting vector.
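To make the indexing concrete, here is a small numpy sketch of the idea (dummy shapes and data, not the exact code from the example): each observed word picks the row of theta for its document and the column of beta for its word id, so both factors end up with shape (n_obs, K) and broadcast cleanly before summing over topics.

import numpy as np

D, K, V = 3, 2, 4
theta = np.random.dirichlet(np.ones(K), size=D)  # (D, K) document-topic weights
beta = np.random.dirichlet(np.ones(V), size=K)   # (K, V) topic-word weights
docs = np.array([[1, 0, 2, 1],
                 [0, 0, 1, 3],
                 [2, 2, 0, 0]])                  # (D, V) term counts

dixs, vixs = docs.nonzero()                             # document / word index per observed entry
log_joint = np.log(theta[dixs]) + np.log(beta.T[vixs])  # both terms are (n_obs, K)
ll_per_obs = np.log(np.exp(log_joint).sum(axis=1))      # logsumexp over the K latent topics
ll = (docs[dixs, vixs] * ll_per_obs).sum()              # weight by counts, sum over words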

Thanks.

@junpenglao I am trying to fit the LDA model to a dataset from scikit-learn. On a dummy dataset my model works fine, but in this case it gives MemoryError: None.

My code is:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pymc3 as pm, theano.tensor as t
from theano import shared

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

# LDA can only use raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

# we have a sparse dataset; it's better to have a dense batch so that all words occur in it
#minibatch_size = 128

# defining minibatch
#doc_t_minibatch = pm.Minibatch(tf.toarray(), minibatch_size)
#doc_t = shared(tf.toarray()[:minibatch_size])

K = 20 # number of topics
V = 1000 # number of words

alpha = np.ones((1, K))
beta = np.ones((1, V))
model = pm.Model()
(D, W) = tf.shape
        
with model: 
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    z = pm.Categorical("zx", p=theta, shape=(W,D))
    w = pm.Categorical("wx", 
                       p=t.reshape(phi[z.T], (D*W, V)), 
                       observed=tf.reshape(D*W))
    
   
with model:    
    tr = pm.sample(1000,chains = 1)
    pm.plots.traceplot(tr, ['thetas','phis']);

Can you please help me with this? What could be going wrong?

Looks like you don't have enough memory. Would you be able to access a machine with more memory?

But when I redefine the model for ADVI, it works.

You shouldn't use ADVI, as it doesn't work when you have discrete variables in your model.

For ADVI I tried using:

def log_lda(theta, phi):
    def ll_lda(value):
        dixs, vixs = value.nonzero()
        vfreqs = value[dixs, vixs]
        ll = vfreqs * pm.math.logsumexp(t.log(theta[dixs]) + t.log(phi.T[vixs]), axis=1).ravel()
        return t.sum(ll)
    return ll_lda

with model: 
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    doc = pm.DensityDist('doc', log_lda(theta,phi), observed=doc_t)
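and then fitting it with the standard pymc3 variational API, something like this sketch (assuming the model above):

with model:
    approx = pm.fit(10000, method='advi')  # mean-field ADVI
    tr = approx.sample(1000)               # draw samples from the fitted approximation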

And sampling from this model gives you a memory error as well?

No, it doesn't give me a memory error while sampling.
