LDA implementation with pymc3

Hi,

I am implementing LDA with pymc3, working from the following pymc (version 2) code:

import numpy as np
import pymc as pm

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V+1)

theta = pm.Container([pm.Dirichlet("theta_%s" % i, theta=alpha) for i in range(D)])
phi = pm.Container([pm.Dirichlet("phi_%s" % k, theta=beta) for k in range(K)])
Wd = [len(doc) for doc in data]

z = pm.Container([pm.Categorical('z_%i' % d, 
                             p = theta[d], 
                             size=Wd[d],
                             value=np.random.randint(K, size=Wd[d]),
                             verbose=1)
              for d in range(D)])


w = pm.Container([pm.Categorical("w_%i" % d,
                             p = pm.Lambda('phi_z_%i' % d, lambda z=z, phi=phi: [phi[z[d][i]] for i in range(Wd[d])]),
                             value=data[d], 
                             observed=True, 
                             verbose=1)
              for d in range(D)])

model = pm.Model([theta, phi, z, w])
mcmc = pm.MCMC(model)
mcmc.sample(100, burn=10)
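After sampling, the posterior draws can be read off the trace; a minimal example, using the variable names above:

print(mcmc.trace('theta_0')[:])  # posterior draws for the first document's topic weights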

I am trying to port it to pymc3, but I am facing problems while defining w:

import numpy as np
import pymc3 as pm, theano, theano.tensor as t

K = 2 # number of topics
V = 4 # number of words
D = 3 # number of documents

data = np.array([[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0]])

alpha = np.ones(K)
beta = np.ones(V)
model = pm.Model()
Wd = [len(doc) for doc in data]
(D, W) = data.shape
@theano.compile.ops.as_op(itypes=[t.lscalar, t.dscalar],otypes=[t.dvector])
def p(z=z, phi=phi):
    return [phi[z[i,j]] for i in range(D) for j in range(Wd)]

with model: 
    theta = [pm.Dirichlet("pthetax_%s" % i, a=alpha, shape=K) for i in range(D)]
    phi =[pm.Dirichlet("pphix_%s" % k, a=beta,shape=V) for k in range(K)]
   
    
    z = [pm.Categorical('zx_%i' % d, 
                         p = theta[d], 
                         shape=W)
                      for d in range(D)]
     
    w = [pm.Categorical("wx_%i_%i" % (d,i),
                        p = p(z,phi), observed = data[d][i])
                      for d in range(D) for i in range(W)]
#model = pm.Model([theta, phi, z, w])
with model:
    step1 = pm.Metropolis(vars=[theta, phi, z, w])
    tr = pm.sample(1000, step=step1)

pm.plots.traceplot(tr, ['theta', 'phi', 'z', 'w']);

But every time I am getting the error:

AttributeError: 'list' object has no attribute 'type'

Can someone please help me with this? Or is there another way of implementing LDA using pymc3?


There is an LDA example in the docs you should check out: http://docs.pymc.io/notebooks/lda-advi-aevb.html

Thanks @junpenglao! But that is far too complicated for me at this stage; I want to implement it on a small scale first. I am still facing problems defining w in my case, and the linked example uses DensityDist, which I want to avoid.

Categorical is not implemented to handle a high-dimensional p, so you need to carefully validate the shapes and reshape the input parameters:

alpha = np.ones((1, K))
beta = np.ones((1, V))
with pm.Model() as model:
    thetas = pm.Dirichlet("thetas", a=alpha, shape=(D, K))  # per-document topic weights
    phis = pm.Dirichlet("phis", a=beta, shape=(K, V))       # per-topic word weights
    z = pm.Categorical("zx", p=thetas, shape=(W, D))        # topic assignment for each word slot
    w = pm.Categorical("wx",
                       p=t.reshape(phis[z], (D*W, V)),      # flatten so p is 2-dimensional
                       observed=data.reshape(D*W))

Something like this seems to work, but you should double-check to make sure the shape and reshape are behaving as intended.
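One quick way to verify is to inspect the test values theano attaches to each variable (pymc3 sets these up by default); a small sketch, using the names above:

print(thetas.tag.test_value.shape)  # expect (D, K)
print(phis.tag.test_value.shape)    # expect (K, V)
print(z.tag.test_value.shape)       # expect (W, D)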


Thanks @junpenglao, it worked well. But when I try to sample it using

with model:
    step1 = pm.Metropolis()
    tr = pm.sample(1000, step=step1, chains=1)

I am getting the error below. The shaping and reshaping seem to work fine in this case.

IndexError: index 3 is out of bounds for size 2
Apply node that caused the error: AdvancedSubtensor1(phis, Subtensor{int64:int64:}.0)

But using tr = pm.sample(1000, chains=1) works. I am not able to understand why one works and the other doesn't.

Metropolis does not generate valid proposals for Categorical random variables; see https://github.com/pymc-devs/pymc3/issues/3023#issuecomment-396288202
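For context: when you call pm.sample without a step argument, pymc3 assigns a suitable sampler to each variable automatically (a CategoricalGibbsMetropolis for the categorical ones), which is why the second call works. A rough sketch of making that assignment explicit, using the variable names from the model above:

with model:
    step_disc = pm.CategoricalGibbsMetropolis(vars=[z])  # valid proposals for the categorical z
    step_cont = pm.NUTS(vars=[thetas, phis])             # gradient-based sampler for the Dirichlets
    tr = pm.sample(1000, step=[step_disc, step_cont], chains=1)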

Thanks a lot!

I was going through the example, but I am having trouble understanding how the function below works.

def ll_docs_f(docs):
    dixs, vixs = docs.nonzero()
    vfreqs = docs[dixs, vixs]
    ll_docs = vfreqs * pmmath.logsumexp(
        tt.log(theta[dixs]) + tt.log(beta.T[vixs]), axis=1).ravel()

    # Per-word log-likelihood times num of tokens in the whole dataset
    return tt.sum(ll_docs) / (tt.sum(vfreqs) + 1e-9) * n_tokens

The dimensions of theta and beta.T are not the same, so I would expect a dimension mismatch error. I am not using the same dataset as the example; I am creating dummy data for my implementation of ADVI on the model:

def log_lda(theta, phi):
    ll = pm.math.logsumexp(t.log(theta) + t.log(phi.T), axis=1)
    return ll

with model:
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    doc = pm.DensityDist('doc', log_lda(theta, phi), observed=data)

I get an error:
ValueError: Input dimension mis-match. (input[0].shape[0] = 3, input[1].shape[0] = 2)

Any insight would be helpful. Thanks.

It's throwing an error because in t.log(theta) + t.log(phi.T), theta and phi.T have different shapes. You need to broadcast them both to the same shape first. In the example, there is indexing that does that; similarly, you need to write your function to take the observed value as input as well (for the indexing).

Any suggestions on how I can do that? What I understand from the likelihood equation is to sum over all topics and over all the words in the documents, so θ_{d,k} would be of dimension (D, 1) and β_{k,w} of dimension (V, 1). Am I correct? Should I then pad β_{k,w} with zeros to make it the same dimension as θ_{d,k}, or is there another way?

You should understand it as a sum over all the (latent) topics, since the word is what is actually observed. The general idea is that, for each observed word, you use the observed value (a tokenized index) to index across all topics and sum over the topics. You do that for all the observed words, then sum over the resulting vector.
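To make the indexing concrete, here is a small numpy sketch of the idea (dummy shapes and data, not the exact code from the example): each observed word picks the row of theta for its document and the column of beta for its word id, so both factors end up with shape (n_obs, K) and broadcast cleanly before summing over topics.

import numpy as np

D, K, V = 3, 2, 4
theta = np.random.dirichlet(np.ones(K), size=D)  # (D, K) document-topic weights
beta = np.random.dirichlet(np.ones(V), size=K)   # (K, V) topic-word weights
docs = np.array([[1, 0, 2, 1],
                 [0, 0, 1, 3],
                 [2, 2, 0, 0]])                  # (D, V) term counts

dixs, vixs = docs.nonzero()                             # document / word index per observed entry
log_joint = np.log(theta[dixs]) + np.log(beta.T[vixs])  # both terms are (n_obs, K)
ll_per_obs = np.log(np.exp(log_joint).sum(axis=1))      # logsumexp over the K latent topics
ll = (docs[dixs, vixs] * ll_per_obs).sum()              # weight by counts, sum over words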

Thanks.

@junpenglao I am trying to fit the LDA model to a dataset from scikit-learn. On a dummy dataset my model works fine, but in this case it gives MemoryError: None.

My code is:

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import pymc3 as pm, theano.tensor as t
from theano import shared

dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data

no_features = 1000

# LDA can only use raw term counts because it is a probabilistic graphical model
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

# we have a sparse dataset; it's better to have a dense batch so that all words occur in it
#minibatch_size = 128

# defining minibatch
#doc_t_minibatch = pm.Minibatch(tf.toarray(), minibatch_size)
#doc_t = shared(tf.toarray()[:minibatch_size])

K = 20 # number of topics
V = 1000 # number of words

alpha = np.ones((1, K))
beta = np.ones((1, V))
model = pm.Model()
(D, W) = tf.shape
        
with model: 
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    z = pm.Categorical("zx", p=theta, shape=(W,D))
    w = pm.Categorical("wx", 
                       p=t.reshape(phi[z.T], (D*W, V)), 
                       observed=tf.reshape(D*W))
    
   
with model:    
    tr = pm.sample(1000,chains = 1)
    pm.plots.traceplot(tr, ['thetas','phis']);

Can you please help me with this? What could be going wrong?

Looks like you don't have enough memory. Would you be able to access a machine with more memory?

But when I redefine the model for ADVI, it works.

You shouldn't use ADVI, as it doesn't work when you have discrete variables in your model.

For ADVI I tried using:

def log_lda(theta, phi):
    def ll_lda(value):
        dixs, vixs = value.nonzero()
        vfreqs = value[dixs, vixs]
        ll = vfreqs * pm.math.logsumexp(t.log(theta[dixs]) + t.log(phi.T[vixs]), axis=1).ravel()
        return t.sum(ll)
    return ll_lda

with model: 
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    doc = pm.DensityDist('doc', log_lda(theta,phi), observed=doc_t)
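and then fitting it with the standard pymc3 variational API, something like this sketch (assuming the model above):

with model:
    approx = pm.fit(10000, method='advi')  # mean-field ADVI
    tr = approx.sample(1000)               # draw samples from the fitted approximation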

And sampling from this model gives you a memory error as well?

No, it doesn't give me a memory error while sampling.
