Using dataset as observed in likelihood


I am working with a very large dataset for LDA and want to use all of it at once for posterior inference. I am trying to use theano.sparse.as_sparse_variable so that I can pass the result as the observed value in my log-likelihood calculation.

import pymc3 as pm
import theano
import theano.sparse
import theano.tensor as t

theano_sparse_data = theano.sparse.as_sparse_variable(sparse_data)

def log_lda(theta, phi):
    def ll_lda(value):
        # Document and word indices of the non-zero counts
        dixs, vixs = value.nonzero()
        vfreqs = value[dixs, vixs]
        ll = vfreqs * pm.math.logsumexp(t.log(theta[dixs]) + t.log(phi.T[vixs]), axis=1).ravel()
        return t.sum(ll)
    return ll_lda

with model:
    theta = pm.Dirichlet("thetas", a=alpha, shape=(D, K))
    phi = pm.Dirichlet("phis", a=beta, shape=(K, V))
    doc = pm.DensityDist('doc', log_lda(theta, phi), observed=theano_sparse_data)

I am looking for a way to make something like this work so that I can use all my data at once.

P.S.: I can't convert the matrix to dense because that raises a MemoryError.

Help much needed.

Thanks in advance.

Try supplying the observed value as a three-column NumPy array, where the first and second columns are the document and word indices and the last column is the count:

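def log_lda(theta, phi, value):
    # value: (n_nonzero, 3) array of [doc index, word index, count];
    # the two index columns must be integer-typed to be usable for indexing
    ll = value[:, 2] * pm.math.logsumexp(t.log(theta[value[:, 0]]) + t.log(phi.T[value[:, 1]]), axis=1).ravel()
    return t.sum(ll)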

with model:
    doc = pm.DensityDist('doc', log_lda, observed=dict(theta=theta, phi=phi, value=sparse_data))

Also, don't use Theano's sparse module: its support is limited and it doesn't allow you to do everything.
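
In case it helps, here is a minimal NumPy-only sketch of the triplet layout described above. It builds the three-column array from a small toy count matrix and checks that the log-likelihood computed from the triplets matches the one computed from the dense matrix. The helper names (to_triplets, loglik_triplets, loglik_dense) and the toy sizes are made up for illustration and are not part of PyMC3. For a real scipy.sparse matrix you would not go through a dense array; converting with .tocoo() gives the same three pieces as the .row, .col and .data attributes.

```python
import numpy as np

def to_triplets(counts):
    """Return an (n_nonzero, 3) integer array of [doc index, word index, count]."""
    dixs, vixs = np.nonzero(counts)
    return np.column_stack([dixs, vixs, counts[dixs, vixs]]).astype(np.int64)

def logsumexp(x, axis):
    # Numerically stable log(sum(exp(x))) along the given axis
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)

def loglik_triplets(theta, phi, triplets):
    # Same expression as in the answer, written with NumPy instead of Theano
    d, v, freq = triplets[:, 0], triplets[:, 1], triplets[:, 2]
    per_entry = logsumexp(np.log(theta[d]) + np.log(phi.T[v]), axis=1)
    return np.sum(freq * per_entry)

def loglik_dense(theta, phi, counts):
    # Reference computation on the full (D, V) matrix, only feasible for toy sizes
    word_probs = theta @ phi
    return np.sum(counts * np.log(word_probs))

# Toy problem: 4 documents, 3 topics, 6 vocabulary words (arbitrary sizes)
rng = np.random.default_rng(0)
D, K, V = 4, 3, 6
theta = rng.dirichlet(np.ones(K), size=D)   # (D, K) document-topic weights
phi = rng.dirichlet(np.ones(V), size=K)     # (K, V) topic-word weights
counts = rng.poisson(0.8, size=(D, V))      # mostly small or zero counts

triplets = to_triplets(counts)
ll_sparse = loglik_triplets(theta, phi, triplets)
ll_dense = loglik_dense(theta, phi, counts)
print(triplets.shape, np.isclose(ll_sparse, ll_dense))
```

Zero counts contribute nothing to the sum, which is why only the non-zero entries need to be stored and passed as observed.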


Thanks for the reply. It worked; I only made a small change to the way the observed data is fed in.

Again, thanks a ton!
