Hi,
I am working with a very large dataset and trying to run Latent Dirichlet Allocation on it using ADVI. I used scikit-learn's CountVectorizer to prepare the data, and the resulting sparse matrix is treated as the observed data in the likelihood. But I can't call toarray() to get the dense version of the sparse matrix because it raises a MemoryError. So I tried using a for loop to run the same model on parts of the data. The problem arises during sampling: how do I combine all the traces, and before that, how do I save all the trace objects?
# Minibatch version
import numpy as np
import pymc3 as pm
import theano

# tf26 is the sparse document-term matrix from CountVectorizer;
# alpha, beta, K, V, model and log_lda are defined earlier.
partitions = 25
index_number = 0
splits = int(tf26.shape[0] / partitions)
tr_coll = []

for i in range(partitions):
    indices = np.arange(index_number, index_number + splits)
    data = tf26[indices, :].toarray()  # densify only this partition
    (D, W) = data.shape  # here W equals V (vocabulary size)
    # This raised an error when data was still sparse; it is dense after toarray()
    LDA_output = theano.shared(data)
    minibatch_data = pm.Minibatch(data, batch_size=1000)
    with model:
        theta = pm.Dirichlet("thetas_%d" % i, a=alpha, shape=(D, K))
        phi = pm.Dirichlet("phis_%d" % i, a=beta, shape=(K, V))
        doc = pm.DensityDist("doc_%d" % i, log_lda(theta, phi),
                             observed=minibatch_data)
    with model:
        inference = pm.ADVI()
        approx = pm.fit(n=1000, method=inference,
                        more_replacements={LDA_output: minibatch_data})
        tr = approx.sample(draws=500)
    index_number = index_number + splits  # the extra +1 here skipped a row per partition
    tr_coll.append(tr)  # tr_coll[i] = tr fails on an empty list
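
On the saving/combining part, here is a minimal sketch of one possible approach, assuming PyMC3. It pickles the raw posterior arrays from each MultiTrace (pickling the arrays rather than the trace object avoids needing the model context when loading) and then reloads and stacks the document-topic samples. The helper names save_partition_trace, combine_theta and the path_prefix are hypothetical, just for illustration:

# A sketch for saving each per-partition trace and recombining the thetas.
import pickle

import numpy as np

def save_partition_trace(tr, i, path_prefix="lda_trace"):
    """Dump the posterior sample arrays of partition i to a pickle file."""
    # tr[name] returns the stacked samples for that variable, e.g. thetas_i
    arrays = {name: tr[name] for name in tr.varnames}
    with open("%s_%d.pkl" % (path_prefix, i), "wb") as f:
        pickle.dump(arrays, f)

def combine_theta(partitions, path_prefix="lda_trace"):
    """Reload every partition and stack the document-topic samples.

    Each thetas_i has shape (draws, D, K); documents in different
    partitions are disjoint, so concatenating along the document axis
    gives one (draws, total_docs, K) array.
    """
    thetas = []
    for i in range(partitions):
        with open("%s_%d.pkl" % (path_prefix, i), "rb") as f:
            arrays = pickle.load(f)
        thetas.append(arrays["thetas_%d" % i])
    return np.concatenate(thetas, axis=1)

You could call save_partition_trace(tr, i) inside the loop right after approx.sample. Note that the phis are per-partition estimates of the same global topics, so they can't simply be concatenated the way the thetas can; they would need averaging or a joint refit.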
Any suggestions would be of great help!
Thanks