Minibatch ADVI for a large dataset


#1

Hi,

I am dealing with a very large dataset and trying to run Latent Dirichlet Allocation on it using ADVI. I used scikit-learn's CountVectorizer to prepare the data, and the resulting sparse matrix is used as the observed data in the likelihood. But I can't call toarray() to get the dense version of the sparse matrix because it throws a MemoryError. So I tried to use a for loop to run the same model on parts of the data. The problem arises while sampling: how do I combine all the traces, and, before that, how do I save all the trace objects?
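For reference, the chunked densification idea can be sketched like this, with a small random stand-in for my real tf26 matrix (only one row chunk is ever dense at a time):

```python
import numpy as np
import scipy.sparse as sp

# Small stand-in for the real CountVectorizer output (100 docs x 50 terms).
tf26 = sp.csr_matrix(np.random.poisson(0.1, size=(100, 50)))

partitions = 4
splits = tf26.shape[0] // partitions

chunks = []
for i in range(partitions):
    start = i * splits
    # Only this row chunk is densified, so peak memory stays small.
    data = tf26[start:start + splits, :].toarray()
    chunks.append(data.shape)

print(chunks)  # four chunks of 25 rows each
```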

# For minibatch
import numpy as np
import pymc3 as pm
import theano

partitions = 25
index_number = 0
splits = tf26.shape[0] // partitions
tr_coll = []

for i in range(partitions):
    indices = np.arange(index_number,index_number+splits)
    data = tf26[indices,:].toarray()
    
    (D, W) = data.shape  # here W equals the vocabulary size V
    
    LDA_output = theano.shared(data) #Throws error because data is a sparse matrix 
    minibatch_data = pm.Minibatch(data, batch_size=1000)
    
    with model: 
        theta = pm.Dirichlet("thetas_%d" %i, a=alpha, shape=(D, K))
        phi = pm.Dirichlet("phis_%d" %i, a=beta, shape=(K, V))
        doc = pm.DensityDist("doc_%d" %i, log_lda(theta,phi), observed=minibatch_data)   
    with model:    
        inference = pm.ADVI()
        approx = pm.fit(n=1000,method=inference,more_replacements = {LDA_output:minibatch_data})
    tr = approx.sample(draws=500)
    
    index_number = index_number + splits
    
    tr_coll.append(tr)
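One way to combine results afterwards would be to pull the posterior draws out of each trace as NumPy arrays and concatenate them along the document axis. A minimal sketch, with hypothetical stand-ins for the traces (the real arrays would come from tr["thetas_%d" % i]):

```python
import numpy as np

# Hypothetical stand-ins for the per-partition traces: each exposes
# its theta draws as a (draws, D, K) array, like tr["thetas_%d" % i].
draws, D, K, partitions = 500, 10, 5, 3
traces = [{"thetas_%d" % i: np.random.dirichlet(np.ones(K), size=(draws, D))}
          for i in range(partitions)]

# Stack the document axis: one (draws, D * partitions, K) array
# holding the theta draws for every document across all partitions.
all_thetas = np.concatenate(
    [traces[i]["thetas_%d" % i] for i in range(partitions)], axis=1)
print(all_thetas.shape)
```

The phi draws can't be combined the same way, since each partition fits its own phis_i over the full vocabulary; reconciling those is a separate modelling question.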

Any suggestions would be of great help!
Thanks


#2

Try saving the traces with pm.save_trace and loading them back with pm.load_trace. See also Saving and Loading GP model in PYMC3


#3

The problem in my case is that rerunning the model takes a very long time. If I used pm.save_trace I would have to rerun everything, and that isn't feasible given my constraints.