Minibatch ADVI for a large dataset


#1

Hi,

I am dealing with a very large dataset and trying to run Latent Dirichlet Allocation on it using ADVI. I used scikit-learn's CountVectorizer to prepare the data, and the resulting sparse matrix is used as the observed data in the likelihood. But I can't call toarray() to get the dense version of the sparse matrix because it throws a MemoryError. So I tried to use a for loop to run the same model on parts of the data. The problem arises while sampling: how do I combine all the traces, and, before that, how do I save all the trace objects?
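For reference, the chunked densification idea can be sketched like this, with a small random stand-in for my real tf26 matrix (only one row chunk is ever dense at a time):

```python
import numpy as np
import scipy.sparse as sp

# Small stand-in for the real CountVectorizer output (100 docs x 50 terms).
tf26 = sp.csr_matrix(np.random.poisson(0.1, size=(100, 50)))

partitions = 4
splits = tf26.shape[0] // partitions

chunks = []
for i in range(partitions):
    start = i * splits
    # Only this row chunk is densified, so peak memory stays small.
    data = tf26[start:start + splits, :].toarray()
    chunks.append(data.shape)

print(chunks)  # four chunks of 25 rows each
```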

# For minibatch
import numpy as np
import pymc3 as pm
import theano

partitions = 25
index_number = 0
splits = tf26.shape[0] // partitions
tr_coll = []

for i in range(partitions):
    indices = np.arange(index_number,index_number+splits)
    data = tf26[indices,:].toarray()
    
    (D, W) = data.shape  # here W equals the vocabulary size V
    
    LDA_output = theano.shared(data) #Throws error because data is a sparse matrix 
    minibatch_data = pm.Minibatch(data, batch_size=1000)
    
    with model: 
        theta = pm.Dirichlet("thetas_%d" %i, a=alpha, shape=(D, K))
        phi = pm.Dirichlet("phis_%d" %i, a=beta, shape=(K, V))
        doc = pm.DensityDist("doc_%d" %i, log_lda(theta,phi), observed=minibatch_data)   
    with model:    
        inference = pm.ADVI()
        approx = pm.fit(n=1000,method=inference,more_replacements = {LDA_output:minibatch_data})
    tr = approx.sample(draws=500)
    
    index_number = index_number + splits
    
    tr_coll.append(tr)
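One way to combine results afterwards would be to pull the posterior draws out of each trace as NumPy arrays and concatenate them along the document axis. A minimal sketch, with hypothetical stand-ins for the traces (the real arrays would come from tr["thetas_%d" % i]):

```python
import numpy as np

# Hypothetical stand-ins for the per-partition traces: each exposes
# its theta draws as a (draws, D, K) array, like tr["thetas_%d" % i].
draws, D, K, partitions = 500, 10, 5, 3
traces = [{"thetas_%d" % i: np.random.dirichlet(np.ones(K), size=(draws, D))}
          for i in range(partitions)]

# Stack the document axis: one (draws, D * partitions, K) array
# holding the theta draws for every document across all partitions.
all_thetas = np.concatenate(
    [traces[i]["thetas_%d" % i] for i in range(partitions)], axis=1)
print(all_thetas.shape)
```

The phi draws can't be combined the same way, since each partition fits its own phis_i over the full vocabulary; reconciling those is a separate modelling question.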

Any suggestions would be of great help!
Thanks


#2

Try saving the traces with pm.save_trace and loading them back with pm.load_trace. See also Saving and Loading GP model in PYMC3


#3

The problem in my case is that rerunning the model takes a very long time. If I used pm.save_trace I would have to rerun everything, and that isn't feasible given my constraints.