Minibatch ADVI for a large dataset



I am dealing with a very large dataset and trying to run Latent Dirichlet Allocation on it using ADVI. I used scikit-learn's CountVectorizer to prepare the data, and the resulting sparse matrix is treated as the observed data in the likelihood. But I can't call toarray() to get the dense version of the sparse matrix, because that throws a MemoryError. So I tried to use a for loop to run the same model on parts of the data. But the problem arises while sampling: how do I combine all the traces, and before that, how do I save all the trace objects?
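As a minimal sketch of the chunking idea (sizes here are illustrative, not from the real matrix): np.array_split partitions the row indices and distributes any remainder across chunks, so every row lands in exactly one chunk and each chunk can be densified on its own.

```python
import numpy as np

n_rows, partitions = 103, 25  # illustrative sizes; use tf26.shape[0] for the real matrix

# np.array_split spreads the remainder over the chunks,
# so no row is skipped and no row appears twice
index_chunks = np.array_split(np.arange(n_rows), partitions)

# each chunk is then small enough to densify by itself, e.g.:
#     data = tf26[chunk, :].toarray()
```

This avoids tracking a running index_number by hand, which is where off-by-one errors tend to creep in.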

For the minibatch:

partitions = 25
index_number = 0
splits = int(tf26.shape[0] / partitions)
tr_coll = []

for i in range(partitions):
    indices = np.arange(index_number, index_number + splits)
    data = tf26[indices, :].toarray()  # densify only this chunk
    (D, W) = data.shape  # in this case V and W have the same length
    LDA_output = theano.shared(data)  # theano.shared needs a dense array, hence toarray() above
    minibatch_data = pm.Minibatch(data, batch_size=1000)
    with model:
        theta = pm.Dirichlet("thetas_%d" % i, a=alpha, shape=(D, K))
        phi = pm.Dirichlet("phis_%d" % i, a=beta, shape=(K, V))
        doc = pm.DensityDist("doc_%d" % i, log_lda(theta, phi), observed=minibatch_data)
    with model:
        inference = pm.ADVI()
        approx = pm.fit(method=inference, more_replacements={LDA_output: minibatch_data})
    tr = approx.sample(draws=500)
    index_number = index_number + splits  # "+ splits + 1" would skip one row per chunk
    tr_coll.append(tr)  # tr_coll is a list, so append rather than index into it
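On combining the traces afterwards: once each chunk's draws are collected, the per-chunk posterior samples for a variable can be stacked along the draw axis with plain numpy. A hedged sketch with made-up shapes (in the real loop the arrays would come from each trace, e.g. tr_coll[i]["thetas_%d" % i]):

```python
import numpy as np

# hypothetical stand-in for 3 chunks of 500 posterior draws of a 4-dim variable
chunk_draws = [np.random.rand(500, 4) for _ in range(3)]

# stack along the draw axis to get one combined array of samples
all_draws = np.concatenate(chunk_draws, axis=0)  # shape (1500, 4)
```

This only makes sense for variables that share a shape across chunks; the per-chunk thetas_i have different document dimensions and would have to be kept separate.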

Any suggestions would be of great help!


Try saving the traces using pm.save_trace and pm.load_trace. See also Saving and Loading GP model in PyMC3
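If reloading with pm.load_trace is awkward (it needs the model in scope), an alternative sketch is to extract the raw draw arrays from each trace and pickle them, which is model-independent. The variable name and shape below are hypothetical stand-ins for what you would pull out of a MultiTrace:

```python
import pickle
import numpy as np

# hypothetical stand-in for one chunk's draws,
# e.g. {"thetas_0": tr["thetas_0"]} extracted from the trace
draws = {"thetas_0": np.arange(12.0).reshape(4, 3)}

# write the arrays to disk, one file per chunk
with open("trace_chunk_0.pkl", "wb") as f:
    pickle.dump(draws, f)

# later, read them back without needing the model
with open("trace_chunk_0.pkl", "rb") as f:
    loaded = pickle.load(f)
```

Plain numpy arrays survive the round trip exactly, so the draws can be reloaded and combined in a fresh session.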


The problem in my case is that rerunning the model takes a very long time, and pm.load_trace requires the model to be rebuilt before the trace can be loaded. So even if I use pm.save_trace, I would still have to rerun the model, and that is not feasible given my constraints.