Running with minibatches (memory constraints)

Hi all,

I’m looking for some help understanding how ‘Minibatch’ actually works. I was under the impression that it just creates indicators/identifiers for subsets of the training data, updates the parameter estimates using that subset (I’m using a MeanField approximation with SGD as the optimiser), and then moves on to the next subset. That way, RAM requirements wouldn’t really scale with training set size, because there’s no real need to create anything in memory that’s much larger than the original dataset?

Instead, RAM usage goes up as I add more training samples, and changing the minibatch size only seems to offer minimal control over it.

I’m trying to implement a model like this one in the documentation, which includes replacements = {doc_t: doc_t_minibatch}.
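
To make it concrete, the bare-bones pattern I’m trying to copy looks something like this (a stripped-down sketch with a dummy Normal likelihood and toy data, not my actual model):

import numpy as np
import pymc3 as pm
from theano import shared

# toy "documents": a D x V matrix of counts, purely for illustration
data = np.random.randint(0, 5, size=(1000, 50)).astype('float64')
D, V = data.shape
batch_size = 200

doc_t_minibatch = pm.Minibatch(data, batch_size)  # random batch_size-row slices
doc_t = shared(data[:batch_size])                 # placeholder the model is built on

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0., sd=10., shape=V)
    pm.Normal('obs', mu=mu, sd=1., observed=doc_t, total_size=D)

with model:
    approx = pm.MeanField()
    inference = pm.KLqp(approx)
    inference.fit(1000, more_replacements={doc_t: doc_t_minibatch})

My understanding is that the fit call swaps doc_t for doc_t_minibatch in the objective, so each gradient step only ever sees a batch_size-row slice.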

I’m confused and can’t really work it out from the documentation, so any help would be greatly appreciated!

Thanks a bunch

Just as a bit more info, it’s supposed to be an LLDA model that looks something like this:

import pymc3 as pm
import numpy as np
import scipy.sparse as sps
import theano.tensor as tt
from pymc3.distributions.transforms import t_stick_breaking
from theano import shared
import theano
theano.config.compute_test_value = 'off'

class LLDA_model_pymc3:
    """Takes in a sparse matrix of feature vectors and a dataframe of labels."""

    def __init__(self, word_counts, feature_names, labels):
        self.wordCounts = word_counts
        self.feature_names = feature_names
        self.labels = labels
        self.nTopics = labels.shape[1]        # K
        self.vocabLen = word_counts.shape[1]  # V
        self.nDocs = word_counts.shape[0]     # D
        self.nTokens = np.sum(word_counts[word_counts.nonzero()])

    def build_pymc3_model(self, minibatchSize=200):
        self.minibatchSize = minibatchSize

        def logp_lda_doc(beta, theta):
            """Returns the log-likelihood function for given documents.

            K : number of topics in the model
            V : number of words (size of vocabulary)
            D : number of documents (in a mini-batch)

            Parameters
            ----------
            beta : tensor (K x V)
                Word distributions.
            theta : tensor (D x K)
                Topic distributions for documents (set as strong Dirichlet for supervised model)
            """
            def docLikelihoodFunction(docs):
                documentIndex, vocabIndex = docs.nonzero()
                vocabFreqs = docs[documentIndex, vocabIndex]
                docLikelihood = vocabFreqs * pm.math.logsumexp(
                    tt.log(theta[documentIndex]) + tt.log(beta.T[vocabIndex]), axis=1).ravel()

                # per-word log-likelihood * num of tokens in the whole dataset
                return tt.sum(docLikelihood) / tt.sum(vocabFreqs) * self.nTokens

            return docLikelihoodFunction

        self.doc_t_minibatch = pm.Minibatch(self.wordCounts.toarray(), minibatchSize)
        self.doc_t = shared(self.wordCounts.toarray()[:minibatchSize], borrow=True)
        self.topic_t = shared(np.asarray(self.labels)[:minibatchSize], borrow=True)
        self.topic_t_minibatch = pm.Minibatch(np.asarray(self.labels), minibatchSize)

        with pm.Model() as model:
            beta = pm.Dirichlet('beta',
                                a=pm.floatX((1.0 / self.nTopics) * np.ones((self.nTopics, self.vocabLen))),
                                shape=(self.nTopics, self.vocabLen),
                                transform=t_stick_breaking(1e-9))
            doc = pm.DensityDist('doc', logp_lda_doc(beta, self.topic_t), observed=self.doc_t)

        self.model = model

    def inference(self, n_steps=10000, start_learn_rate=0.1):
        try:
            self.model
        except AttributeError:
            print("No pymc model has been defined")
        else:
            n = start_learn_rate
            s = shared(n)

            def reduce_rate(a, h, i):
                s.set_value(n / ((i / self.minibatchSize) + 1) ** .7)

            with self.model:
                approx = pm.MeanField()
                approx.scale_cost_to_minibatch = False
                inference = pm.KLqp(approx)

            inference.fit(n_steps, callbacks=[reduce_rate], obj_optimizer=pm.sgd(learning_rate=s),
                          total_grad_norm_constraint=200,
                          more_replacements={self.doc_t: self.doc_t_minibatch,
                                             self.topic_t: self.topic_t_minibatch})

            self.approx = approx

            samples = pm.sample_approx(approx, draws=100)
            self.vocab_samples = samples['beta'].mean(axis=0)

    def print_top_words(self, n_top_words=10):
        try:
            self.vocab_samples
        except AttributeError:
            print("Error, build model + perform inference first")
        else:
            for i in range(len(self.vocab_samples)):
                print(("Topic #%d: " % i) + " ".join([self.feature_names[j]
                      for j in self.vocab_samples[i].argsort()[:-n_top_words - 1:-1]]))

    def predictions(self, test_word_counts, apply_softmax=True):

        def softmax(x):
            e_x = np.exp(x - np.max(x, axis=1)[:, None])
            return e_x / e_x.sum(axis=1)[:, None]

        try:
            self.vocab_samples
        except AttributeError:
            print("Error, build model + perform inference first")
        else:
            predictions = test_word_counts.dot(self.vocab_samples.transpose())
            if apply_softmax:
                predictions = softmax(predictions)
            return predictions

Minibatch just indexes a subset of your training set. Since the training set is already in memory, using Minibatch won’t help on that front.

Using Minibatch mainly speeds up the gradient computation, which makes your training go faster.
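
For example (a quick sketch with random data), Minibatch gives you a tensor that evaluates to a different random slice of the array each time, while the full array stays in memory:

import numpy as np
import pymc3 as pm

data = np.random.rand(10000, 50)            # full training set, already in RAM
batch = pm.Minibatch(data, batch_size=200)  # symbolic random 200-row slice of it

print(batch.eval().shape)   # (200, 50)
print(batch.eval().shape)   # (200, 50), a different random slice each evaluation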

Thanks for your reply :slight_smile:

I’m still not really understanding why the RAM requirements of my model scale so heavily with more training samples.

Say my training set is 1000 samples (~1 GB in memory), and my model (constructed with minibatches) takes another 2 GB of memory.

If I bump my training set up to 100k samples (5 GB) and use the exact same model (with the same minibatch size), why would the model then take up ~30 GB? Surely it isn’t using that much memory just to store indexes for the minibatches?

In the above code, the basic model stays the same and I’m not passing it more input at once (since the minibatch size stays the same), so you’d expect the RAM requirements to grow by not much more than the increase in the training set itself?

Sorry if these are stupid questions, but I’m just a bit confused!

Thanks again

I don’t have a lot of experience with large datasets; maybe you can try profiling the memory use:
http://deeplearning.net/software/theano/tutorial/profiling.html
http://docs.pymc.io/notebooks/profiling.html

theano.config.profile = True 
theano.config.profile_memory = True 
model.profile(model.logpt).summary()
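
i.e. something along these lines with a dummy model (you would call .profile on your LLDA model instead):

import numpy as np
import theano
import pymc3 as pm

theano.config.profile = True
theano.config.profile_memory = True

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0., sd=1.)
    pm.Normal('obs', mu=mu, sd=1., observed=np.random.randn(100))

# compiles the model log-probability with profiling on and prints per-op time/memory stats
model.profile(model.logpt).summary()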

Ahhhhh, I know what’s happening now!

The casting of the sparse matrix to dense happens inside the "build_pymc3_model" function that I was running:

self.doc_t_minibatch = pm.Minibatch(self.wordCounts.toarray(), minibatchSize)
self.doc_t = shared(self.wordCounts.toarray()[:minibatchSize], borrow=True)

wordCounts gets expanded to a full dense array inside that function, and that’s why the RAM is exploding. I should only be casting each minibatch to dense as it’s needed; instead I’m densifying the whole thing and then setting up indexes for the minibatches on top of that. Nothing to do with how PyMC3/minibatches work, just my own stupidity.
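
In case anyone else hits this, the direction I’m going in is roughly the following (a sketch, not tested end-to-end): keep the matrix sparse, hold the current batch in a shared variable, and densify only that batch inside a callback, since as far as I can tell pm.Minibatch wants the whole array dense up front.

import numpy as np
import scipy.sparse as sps
from theano import shared

# sparse D x V count matrix stays sparse in RAM; only one batch is ever dense
word_counts = sps.random(10000, 2000, density=0.01, format='csr')
batch_size = 200

# placeholder holding the current (dense) minibatch; build the model on this
doc_t = shared(word_counts[:batch_size].toarray())

def next_batch(approx, losses, i):
    """Callback: pick a random batch of rows and densify only those rows."""
    idx = np.random.randint(0, word_counts.shape[0], batch_size)
    doc_t.set_value(word_counts[idx].toarray())

Then pass callbacks=[next_batch, reduce_rate] to inference.fit and drop the pm.Minibatch / more_replacements bits, so the full dense array is never created.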

Thanks a lot for your help! Would have taken me a lot longer to realise without you :slight_smile:
