Creating Unshuffled/Sequential Minibatches

Hi,

real basic question here - how do I create a minibatch which is not randomly drawn from the original data?

Suppose I have a one-dimensional vector of 1000 floats, and I would like to break it into 10 batches of 100, preserving the original order - can this be done using the typical minibatch constructor?

I attempted to build a new generator for data like this:

def batch(l, group_size):
    # yield consecutive slices of length group_size, preserving order
    for i in range(0, len(l), group_size):
        yield l[i:i + group_size]

data = pm.generator(batch(X, 100))

where X is my (1000, 1) array, so the generator yields batches of shape (100, 1). pm.Minibatch(X, 100) computes successfully (even though it doesn’t do what I want, since it draws batches at random), but this custom generator immediately explodes with the following error:

Average Loss = 28,149:   0%|                                                                 | 0/10000 [00:00<?, ?it/s]

Apply node that caused the error: GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}()
Toposort index: 0
Inputs types: []
Inputs shapes: []
Inputs strides: []
Inputs values: []
Inputs type_num: []
Outputs clients: [[Elemwise{sub,no_inplace}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0, InplaceDimShuffle{x,x}.0), Shape_i{1}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0), Shape_i{0}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0), Elemwise{Sub}[(0, 0)](GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0, InplaceDimShuffle{x,x}.0)]]

Debugprint of the apply node: 
GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>} [id A] <TensorType(float64, matrix)> ''   

So it’s probably not quite right :) Can anybody offer guidance on how to make a minibatch behave the way I want? Thanks!

Here is some example code, I hope you find it helpful. One likely cause of the error above: pm.generator expects a generator that never runs out, and batch() is exhausted after a single pass over the data. The generator below instead loops over index windows forever:

import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm


def gen1(batchsize, totalsize):
    # endlessly yield consecutive index windows: [0..99], [100..199], ...
    i = 0
    k = totalsize // batchsize
    while True:
        yield np.arange(batchsize) + batchsize * i
        i += 1
        # reset the counter so the generator never runs out
        if i == k:
            i = 0


genvar = pm.generator(gen1(100, 1000))

# simulated regression data
X = np.random.randn(1000, 5)
beta = np.random.randn(5, 1)
y = X.dot(beta) + np.random.randn(1000, 1) * .75

Xshared = theano.shared(X)
yshared = theano.shared(y)
with pm.Model() as m:
    b = pm.Normal('b', 0., 100., shape=(5, 1))
    sd = pm.HalfNormal('sd', 5.)
    # index the shared data with the generated indices: each gradient
    # step sees the next sequential batch of 100 rows
    yhat = tt.dot(Xshared[genvar], b)
    # total_size rescales the minibatch log-likelihood to the full dataset
    obs = pm.Normal('y', yhat, sd, observed=yshared[genvar], total_size=1000)
    approx = pm.fit()
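
To check that the batches really are sequential, you can peek at the first few index windows the generator produces:

g = gen1(100, 1000)
print(next(g)[:3])  # [0 1 2]
print(next(g)[:3])  # [100 101 102]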

Hi Junpenglao, this appears to do exactly what I wanted. Thank you for your help! So the key here is that we’re generating the indices for our minibatches, not the actual slices of data themselves, right?

You can also wrap the data itself in the generator so that it yields the actual slices, but I prefer the index-based approach a little more.
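
If you do want the data-wrapping version, a rough (untested) sketch, reusing X, y, tt, and pm from above - note that the two generators advance independently, so this relies on each being evaluated exactly once per gradient step:

def gen_data(data, batchsize):
    # endlessly yield consecutive slices of the data itself
    i = 0
    k = data.shape[0] // batchsize
    while True:
        yield data[i * batchsize:(i + 1) * batchsize]
        i += 1
        if i == k:
            i = 0

Xgen = pm.generator(gen_data(X, 100))
ygen = pm.generator(gen_data(y, 100))

with pm.Model() as m2:
    b = pm.Normal('b', 0., 100., shape=(5, 1))
    sd = pm.HalfNormal('sd', 5.)
    yhat = tt.dot(Xgen, b)
    # total_size again rescales the minibatch likelihood to the full dataset
    obs = pm.Normal('y', yhat, sd, observed=ygen, total_size=1000)
    approx = pm.fit()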