Creating Unshuffled/Sequential Minibatches

nmrobert · April 22, 2018, 5:09pm

Hi,

Real basic question here: how do I create a minibatch that is not randomly drawn from the original data?

Suppose I have a 1-dimensional vector of floats of length 1000 and would like to break it into 10 batches of 100, preserving the original order. Can this be done using the typical minibatch constructor?

I attempted to build a new generator for data like this:

def batch(l, group_size):
    for i in range(0, len(l), group_size):
        yield l[i:i+group_size]
                
data = pm.generator(batch(X, 100))

Here X is my (1000, 1) array, so the generator yields batches of shape (100, 1). However, while pm.Minibatch(X, 100) runs successfully (even though it doesn’t do what I want), this custom generator immediately explodes with the following error:

Average Loss = 28,149:   0%|                                                                 | 0/10000 [00:00<?, ?it/s]

Apply node that caused the error: GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}()
Toposort index: 0
Inputs types: []
Inputs shapes: []
Inputs strides: []
Inputs values: []
Inputs type_num: []
Outputs clients: [[Elemwise{sub,no_inplace}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0, InplaceDimShuffle{x,x}.0), Shape_i{1}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0), Shape_i{0}(GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0), Elemwise{Sub}[(0, 0)](GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>}.0, InplaceDimShuffle{x,x}.0)]]

Debugprint of the apply node: 
GeneratorOp{generator=<pymc3.data.GeneratorAdapter object at 0x000001D488258940>} [id A] <TensorType(float64, matrix)> ''

So it’s probably not quite right. Can anybody offer guidance on how to make a minibatch behave the way I want? Thanks!

junpenglao · April 22, 2018, 6:21pm

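Here is some example code; I hope you find it helpful:

import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt


def gen1(batchsize, totalsize):
    # Yield consecutive index blocks: 0..99, 100..199, ..., then start over.
    i = 0
    k = totalsize // batchsize
    while True:
        yield np.arange(batchsize) + batchsize * i
        i += 1
        # reset counter
        if i == k:
            i = 0


# The generator variable holds the indices of the current batch.
genvar = pm.generator(gen1(100, 1000))

# Simulated regression data
X = np.random.randn(1000, 5)
beta = np.random.randn(5, 1)
y = X.dot(beta) + np.random.randn(1000, 1) * .75

# Keep the full data in shared variables and index them with the batch indices.
Xshared = theano.shared(X)
yshared = theano.shared(y)
with pm.Model() as m:
    b = pm.Normal('b', 0., 100., shape=(5, 1))
    sd = pm.HalfNormal('sd', 5.)
    yhat = tt.dot(Xshared[genvar], b)
    obs = pm.Normal('y', yhat, sd, observed=yshared[genvar])
    approx = pm.fit()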
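
To see what the index generator actually produces, here is a quick check outside the model. It is only a sketch using NumPy and itertools; `gen1` is repeated so the snippet runs on its own, and the `batches` name is just for illustration.

```python
import itertools

import numpy as np


def gen1(batchsize, totalsize):
    # Same index generator as above, repeated so this snippet is standalone.
    i = 0
    k = totalsize // batchsize
    while True:
        yield np.arange(batchsize) + batchsize * i
        i += 1
        if i == k:
            i = 0


# Take 11 batches: ten cover the data once, the eleventh wraps around.
batches = list(itertools.islice(gen1(100, 1000), 11))

print(batches[0][0], batches[0][-1])    # first batch spans indices 0 to 99
print(batches[9][0], batches[9][-1])    # tenth batch spans indices 900 to 999
print(batches[10][0])                   # eleventh batch starts again at 0
```

The batches are contiguous and ordered, and the generator never runs out, so it can be drawn from for as many iterations as the fit needs.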

nmrobert · April 22, 2018, 6:35pm

Hi Junpenglao, this appears to do exactly what I wanted. Thank you for your help! So the key here is that we’re generating the indices for our minibatches, not the actual slices of data themselves, right?

junpenglao · April 22, 2018, 6:50pm

You can also wrap the data in the generator to get the actual slice of data, but I prefer this way a little bit more.
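
A minimal sketch of that alternative, where the generator yields the data slices themselves. The name `sequential_batches` is made up for illustration, and the part shown here is plain NumPy. One assumption worth flagging: unlike the `batch` function in the question, this version loops forever. A generator that stops after ten batches will run dry long before a 10000-iteration fit finishes, which may be what caused the error above, though that is not confirmed in this thread.

```python
import numpy as np


def sequential_batches(data, batchsize):
    """Yield consecutive slices of `data` in their original order, forever."""
    n_batches = len(data) // batchsize  # any leftover rows at the end are dropped
    while True:
        for i in range(n_batches):
            yield data[i * batchsize:(i + 1) * batchsize]


# Stand-in for the (1000, 1) vector from the question.
X = np.arange(1000, dtype=float).reshape(1000, 1)

gen = sequential_batches(X, 100)
first = next(gen)
second = next(gen)
for _ in range(8):          # consume batches 3 through 10
    last = next(gen)
wrapped = next(gen)         # 11th draw starts over at the beginning

print(first.shape)                      # each batch has shape (100, 1)
print(first[0, 0], second[0, 0])        # batches start at rows 0 and 100
print(last[-1, 0], wrapped[0, 0])       # pass ends at row 999, then restarts at 0
```

The generator would then be handed to `pm.generator(sequential_batches(X, 100))` in place of the index generator; that step is not run here.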
