Dealing with Unbalanced Data with Minibatches


#1

Hello,

I’m new to pymc and just looking for some guidance.

I’m currently attempting to build a simple logistic regression to predict the probability of a good (y=0) or bad (y=1) outcome. However, my target is highly imbalanced (2% 1s and 98% 0s), and my dataset is quite large: 1.5 million rows and 50 predictors. Running NUTS takes approximately 6 hours on default settings (1000 samples + 500 burn-in). I can reduce this to 2 hours if I use ADVI, but that is still not ideal. I’ve therefore been looking at using mini-batches to deal with the large dataset, but I’m not sure how to use them correctly with the high imbalance in the target classes. I’ve done some initial testing, and the only way I can achieve a sensible result is with a very large batch size (on the order of 50,000), which I guess defeats the purpose of using mini-batches.
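
For a sense of scale, here is a rough back-of-the-envelope sketch of why small random batches struggle with a 2% positive rate. It assumes each batch element is drawn independently (a binomial approximation), and the function name is only illustrative.

```python
# Assumption: each element of a batch is a positive (y=1) independently
# with probability `positive_rate`; this is an approximation, not PyMC code.
def positives_per_batch(batch_size, positive_rate=0.02):
    """Return (expected number of positives, probability of no positives)."""
    expected = batch_size * positive_rate
    p_none = (1.0 - positive_rate) ** batch_size
    return expected, p_none

for size in (100, 500, 50_000):
    expected, p_none = positives_per_batch(size)
    print(f"batch_size={size}: expected positives={expected:.0f}, "
          f"P(no positives)={p_none:.3g}")
```

Under that assumption, a batch of 100 contains only 2 positives on average, and roughly 13% of such batches contain none at all, which may be why only very large batches gave sensible results.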

I’ve also tried down-sampling my dataset, which gives me ‘good’ performance with regard to ranking (e.g. AUROC); however, the model becomes highly miscalibrated as a result.
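
One common remedy for this kind of miscalibration, sketched here under assumptions that are not from the thread: if all y=1 rows are kept and only a known fraction of y=0 rows is retained, the odds in the down-sampled data are inflated by one over that fraction, so predicted log-odds can be shifted back by the log of the fraction. The function name and arguments below are hypothetical, and the correction is only exact when the logistic model is otherwise well specified.

```python
import math

def undo_downsampling(p_downsampled, keep_fraction):
    """Map a probability from a model fit on down-sampled data back to the
    original class balance.

    keep_fraction: fraction of majority-class (y=0) rows that were kept;
    all minority-class (y=1) rows are assumed to have been kept.
    """
    logit = math.log(p_downsampled / (1.0 - p_downsampled))
    corrected = logit + math.log(keep_fraction)  # same as shifting the intercept
    return 1.0 / (1.0 + math.exp(-corrected))

# Example: zeros were thinned until the classes were balanced.
keep = 0.02 / 0.98
print(undo_downsampling(0.5, keep))
```

In this toy example a prediction of 0.5 on the balanced sample maps back to the 2% base rate, while the ranking of predictions (and hence AUROC) is unchanged because the shift is monotonic.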

Just wondering what might be my best way forward here.


#2

Maybe it is possible to create a better minibatch that always contains both 0s and 1s. Following the docstring in Minibatch:

To be more concrete about how we get minibatch, here is a demo

  1. create shared variable
     >>> shared = theano.shared(data)

  2. create random slice of size 10
     >>> ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')

  3. take that slice
     >>> minibatch = shared[ridx]

That’s done. Next you can use this minibatch somewhere else. You can see that implementation does not require fixed shape for shared variable. Feel free to use that if needed.
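
To see what the three steps do without Theano, here is a plain NumPy analogue. It is only a sketch: the toy `data` array and the seeded generator are assumptions made for illustration, and nothing here is symbolic, so the batch is drawn once rather than on every evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(1000).reshape(100, 10)   # toy data: 100 rows, 10 columns

# Draw floats in [0, n_rows - 1e-10) and truncate them to integers, so every
# index lands in 0 .. n_rows - 1 and can never equal n_rows.
ridx = rng.uniform(low=0, high=data.shape[0] - 1e-10, size=10).astype("int64")

# Fancy indexing picks the sampled rows; rows may repeat (with replacement).
minibatch = data[ridx]
print(ridx, minibatch.shape)
```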

The idea is to construct ridx more carefully so that it always indexes a batch with around 2% 1s and 98% 0s. For example, with a batch size of 100, at least 2 of the cases would always be 1s.


#3

Thanks @junpenglao, I will try and have a go at implementing your suggestion and report back here, but it looks like it will do the trick.


#4

In the interests of continuity, I just wanted to keep this discussion going in case anyone else finds this useful. I’ve taken a stab at defining a new minibatch that deals with imbalanced data, and this is what I’ve come up with.

def sample_imbalancedix(ys, batch_size=500):
    """
    Attempts to randomly sample an imbalanced dataset while maintaining
    the ratio between minority and majority classes.
    """
    # How many of the batch_size samples should come from the minority class?
    # Note: int() truncates, so a small batch_size can give im_size == 0.
    im_size = int(ys.sum()/len(ys)*batch_size)
    
    # group the indices for the two classes
    ix_ones = np.where(ys > 0)[0]
    ix_zeros = np.where(ys == 0)[0]
    
    # sample (with replacement) from each index group according to the proportion above
    ones_sample = np.random.choice(ix_ones, im_size)
    zeros_sample = np.random.choice(ix_zeros, batch_size-im_size)
    
    # return the join of both index samples (ones first, then zeros)
    return np.concatenate([ones_sample, zeros_sample])

# Generator that returns mini-batches in each iteration
def create_minibatches(X, Y, batch_size):
    """
    Shamelessly ripped from:
    https://docs.pymc.io/notebooks/constant_stochastic_gradient.html
    """
    while True:
        # Return a class-stratified random batch of `batch_size` rows each iteration
        ixs = sample_imbalancedix(Y, batch_size)
        yield (X[ixs], Y[ixs])
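
A possible variant of the index sampler, offered as a sketch rather than a tested replacement: it rounds instead of truncating and forces at least one minority sample per batch. The function name, the `rng` argument, and the toy labels are all assumptions for illustration, and it presumes the labels are coded 0/1. Forcing a positive into very small batches over-represents the minority class slightly.

```python
import numpy as np

def stratified_batch_indices(ys, batch_size=500, rng=None):
    """Sample row indices so each batch roughly keeps the overall class ratio."""
    rng = np.random.default_rng() if rng is None else rng
    ys = np.asarray(ys)
    ix_ones = np.flatnonzero(ys > 0)
    ix_zeros = np.flatnonzero(ys == 0)
    # Keep the overall ratio, but never let the minority count drop to zero.
    n_ones = max(1, int(round(ys.mean() * batch_size)))
    ones = rng.choice(ix_ones, size=n_ones)                  # with replacement
    zeros = rng.choice(ix_zeros, size=batch_size - n_ones)   # with replacement
    # Shuffle so the positives are not always at the front of the batch.
    return rng.permutation(np.concatenate([ones, zeros]))

# Toy check: 10,000 labels with 2% ones.
y = np.zeros(10_000)
y[:200] = 1
ix = stratified_batch_indices(y, batch_size=100, rng=np.random.default_rng(0))
print(len(ix), int(y[ix].sum()))
```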

I’ve yet to test this, as I haven’t had access to my work computer, but I just wanted to run it by the experts here to check that I’m on the right track?


#5

Just to continue my progress here, I’ve only just had time again to come back to this problem, and while I’m naively confident that the above proportional indexing is working with the create_minibatches generator function, I’m entirely lost as to how this connects with the rest of the model specification and subsequent approximation with ADVI.

What I’m lost with is how the shared variables connect to my defined minibatches that are passed to pm.ADVI. As is, there is no difference in approximation time between using the minibatches or not. Are my arguments to the approximation function correct? Also, am I creating my minibatches correctly; i.e. can I sample my X and Y data concurrently, or do these need to be two separate functions?

Code posted below.


import pymc3 as pm
import theano.tensor as tt
import theano
import numpy as np
from six.moves import zip
from sklearn.model_selection import train_test_split

### --------------------------
### data import and pre-processing here
### X has been z-scored
### --------------------------

X_train, X_test, y_train, y_test = train_test_split(features_zscored, target, test_size=0.33)

# define minibatches
minibatches = create_minibatches(X_train, y_train, 500)
total_size, n_betas = X_train.shape

# set as shared variables
X_tt = theano.shared(X_train.astype(np.float64))
y_tt = theano.shared(y_train.astype(np.float64))

# specify GLM
with pm.Model() as model:
    # intercept or bias -> could actually make this 0
    alpha = pm.Normal('alpha', mu=0, sd=1)
    # define priors on feature weights
    betas = pm.Laplace("betas", 0, b=0.1, shape=n_betas, testval=np.random.normal())
    # linear model exponent
    theta = tt.dot(X_tt, betas) + alpha
    # predicted probability
    p = tt.nnet.sigmoid(theta)
    # predicted outcome or likelihood
    outcome= pm.Bernoulli('outcome', p=p, observed=y_tt, total_size=total_size)

# approximate posterior
with model:
    advi_kwargs = {
            'minibatches': zip(minibatches),
            'total_size': total_size,
            }
    inference = pm.ADVI(**advi_kwargs)
    approx = pm.fit(50000, method=inference)

# sample approximation
trace = approx.sample(draws=1000)

#6

In your code above you are not using minibatch - you are using the whole training set.

Try something like:

X_tt, y_tt = create_minibatches(X_train, y_train, 500)

#7

So that doesn’t work, because of the yield.

>>> X_tt, y_tt = create_minibatches(X_train, y_train, 500)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-33-7afc44de8e7f> in <module>()
----> 1 X_tt, y_tt = create_minibatches(X_train, y_train, 500)

ValueError: too many values to unpack (expected 2)
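
The error itself can be reproduced with a toy generator. This sketch uses a hypothetical stand-in for create_minibatches that yields string placeholders instead of arrays: unpacking the generator object tries to pull every yielded item into two names, whereas next() returns a single (x, y) tuple.

```python
def pair_generator():
    """Toy stand-in for create_minibatches: yields (x, y) pairs forever."""
    i = 0
    while True:
        yield (f"X_batch_{i}", f"y_batch_{i}")
        i += 1

try:
    x, y = pair_generator()      # tries to unpack the generator itself
except ValueError as err:
    unpack_error = str(err)
    print("unpacking the generator failed:", unpack_error)

gen = pair_generator()
x, y = next(gen)                 # unpack one yielded (x, y) tuple instead
print(x, y)
```

Note that this only explains the unpacking error. A batch obtained with next() is a fixed array, so on its own it would not give ADVI a fresh batch at each step.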

A messy hack of mini-batching X and Y separately works…

def create_y_minibatches(X, Y, batch_size):
    while True:
        # Return a random batch of `batch_size` labels each iteration
        ixs = sample_imbalancedix(Y, batch_size)
        yield Y[ixs]
    
def create_x_minibatches(X, Y, batch_size):
    while True:
        # Return a random batch of `batch_size` rows each iteration.
        # Note: these indices are drawn independently of create_y_minibatches,
        # so the X and y batches are not row-aligned.
        ixs = sample_imbalancedix(Y, batch_size)
        yield X[ixs]

X_tt = create_x_minibatches(X_train, y_train, 500)
y_tt = create_y_minibatches(X_train, y_train, 500)

But then when I try and initialise the model specification, it doesn’t know how to convert the generator type to tensor type (I think). I get a bunch of errors with the last line of the trace saying:

with pm.Model() as model:
     ....
    # etc.

AsTensorError                             Traceback (most recent call last)
<ipython-input-37-e1608dba4e01> in <module>()
      5     betas = pm.Laplace("betas", 0, b=0.1, shape=n_betas, testval=np.random.normal())
      6     # linear model exponent
----> 7     theta = pm.math.dot(X_tt, betas) + alpha
      8     # predicted probability
      9     p = tt.nnet.sigmoid(theta)

AsTensorError: ('Cannot convert <generator object create_x_minibatches at 0x7f0a71599fc0> to TensorType', <class 'generator'>)

#8

A somewhat tedious way to do so is to create a minibatch for the 0s and the 1s separately and concatenate them using tt.concatenate.


#9

I guess what I’m a bit confused about is that compared to the code on the SGD examples page (ref: In [11] - [14]) my minibatch generator and consequent usage of it isn’t too dissimilar (if at all) from what’s written there.

https://docs.pymc.io/notebooks/constant_stochastic_gradient.html

In these examples, the minibatches aren’t used in the model specification, but rather in the definition of the step method. The only thing that I can think of is that X_tt and y_tt (ref: my code) are only used in the model specification to define the size of ‘X’ and ‘y’, and then are replaced by the minibatches in the approximation/sampling?


#10

Right, I see, the usage is a bit different. Using create_minibatches as defined above, you can try the more_replacements argument of fit, e.g. approx = pm.fit(more_replacements={X_tt: minibatch_x, y_tt: minibatch_y}, ...)


#11

Yep so I can see what you’re saying. So I’ve tried this:

from six.moves import zip

with model:
    inference = pm.ADVI(minibatches=zip(X_minibatch, y_minibatch), total_size=total_size)    
    approx = pm.fit(50000, method=inference, more_replacements={X_tt:X_minibatch, y_tt:y_minibatch})

which throws a type error:

TypeError: Cannot convert Type Generic (of Variable <Generic>) into Type TensorType(float64, vector). You can try to manually convert <Generic> into a TensorType(float64, vector).

I thought this might be because I haven’t explicitly converted my inputs, X_train and y_train to float64, so I’ve changed my code to do this:

# convert all types to float for theano
X_tt = theano.shared(pm.floatX(X_train))
y_tt = theano.shared(pm.floatX(y_train))

minibatch_x = create_x_minibatches(pm.floatX(X_train), pm.floatX(y_train), 500)
minibatch_y = create_y_minibatches(pm.floatX(X_train), pm.floatX(y_train), 500)

However, I still get the same type error when I try to run the fitting.


#12

pm.ADVI does not take minibatches and total_size as input kwargs; they would just be ignored.

The correct syntax is along this line:

Xbatch, ybatch = create_minibatches(X_train, y_train, 500)
X_tt = theano.shared(X_train.astype(np.float64))
y_tt = theano.shared(y_train.astype(np.float64))
...
# define the model using X_tt and y_tt
with model:
    inference = pm.ADVI()    
    approx = inference.fit(50000, more_replacements={X_tt: Xbatch, y_tt: ybatch})

#13

This still results in the same error, so I can only guess that something is going wrong with the conversion between the shared variables and the minibatches. My feeling is that maybe my sample_imbalancedix function is the issue… although it seems fine when I’ve tested it in isolation.

I guess my next question is what causes something to be ‘Type Generic’? I’ve attempted to force the minibatches to return type np.float64, passing it the same type, etc. Somewhere along the way though, something is getting cast to a Generic type.


#14

I would guess so as well. Try replacing the function with the original implementation in the notebook, then try rewriting your function in theano using e.g.:

i0_sample  = pm.tt_rng().uniform(size=(10,), low=0, high=Y.shape[0]-1e-10).astype('int64')