I’m new to pymc and just looking for some guidance.
I’m currently attempting to build a simple logistic regression to predict the probability of a good (y=0) or bad (y=1) outcome. However, my target is highly imbalanced (roughly 2% 1s and 98% 0s) and my dataset is quite large: 1.5 million rows and 50 predictors. Running NUTS takes approximately 6 hours on default settings (1000 samples + 500 burn-in). I can reduce this to about 2 hours with ADVI, but that is still not ideal. I’ve therefore been looking at using mini-batches to deal with the large dataset, but I’m not sure how to use them correctly given the high imbalance in the target classes. From some initial testing, the only way I can get a sensible result is with a very large batch size (in the order of 50,000), which I guess defeats the purpose of using mini-batches.
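For reference, the plain (non-stratified) mini-batch setup I’ve been experimenting with looks roughly like the sketch below (not my exact code; X and y here are synthetic stand-ins for the real design matrix and target):

# rough sketch of a plain pm.Minibatch + ADVI setup (X, y are synthetic stand-ins)
import numpy as np
import pymc3 as pm
import theano.tensor as tt

X = np.random.randn(10000, 50)
y = (np.random.rand(10000) < 0.02).astype('float64')

batch = 500
# use the same random_seed on both so the X and y row slices stay aligned
X_mb = pm.Minibatch(X, batch_size=batch, random_seed=42)
y_mb = pm.Minibatch(y, batch_size=batch, random_seed=42)

with pm.Model() as plain_model:
    alpha = pm.Normal('alpha', mu=0, sd=1)
    betas = pm.Laplace('betas', mu=0, b=0.1, shape=X.shape[1])
    p = tt.nnet.sigmoid(tt.dot(X_mb, betas) + alpha)
    pm.Bernoulli('outcome', p=p, observed=y_mb, total_size=len(y))
    approx = pm.fit(50000, method=pm.ADVI())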
I’ve also tried down-sampling my dataset, which gives me ‘good’ performance in terms of ranking (e.g. AUROC); however, the model becomes highly miscalibrated as a result.
Just wondering what might be my best way forward here.
# create a random slice of size 10
ridx = pm.tt_rng().uniform(size=(10,), low=0, high=data.shape[0]-1e-10).astype('int64')
# take that slice
minibatch = shared[ridx]
That’s done. You can then use this minibatch somewhere else. Note that this implementation does not require a fixed shape for the shared variable. Feel free to use it if needed.
The idea is to construct ridx so that it always indexes a batch with around 2% 1s and 98% 0s; for example, with a batch size of 100, always at least 2 cases that are 1s.
In the interests of continuity, I just wanted to keep this discussion going in case anyone else finds it useful. I’ve taken a stab at defining a new minibatch sampler that deals with imbalanced data, and this is what I’ve come up with.
def sample_imbalancedix(ys, batch_size=500):
    """
    Randomly samples indices from an imbalanced dataset while maintaining
    the ratio between minority and majority classes.
    """
    # how many rows of the batch should come from the minority class?
    im_size = int(ys.sum() / len(ys) * batch_size)
    # group the indices for the two classes
    ix_ones = np.where(ys > 0)[0]
    ix_zeros = np.where(ys == 0)[0]
    # sample from each index pool according to the proportion defined above
    i1_sample = np.random.choice(ix_ones, im_size)
    i0_sample = np.random.choice(ix_zeros, batch_size - im_size)
    # return the join of both index samples
    return np.concatenate([i1_sample, i0_sample])
# Generator that returns mini-batches in each iteration
def create_minibatches(X, Y, batch_size):
    """
    Shamelessly ripped from:
    https://docs.pymc.io/notebooks/constant_stochastic_gradient.html
    """
    while True:
        # return a random, class-proportional sample of `batch_size` rows each iteration
        ixs = sample_imbalancedix(Y, batch_size)
        yield (X[ixs], Y[ixs])
I’ve yet to test this, as I haven’t had access to my work computer, but I just wanted to run it by the experts here: am I on the right track?
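When I do get back to my machine, I’m planning a quick sanity check along these lines (y_toy and X_toy are just made-up stand-ins, so this is only a sketch):

# quick sanity check on a made-up target with ~2% positives (sketch only)
import numpy as np

y_toy = (np.random.rand(100000) < 0.02).astype(int)
ixs = sample_imbalancedix(y_toy, batch_size=500)
print(len(ixs), y_toy[ixs].mean())    # expect 500 indices and a mean of roughly 0.02

X_toy = np.random.randn(100000, 5)
Xb, yb = next(create_minibatches(X_toy, y_toy, 500))
print(Xb.shape, yb.mean())            # expect (500, 5) and roughly 0.02 again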
Just to continue my progress here: I’ve only now had time to come back to this problem, and while I’m naively confident that the proportional indexing above works with the create_minibatches generator, I’m entirely lost as to how it connects with the rest of the model specification and the subsequent ADVI approximation.
Specifically, I don’t understand how the shared variables connect to the minibatches I pass to pm.ADVI. As it is, there is no difference in approximation time whether I use the minibatches or not. Are my arguments to the approximation function correct? And am I creating my minibatches correctly, i.e. can I sample my X and Y data together, or do they need to be two separate generators?
Code posted below.
import pymc3 as pm
import theano.tensor as tt
import theano
import numpy as np
from six.moves import zip
from sklearn.model_selection import train_test_split
### --------------------------
### data import and pre-processing here
### X has been z-scored
### --------------------------
X_train, X_test, y_train, y_test = train_test_split(features_zscored, target, test_size=0.33)
# define minibatches
minibatches = create_minibatches(X_train, y_train, 500)
total_size, n_betas = X_train.shape
# set as shared variables
X_tt = theano.shared(X_train.astype(np.float64))
y_tt = theano.shared(y_train.astype(np.float64))
# specify GLM
with pm.Model() as model:
    # intercept or bias -> could actually make this 0
    alpha = pm.Normal('alpha', mu=0, sd=1)
    # define priors on feature weights
    betas = pm.Laplace("betas", 0, b=0.1, shape=n_betas, testval=np.random.normal())
    # linear model exponent
    theta = tt.dot(X_tt, betas) + alpha
    # predicted probability
    p = tt.nnet.sigmoid(theta)
    # predicted outcome or likelihood
    outcome = pm.Bernoulli('outcome', p=p, observed=y_tt, total_size=total_size)
# approximate posterior
with model:
    advi_kwargs = {
        'minibatches': zip(minibatches),
        'total_size': total_size,
    }
    inference = pm.ADVI(**advi_kwargs)
    approx = pm.fit(50000, method=inference)
# sample approximation
trace = approx.sample(draws=1000)
>>> X_tt, y_tt = create_minibatches(X_train, y_train, 500)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-7afc44de8e7f> in <module>()
----> 1 X_tt, y_tt = create_minibatches(X_train, y_train, 500)
ValueError: too many values to unpack (expected 2)
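(In hindsight, I think the unpacking fails because the generator yields (X, y) tuples endlessly, so Python can’t unpack the generator object itself into two names; next() pulls out a single batch that can be unpacked, although that still doesn’t give theano a tensor to work with.)

mb = create_minibatches(X_train, y_train, 500)
X_batch, y_batch = next(mb)    # next() returns one (X, y) tuple, which unpacks fine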
A messy hack of mini-batching X and Y separately works…
def create_y_minibatches(X, Y, batch_size):
    while True:
        # return a random, class-proportional sample of `batch_size` targets each iteration
        ixs = sample_imbalancedix(Y, batch_size)
        yield Y[ixs]

def create_x_minibatches(X, Y, batch_size):
    while True:
        # return a random, class-proportional sample of `batch_size` rows each iteration
        ixs = sample_imbalancedix(Y, batch_size)
        yield X[ixs]
X_tt = create_x_minibatches(X_train, y_train, 500)
y_tt = create_y_minibatches(X_train, y_train, 500)
But then when I try to initialise the model specification, it doesn’t know how to convert the generator type to a tensor type (I think). I get a bunch of errors, with the last line of the traceback saying:
with pm.Model() as model:
....
# etc.
AsTensorError Traceback (most recent call last)
<ipython-input-37-e1608dba4e01> in <module>()
5 betas = pm.Laplace("betas", 0, b=0.1, shape=n_betas, testval=np.random.normal())
6 # linear model exponent
----> 7 theta = pm.math.dot(X_tt, betas) + alpha
8 # predicted probability
9 p = tt.nnet.sigmoid(theta)
AsTensorError: ('Cannot convert <generator object create_x_minibatches at 0x7f0a71599fc0> to TensorType', <class 'generator'>)
I guess what I’m a bit confused about is that my minibatch generator, and the way I use it, isn’t really any different from the code on the SGD examples page (ref: In [11] - [14]).
In those examples the minibatches aren’t used in the model specification, but rather in the definition of the step method. The only thing I can think of is that X_tt and y_tt (in my code) are only used in the model specification to define the shapes of X and y, and are then replaced by the minibatches during the approximation/sampling?
Right, I see, the usage is a bit different. With create_minibatches as defined above, you can try the more_replacements argument in fit, e.g. approx = pm.fit(more_replacements={X_tt: minibatch_x, y_tt: minibatch_y}, ...)
Yep, I can see what you’re saying, so I’ve tried this:
from six.moves import zip
with model:
    inference = pm.ADVI(minibatches=zip(minibatch_x, minibatch_y), total_size=total_size)
    approx = pm.fit(50000, method=inference, more_replacements={X_tt: minibatch_x, y_tt: minibatch_y})
which throws a type error:
TypeError: Cannot convert Type Generic (of Variable <Generic>) into Type TensorType(float64, vector). You can try to manually convert <Generic> into a TensorType(float64, vector).
I thought this might be because I haven’t explicitly converted my inputs, X_train and y_train, to float64, so I’ve changed my code to do this:
# convert all types to float for theano
X_tt = theano.shared(pm.floatX(X_train))
y_tt = theano.shared(pm.floatX(y_train))
minibatch_x = create_x_minibatches(pm.floatX(X_train), pm.floatX(y_train), 500)
minibatch_y = create_y_minibatches(pm.floatX(X_train), pm.floatX(y_train), 500)
However, I still get the same type error when I try to run the fitting.
This still results in the same error, so I can only guess that something is going wrong in the conversion between the shared variables and the minibatches. My feeling is that my sample_imbalancedix function might be the issue… although it seems fine when I’ve tested it in isolation.
I guess my next question is: what causes something to be ‘Type Generic’? I’ve attempted to force the minibatches to return np.float64, passed them inputs of the same type, etc. Somewhere along the way, though, something is being cast to a Generic type.
I would guess so as well. Try replacing the function with the original implementation in the notebook first, then try rewriting your function in theano, e.g.:
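Something along these lines might work (an untested sketch; ones_pool, batch_idx, X_batch and so on are just placeholder names I’m using here):

# untested sketch: a theano version of the stratified index sampler
import numpy as np
import theano
import theano.tensor as tt
import pymc3 as pm

batch_size = 500
ones_pool = np.where(y_train > 0)[0].astype('int64')
zeros_pool = np.where(y_train == 0)[0].astype('int64')

# keep the minority proportion, but always at least 2 positives per batch
n_ones = max(int(round(len(ones_pool) / len(y_train) * batch_size)), 2)
n_zeros = batch_size - n_ones

ix_ones = theano.shared(ones_pool)
ix_zeros = theano.shared(zeros_pool)

# random positions into each index pool, redrawn on every evaluation
rng = pm.tt_rng()
r_ones = rng.uniform(size=(n_ones,), low=0, high=len(ones_pool) - 1e-10).astype('int64')
r_zeros = rng.uniform(size=(n_zeros,), low=0, high=len(zeros_pool) - 1e-10).astype('int64')

# symbolic, class-stratified batch index
batch_idx = tt.concatenate([ix_ones[r_ones], ix_zeros[r_zeros]])

# slice separate shared copies of the data (much like pm.Minibatch keeps its own
# shared copy), so the replacements below don't refer back to X_tt / y_tt themselves
X_data = theano.shared(pm.floatX(X_train))
y_data = theano.shared(pm.floatX(y_train))
X_batch = X_data[batch_idx]
y_batch = y_data[batch_idx]

with model:
    # total_size on the observed RV already rescales the likelihood to the full dataset
    approx = pm.fit(50000, method=pm.ADVI(),
                    more_replacements={X_tt: X_batch, y_tt: y_batch})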