I’m new to PyMC and just looking for some guidance.
I’m currently attempting to build a simple logistic regression to predict the probability of a good (y=0) or bad (y=1) outcome. However, my target is highly imbalanced — about 2% 1’s and 98% 0’s — and my dataset is quite large: 1.5 million rows and 50 predictors. Running NUTS takes approximately 6 hours on default settings (1,000 draws + 500 tuning). I can reduce this to 2 hours using ADVI, but that’s still slower than I’d like. I’ve therefore been looking at mini-batches to deal with the large dataset, but I’m not sure how to use them correctly given the high imbalance in the target classes. In my initial testing, the only way I can get a sensible result is with a very large batch size (on the order of 50,000), which I guess defeats the purpose of using mini-batches.
I’ve also tried down-sampling my dataset, which gives ‘good’ performance in terms of ranking (e.g. AUROC), but the model becomes highly miscalibrated as a result.
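One thing I’ve read about but haven’t tried yet is the prior-correction for case-control (down-sampled) data, where you shift the fitted intercept by the log-odds ratio between the sampled and true event rates — a sketch of my understanding (the function name and inputs here are my own, not from any library):

```python
import numpy as np

def correct_intercept(b0_downsampled, tau, ybar):
    """Prior-correction for an intercept fit on down-sampled data.

    b0_downsampled : intercept estimated on the down-sampled training set
    tau            : true event rate in the population (here ~0.02)
    ybar           : event rate in the down-sampled training set
    """
    return b0_downsampled - np.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

# e.g. intercept 0.3 fit on a 50/50 down-sample, true rate 2%
b0 = correct_intercept(0.3, tau=0.02, ybar=0.5)  # ≈ -3.59
```

If this is a reasonable way to restore calibration after down-sampling, it might let me keep the smaller dataset and the shorter fit times.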
Just wondering what the best way forward might be here.