Best Practices for Large-Scale Regression

Hi all,

I’m trying to fit a logistic regression model to a dataset of 15 million rows and 11,000 variables. I have been playing around in an interactive notebook with toy datasets of tens of thousands of data points and 10–100 variables, and even these are slow enough that my target regression seems rather infeasible, even though it will be run on a cluster.
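
For concreteness, my toy runs look roughly like the sketch below (simplified, in PyMC3, with synthetic data standing in for my real features):

```python
import numpy as np
import pymc3 as pm

# Synthetic stand-in for one of the toy datasets:
# ~20k rows, 50 features (the real data is 15M rows x 11k features).
rng = np.random.default_rng(0)
N, K = 20_000, 50
X = rng.normal(size=(N, K))
true_beta = rng.normal(scale=0.5, size=K)
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sd=5.0)
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=K)
    # logit_p avoids building the sigmoid by hand
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X, beta), observed=y)
    trace = pm.sample(1000, tune=1000, cores=4)
```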

Are there any best practices to use when fitting models of this scale?

Thank you!

In my experience (though I certainly hope someone has had a different experience and can offer you some effective best practices), anything larger than ~100 variables and ~4,000 data points is going to kill any hope of even a basic regression in PyMC3.

I once attempted to get a regression like the one you describe going and spent a couple of months on it, making sure to A) parametrize all my variables correctly, B) use slightly more informative priors where possible just to make the model run, C) cut my variables from the original 400+ down to the 100 most informative ones, and D) run everything on a Google Compute cluster. Even then, each run would barely finish after 8+ hours, and that was the absolute limit.
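
To make A) and B) a bit more concrete, here is a rough, hypothetical sketch of the kind of setup I mean (not my original code; standardizing the predictors is one way to get the parameterization right, and the prior scales are illustrative):

```python
import numpy as np
import pymc3 as pm

# Illustrative data: ~100 retained features on a few thousand rows.
rng = np.random.default_rng(42)
n, k = 4_000, 100
X = rng.normal(size=(n, k))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# A) standardize predictors so one prior scale is reasonable for all
#    coefficients; B) weakly informative priors keep the posterior
#    geometry manageable for the sampler.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sd=2.5)
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=k)
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X_std, beta), observed=y)
    trace = pm.sample(1000, tune=1000, target_accept=0.9, cores=4)
```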

Unfortunately, the numbers you are quoting are pure-ML territory, and as far as I know Bayesian statistics can’t deal with datasets of that size yet.