Best Practices for Large-Scale Regression

Hi all,

I’m trying to fit a logistic regression model to a dataset of 15 million rows and 11,000 variables. I have been playing around in an interactive notebook with toy datasets of tens of thousands of data points and 10–100 variables, and even these are slow enough that my target regression seems rather infeasible, even though it will be run on a cluster.
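
For concreteness, my toy runs look roughly like the sketch below (simplified, in PyMC3, with synthetic data standing in for my real features):

```python
import numpy as np
import pymc3 as pm

# Synthetic stand-in for one of the toy datasets:
# ~20k rows, 50 features (the real data is 15M rows x 11k features).
rng = np.random.default_rng(0)
N, K = 20_000, 50
X = rng.normal(size=(N, K))
true_beta = rng.normal(scale=0.5, size=K)
y = rng.binomial(1, 1 / (1 + np.exp(-(X @ true_beta))))

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sd=5.0)
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=K)
    # logit_p avoids building the sigmoid by hand
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X, beta), observed=y)
    trace = pm.sample(1000, tune=1000, cores=4)
```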

Are there any best practices to use when fitting models of this scale?

Thank you!

In my experience (though I certainly hope someone has had a different experience and can offer you some effective best practices), anything larger than ~100 variables and ~4,000 data points is going to kill any hope of even a basic regression in PyMC3.

I once attempted to get a regression like the one you describe going and spent a couple of months on it, making sure to A) parametrize all my variables correctly, B) use slightly more informative priors where possible just to make the model run, C) cut my variables from the original 400+ down to the 100 most informative ones, and D) run everything on a Google Compute cluster. Even then, each run would barely finish after 8+ hours, and that was the absolute limit.
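
To make A) and B) a bit more concrete, here is a rough, hypothetical sketch of the kind of setup I mean (not my original code; standardizing the predictors is one way to get the parameterization right, and the prior scales are illustrative):

```python
import numpy as np
import pymc3 as pm

# Illustrative data: ~100 retained features on a few thousand rows.
rng = np.random.default_rng(42)
n, k = 4_000, 100
X = rng.normal(size=(n, k))
y = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))

# A) standardize predictors so one prior scale is reasonable for all
#    coefficients; B) weakly informative priors keep the posterior
#    geometry manageable for the sampler.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

with pm.Model():
    intercept = pm.Normal("intercept", mu=0.0, sd=2.5)
    beta = pm.Normal("beta", mu=0.0, sd=1.0, shape=k)
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X_std, beta), observed=y)
    trace = pm.sample(1000, tune=1000, target_accept=0.9, cores=4)
```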

Unfortunately, the numbers you are quoting are pure-ML territory, and as far as I know Bayesian statistics can’t deal with datasets of that size yet.