So, having trawled through these forums for the last few months, and having asked a few questions and received good answers, I decided to try to use all I had learned and fully tackle Kaggle’s Housing Price challenge - because isn’t a Kaggle competition the prime place to test one’s skills? Below, I will list all I have done so far, describe the computational problem I’m dealing with, and I would appreciate any comments on whether my difficulty is standard fare for Bayesian methods, or if I am making mistakes somewhere along the way.
-
Introduction
This challenge is to predict housing prices based on a dataset of 81 variables and a sample size of 1460 points. Loading in the training data, nothing is out of the ordinary for information describing every possible aspect of some real estate.
-
Data Preprocessing
I next did the following (a rough sketch of these steps is included right after this list):
a.) remove all columns with any missing values (19 removed)
b.) create dummy variables out of all the columns that I can (31 converted)
c.) label, with integer ranges, the columns that include multiple ordered categories (12 converted)
d.) the dummy + labelled variable creation also removed all of the remaining NaNs, since they represented 0’s in the vast majority of cases
e.) normalize, with `sklearn.preprocessing.StandardScaler()`, all of the non-dummy variables (so even the labelled variables are now centered and scaled)

Now I have 250+ variables to work with.
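For concreteness, the steps above look roughly like this (a sketch only: the exact column lists, ordinal mappings, and the heuristic for picking non-dummy columns are illustrative, not the ones I actually worked out by hand):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv('train.csv')
SalePrice = train.pop('SalePrice')   # keep the target out of the design matrix

# a.) drop the columns that contain missing values
train = train.dropna(axis=1)

# c.) map ordered categorical columns to integer ranges by hand, e.g.:
train['ExterQual'] = train['ExterQual'].map({'Po': 0, 'Fa': 1, 'TA': 2, 'Gd': 3, 'Ex': 4})

# b.) dummy-code the remaining categorical (object) columns
train = pd.get_dummies(train)

# e.) center and scale everything that is not a 0/1 dummy
non_dummy_cols = [c for c in train.columns if train[c].nunique() > 2]
train[non_dummy_cols] = StandardScaler().fit_transform(train[non_dummy_cols])

train_norm = train   # the 250+ variable matrix used below
```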
-
Getting Pymc3’s Model to Run:
First few runs: the 1st run, with uniform priors on all 270 variables, would inevitably crash Pymc3 at some point, no matter how much I tinkered with the sampler settings (`tune`, `draws`, different `init`’s). Consequently, for the next runs, I focused on specifying the most minimally-informative priors that would get the model to consistently run. At some point, after a 20+ hr. model successfully ran using vague `HalfNormal()` priors, the traceplots showed me that each variable wanted to be centered around 0, and so, in the end, I settled on this model (using lucianopaz’s suggestions here on doing so robustly):

```python
import numpy as np
import pymc3 as pm
import theano


def model_factory(X, Y):
    with pm.Model() as less_vars_pymc3_model:
        # Priors
        beta = pm.Normal('beta', mu=0, sd=3, shape=(270, 1))
        intercept = pm.Normal('intercept', mu=0, sd=3)
        std = pm.HalfNormal('std', sd=5)

        # Likelihood
        price = intercept + pm.math.dot(X, beta)
        y_lik = pm.Normal('y_lik', mu=price, sd=std, observed=Y)

    return less_vars_pymc3_model


## Running model now
x_shared = theano.shared(train_norm_retained.values)
y_shared = theano.shared(np.log(SalePrice.values))

with model_factory(x_shared, y_shared) as model:
    trace = pm.sample(cores=4)
```
-
Getting Pymc3’s model to run in humane time
Now I have a model that can run and produce good predictions (versus what Kaggle reports back to me), but the problem is that it still takes 20+ hours for each run, which eliminates any possibility of an iterative, creative model-tinkering and development cycle.
To speed Pymc3 up, so far I have done the following:
a.) run all of this on Google’s Compute Engine, using their Xeon Skylake CPUs (that is all the information Google gives, but I imagine these CPUs are relatively fast)
b.) scaled all of my variables using sklearn’s `StandardScaler` as described above
c.) used many different `init=` configurations in `pm.sample()`, including `init='advi+adapt_diag', n_init=20000`, which manages to speed the model up to running in ~4 hours (see the sketch right after this list)
d.) tried using `exoplanet`’s sampler in 2.2.1 (with all of the above settings), with no discernible speedups

Without the model running in, let’s say, 20 minutes or less, I cannot imagine doing the careful diagnostic checks, residual analysis, and the other things that Data Science recommends to craft good models.
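For reference, the fastest configuration from (c) looks roughly like this (the `draws`/`tune` values here are placeholders, not numbers I have tuned):

```python
with model_factory(x_shared, y_shared) as model:
    # ADVI initialization followed by NUTS; this is the ~4 hr. configuration
    trace = pm.sample(
        draws=1000, tune=1000,
        init='advi+adapt_diag', n_init=20000,
        cores=4)
```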
Deciding to cut my losses to improve run-time, I decided to select a subset of my most important variables and use only those (I eventually settled on just 10), using
`skl.feature_selection.GenericUnivariateSelect(score_func=skl.feature_selection.mutual_info_regression, mode='k_best', param=10)`
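In full, that selection step looks roughly like this (a sketch; `train_norm` is the full prepared matrix from above and the target is the log of `SalePrice`):

```python
import numpy as np
import pandas as pd
from sklearn import feature_selection as fs

# keep the 10 columns with the highest mutual information with log(SalePrice)
selector = fs.GenericUnivariateSelect(
    score_func=fs.mutual_info_regression, mode='k_best', param=10)

retained = selector.fit_transform(train_norm, np.log(SalePrice.values))
train_norm_retained = pd.DataFrame(
    retained, columns=train_norm.columns[selector.get_support()])
```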
After doing this, my model still runs in the 14+ hr. range.
Conclusion:
Am I missing some crucial optimizations that need to be done in order to run a model such as mine? I would very, very much appreciate any insights from the community on how best to tackle such a real-life application of Pymc3.
Thank you all for your time.
My specs:
(Python installed from here)
Python 3.6.8 |Intel Corporation| (default, Mar 1 2019, 00:10:45)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux
numpy==1.16.1
Theano==1.0.4
pymc3==3.6
exoplanet==0.1.5
scikit-learn==0.20.3
scipy==1.2.1
I have attached CSVs of the final prepared data: one with all the variables and one with just the 10 retained variables; the test data receives the same treatment.
train_norm.csv (1.9 MB)
train_norm_retained.csv (279.3 KB)