Real-life example on Housing price regression: advice requested

So, having trawled through these forums for the last few months, and having asked a few questions and received good answers, I decided to try to put everything I had learned to use and fully tackle Kaggle's Housing Prices challenge - because isn't a Kaggle competition the prime place to test one's skills? Below, I will list all I have done so far, describe the computational problem I'm dealing with, and I would appreciate any comments on whether my difficulty is standard fare for Bayesian methods, or whether I am making mistakes somewhere along the way.

  1. Introduction
    This challenge is to predict housing prices from a dataset of 81 variables and a sample size of 1460 points. Loading in the training data, nothing looks out of the ordinary: the columns describe every possible aspect of a piece of real estate.

  2. Data Preprocessing
    I next did the following:

    a.) remove all columns with any missing values (19 removed)
    b.) create dummy variables out of all the columns that I can (31 converted)
    c.) label, with integer ranges, the columns that include multiple ordered categories (12 converted)
    d.) creating the dummy and labelled variables also removed all of the remaining NaN's, since in the vast majority of cases they represented 0's
    e.) normalize all of the non-dummy variables with sklearn.preprocessing.StandardScaler() (so even the labelled variables are now centered and scaled).

    Now I have 250+ variables to work with :sunglasses:.
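
    Roughly, the preprocessing was something like the sketch below (the names cols_to_drop, categorical_cols, ordinal_cols, orderings and non_dummy_cols are placeholders, not the exact lists I used):

     import pandas as pd
     from sklearn.preprocessing import StandardScaler

     train = pd.read_csv('train.csv')

     # (a) drop the 19 columns I removed because of missing values (placeholder list)
     train = train.drop(columns=cols_to_drop)

     # (b) one-hot encode the purely categorical columns (placeholder list)
     train = pd.get_dummies(train, columns=categorical_cols)

     # (c) map the ordered categories to integer labels (placeholder dict of orderings)
     for col in ordinal_cols:
         train[col] = train[col].map(orderings[col])

     # (e) center and scale everything that is not a dummy column
     train[non_dummy_cols] = StandardScaler().fit_transform(train[non_dummy_cols])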

  3. Getting Pymc3’s Model to Run:
    First few runs: on the 1st run, with uniform priors on all 270 variables, Pymc3 would inevitably crash at some point, no matter how much I tinkered with the sampler settings (tune, draws, different init's). Consequently, for the next runs, I focused on specifying the most minimally-informative priors that would get the model to run consistently. At some point, after a 20+ hr. run completed successfully using vague HalfNormal() priors, the traceplots showed me that each variable wanted to be centered around 0, and so, in the end, I settled on this model (using lucianopaz's suggestions here on doing so robustly):

     import numpy as np
     import pymc3 as pm
     import theano

     def model_factory(X, Y):
         with pm.Model() as less_vars_pymc3_model:

             # Priors on the coefficients and intercept
             # (shape should match the number of columns in X)
             beta = pm.Normal('beta', mu=0, sd=3, shape=(270, 1))
             intercept = pm.Normal('intercept', mu=0, sd=3)

             # Prior on the observation noise
             std = pm.HalfNormal('std', sd=5)

             # Likelihood: linear predictor on the log sale price
             price = intercept + pm.math.dot(X, beta)
             y_lik = pm.Normal('y_lik', mu=price, sd=std, observed=Y)

         return less_vars_pymc3_model

     ## Running the model
     x_shared = theano.shared(train_norm_retained.values)
     y_shared = theano.shared(np.log(SalePrice.values))

     with model_factory(x_shared, y_shared) as model:
         trace = pm.sample(cores=4)
    
  4. Getting Pymc3's Model to Run in a Humane Amount of Time

    Now I have a model that runs and produces good predictions (judging by what Kaggle reports back to me), but the problem is that each run still takes 20+ hours, which eliminates any possibility of an iterative, creative model-tinkering and development cycle.

    To speed Pymc3 up, so far I have done the following:

    a.) run all of this on Google's Compute Engine, using their Xeon Skylake CPUs (that is all the information Google gives, but I imagine these CPUs are relatively fast)
    b.) scaled all of my variables with sklearn's StandardScaler as described above
    c.) used many different init= configurations in pm.sample(), including init='advi+adapt_diag', n_init=20000, which gets the model down to running in ~4 hours (the call is sketched just below this list)
    d.) tried using exoplanet's sampler in 2.2.1 (with all the above settings), with no discernible speedups
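
    For reference, the fastest configuration I've found so far looks roughly like this (the draws/tune numbers here are just illustrative):

     with model_factory(x_shared, y_shared) as model:
         trace = pm.sample(draws=1000, tune=1000, cores=4,
                           init='advi+adapt_diag', n_init=20000)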

    Without the model running in, let's say, 20 minutes or less, I cannot imagine doing careful diagnostic checks, residual analysis, and all the other things good data-science practice recommends for crafting models.

    To cut my losses on run-time, I decided to select a subset of my most important variables and use only those (I eventually settled on just 10), using

    skl.feature_selection.GenericUnivariateSelect(score_func=skl.feature_selection.mutual_info_regression, mode='k_best', param=10)

    After doing this, my model still runs in the 14+ hr. range.
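
    For completeness, the selection step is applied roughly like this (train_norm is the full preprocessed frame from above, SalePrice the raw target; the variable names here are mine):

     from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_regression

     selector = GenericUnivariateSelect(
         score_func=mutual_info_regression, mode='k_best', param=10)
     X_retained = selector.fit_transform(train_norm.values, np.log(SalePrice.values))
     retained_cols = train_norm.columns[selector.get_support()]   # the 10 kept variables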

Conclusion:

Am I missing some crucial optimizations that need to be done in order to run a model such as mine? I would extremely, extremely appreciate any insights from the community on how best to tackle such a real-life application of Pymc3.

Thank you all for your time.

My specs:

(Python installed from here)

Python 3.6.8 |Intel Corporation| (default, Mar  1 2019, 00:10:45)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux

numpy==1.16.1
Theano==1.0.4
pymc3==3.6
exoplanet==0.1.5 
scikit-learn==0.20.3
scipy==1.2.1

I have attached CSVs of the final prepared data, one with all the variables and one with just the 10 retained variables; the test data receives the same treatment.
train_norm.csv (1.9 MB)
train_norm_retained.csv (279.3 KB)


That was going to be my suggestion - I would try comparing the sampling result with the (FullRank)ADVI approximation; if it is close enough, I would just do VI instead of sampling.
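
Roughly, that comparison could look like this (using the model factory from the original post; the number of fit iterations is arbitrary):

with model_factory(x_shared, y_shared) as model:
    approx = pm.fit(n=50000, method='fullrank_advi')   # full-rank ADVI
    vi_trace = approx.sample(2000)
    nuts_trace = pm.sample(1000, tune=1000, cores=4)   # NUTS for comparison

# eyeball the marginal summaries against each other
pm.summary(vi_trace)
pm.summary(nuts_trace)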

Looks great! I’m often troubled by the disparity between point-estimate approaches (optimization) and uncertainty-propagation approaches (sampling). As great as NUTS is, the standard for large-scale problems still appears to be some kind of lovingly hand-crafted MCMC.

With that said, one of the things one needs to worry about for linear regression is multicollinearity. The statistical effect is variance inflation – but this corresponds to

  1. More iterations needed for maximum-likelihood approaches
  2. Slower sampling for Bayesian approaches

A quick view of the training data suggests that there is a roughly 40-dimensional subspace which is not constrained by the observed data:

import numpy as np
import pandas as pd

dat = pd.read_csv('~/Downloads/train_norm.csv')
dat_cov = np.cov(dat.values.T)      # covariance of the design matrix
eig = np.linalg.eigh(dat_cov)       # (eigenvalues, eigenvectors)
np.sum(eig[0] < 1e-15)              # number of (numerically) zero eigenvalues

As such, a more informative prior (or eliminating this source of multicollinearity) could help.
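
To see which columns are involved, you could look at the eigenvectors attached to the (near-)zero eigenvalues, e.g.:

evals, evecs = np.linalg.eigh(dat_cov)
null_vecs = evecs[:, evals < 1e-15]                  # unconstrained directions
for j in range(null_vecs.shape[1]):
    top = np.argsort(np.abs(null_vecs[:, j]))[::-1][:5]
    print(dat.columns[top].tolist())                 # columns with the largest loadings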

In addition, the posterior distribution over beta is unimodal, so ADVI (or a low-rank Householder normalizing flow) should suffice in lieu of NUTS. Both are amenable to minibatching, which will speed up convergence.
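
A minibatched ADVI version of the regression might look something like this (batch size and iteration count are arbitrary; X_np and y_np stand for the full design matrix and log-price vector as numpy arrays):

X_mb = pm.Minibatch(X_np, batch_size=128)
y_mb = pm.Minibatch(y_np, batch_size=128)

with pm.Model():
    beta = pm.Normal('beta', mu=0, sd=3, shape=X_np.shape[1])
    intercept = pm.Normal('intercept', mu=0, sd=3)
    std = pm.HalfNormal('std', sd=5)
    mu = intercept + pm.math.dot(X_mb, beta)
    pm.Normal('y_lik', mu=mu, sd=std, observed=y_mb, total_size=len(y_np))

    approx = pm.fit(n=30000, method='advi')
    trace = approx.sample(2000)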

If you want to go crazy, there is work on “minibatch sampling” (i.e. sampling on data subsets and combining posteriors) such as WASP and PIE. I can’t find mention of these in pymc3, however. Maybe they’re in the works?

Lastly, you might take a hybrid approach and treat the low-ranking features (that you discarded) with a Gaussian Process kernel. (I would pre-compute this and add it as a random effect directly):

from sklearn.gaussian_process.kernels import RBF

K = RBF(length_scale=0.5)(X_dropped)                        # kernel over the dropped features
L_K = np.linalg.cholesky(K + 1e-6 * np.eye(K.shape[0]))     # jitter keeps the Cholesky stable
z = pm.Normal('random_eff', mu=0, sd=3, shape=K.shape[0])   # one effect per observation
price = intercept + pm.math.dot(X, beta) + pm.math.dot(L_K, z)

Thank you so much for the informative reply, chartl! The 40-dimensional subspace reading does seem discouraging, so would you give me the quickest primer - just a sentence or two - on what this type of eigenvector calculation reveals?

I used the eigenvector calculation on subsets of my variables as well: starting from 50 variables, it reports only 1 unbounded dimension, and at 30 variables and below, it reports 0 unbounded dimensions.

Would this mean there is some extra hidden complexity within my variables that prevents my model from running faster?

Multicollinearity occurs when two or more independent variables (i.e. columns of X) are linearly dependent – that is, you can predict (almost exactly) one of the variables from all of the others. The most typical case of this is combining an intercept with a dummy-encoded categorical variable, so you have an X matrix that looks like

\left(\begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ \end{array}\right)

Clearly x_1 = x_2 + x_3 + x_4 is an invariant. This is a problem because if \beta_1, \beta_2, \beta_3, \beta_4 is a solution to the regression

y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4

then so is

y = 0x_1 + (\beta_2 + \beta_1)x_2 + (\beta_3 + \beta_1)x_3 + (\beta_4 + \beta_1)x_4

Basically you can replace \beta_1 with any value, and adjust \beta_2, \beta_3, \beta_4 so that the solution is the same. This means that there’s an entire free dimension of solutions, and that causes problems for the sampler. Empirically, fixing invariances not only fixes problems with posterior convergence, but also helps the sampler run much faster.

The more of such invariants there are, the worse the multicollinearity is. In particular, the nullspace of X tells you exactly what these invariants are; and the nullspace of X is spanned by the eigenvectors corresponding to the zero eigenvalues of X^TX. If you’re familiar with the concept of matrix rank, the rank of X is equal to the number of columns of X minus the dimension of its nullspace (or, equivalently, the number of nonzero eigenvalues of X^TX). If the rank of X is anything less than the number of columns, X tends to be called “ill-conditioned”; and ill-conditioned matrices are a pestilence on all forms of numerical analysis.
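
As a quick sanity check with the toy X above:

import numpy as np

X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1]])
print(np.linalg.matrix_rank(X))        # 3, not 4: one column is redundant
print(np.linalg.eigvalsh(X.T @ X))     # the smallest eigenvalue is (numerically) zero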

So what the eigenvalue calculation above does is tell you that there are about 40 linear invariants within X. The most common cause of this is dummifying multiple categorical variables and forgetting to drop one of the categories; otherwise you get an all-ones vector as a common sum:

C_{11} + \dots + C_{1p} = \vec{\mathbf{1}} = C_{21} + \dots + C_{2q}
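
If the dummies came from pandas, that’s a one-argument fix (categorical_cols stands in for whatever list of categorical columns you dummified):

import pandas as pd

# drop one level per categorical so the dummies no longer sum to an all-ones column
train = pd.get_dummies(train, columns=categorical_cols, drop_first=True)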


Right, I should probably mention a couple of methods of fixing the issue.

  1. Drop variables. If you’ve got a case where you can predict one variable from the rest, you don’t actually need it for the regression; so drop it. The interpretation will be “fuzzy” either way.

  2. Feature extraction (typically PCA). If X is not full rank, compute the SVD and use only the components corresponding to the nonzero singular values (see the sketch after this list).

  3. Residualization. Many times multicollinearity isn’t exact, but the invariants hold approximately (say with r^2 > 0.8). Residualization is basically: if X_j \approx X_{\mathcal{I}}\beta for some index set \mathcal{I}, then replace X_j with the residualized version \tilde X_j = X_j - X_\mathcal{I}\beta.

  4. Regularization. By setting the prior on \beta to be tight, the likelihood of “extreme” values for any \beta_i becomes small, thus constraining the volume of the invariant space. In my opinion this is a far inferior solution to (1)-(3); but sometimes for interpretability reasons you need the model to “decide on a tradeoff” between two highly-related variables, and this is one way to force it to do so.
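
For (2), a minimal sklearn sketch could be (X is the design matrix; the variance threshold is arbitrary):

from sklearn.decomposition import PCA

# keep only the components with non-negligible variance; the result is full rank
pca = PCA(n_components=0.999, svd_solver='full')
X_full_rank = pca.fit_transform(X)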
