Real-life example on Housing price regression: advice requested

Multicollinearity occurs when two or more independent variables (i.e. columns of X) are linearly dependent – that is, you can predict (almost exactly) one of the variables from all of the others. The most typical case of this is combining an intercept with a dummy-encoded categorical variable, so you have an X matrix that looks like

\left(\begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ \end{array}\right)

Clearly x_1 = x_2 + x_3 + x_4 is an invariant. This is a problem because if \beta_1, \beta_2, \beta_3, \beta_4 is a solution to the regression

y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4

then so is

y = 0x_1 + (\beta_2 + \beta_1)x_2 + (\beta_3 + \beta_1)x_3 + (\beta_4 + \beta_1)x_4

Basically you can replace \beta_1 with any value, and adjust \beta_2, \beta_3, \beta_4 so that the fit is unchanged. This means that there's an entire free dimension of solutions, and that causes problems for the sampler. Empirically, fixing invariances not only fixes problems with posterior convergence, but also helps the sampler run much faster.
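To make the free dimension concrete, here's a minimal numpy sketch using the design matrix from above and made-up y values and coefficients: two different coefficient vectors, related by the invariant, produce identical fitted values, so no amount of data can tell them apart.

```python
import numpy as np

# The design matrix from the post: intercept column plus a
# fully dummy-encoded 3-level categorical.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

# Two coefficient vectors related by the invariant x_1 = x_2 + x_3 + x_4:
# beta_b zeroes out beta_1 and folds it into the dummy coefficients.
beta_a = np.array([0.5, 0.6, 1.55, 2.6])
beta_b = np.array([0.0, 1.1, 2.05, 3.1])

# Identical fitted values -> the likelihood cannot distinguish them,
# and the sampler wanders along this free direction.
print(np.allclose(X @ beta_a, X @ beta_b))  # True
```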

The more such invariants there are, the worse the multicollinearity is. In particular, the nullspace of X tells you exactly what these invariants are; and the nullspace of X is exactly the span of the eigenvectors of X^TX with zero eigenvalue. If you're familiar with the concept of matrix rank, the rank of X is equal to the number of columns of X, minus the dimension of its nullspace (or equivalently, the number of nonzero eigenvalues of X^TX). If the rank of X is anything less than the number of columns, X is rank-deficient (often loosely called "ill-conditioned"); and ill-conditioned matrices are a pestilence on all forms of numerical analysis.
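Here is the same eigenvalue diagnostic run on the small example matrix above (your housing X would show ~40 near-zero eigenvalues instead of one): the zero eigenvalue of X^TX reveals the rank deficiency, and its eigenvector spells out the invariant.

```python
import numpy as np

# Same rank-deficient design matrix as above.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

# eigh returns eigenvalues in ascending order; the smallest is ~0,
# so rank(X) = 4 columns - 1 zero eigenvalue = 3.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
print(np.round(eigvals, 6))
print(np.linalg.matrix_rank(X))  # 3

# The eigenvector for the zero eigenvalue encodes the invariant:
# proportional to (1, -1, -1, -1), i.e. x_1 - x_2 - x_3 - x_4 = 0.
null_vec = eigvecs[:, 0]
print(np.round(null_vec / null_vec[0], 6))  # [ 1. -1. -1. -1.]
```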

So what the eigenvalue calculation above does is tell you that there are about 40 linear invariants within X. The most common cause of this is dummy-encoding multiple categorical variables and forgetting to drop one category from each; then every block of dummies sums to the all-ones vector, giving invariants of the form:

C_{11} + \dots + C_{1p} = \vec{\mathbf{1}} = C_{21} + \dots + C_{2q}
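You can check for this directly with a rank computation. A sketch with hypothetical housing-style columns (the column names here are made up): keeping every dummy level makes the blocks sum to the same ones vector and drops the rank; dropping one reference level per categorical restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two categorical columns, as in a housing dataset.
df = pd.DataFrame({
    "zone":  ["A", "A", "B", "B", "C", "C"],
    "style": ["ranch", "colonial", "ranch", "colonial", "ranch", "colonial"],
})

# Keeping *all* dummy levels: each block of dummies sums to the ones
# vector, so the blocks are linearly dependent -> rank 4 < 5 columns.
X_bad = pd.get_dummies(df).to_numpy(dtype=float)
print(X_bad.shape, np.linalg.matrix_rank(X_bad))   # (6, 5) 4

# Dropping one reference level per categorical removes the invariant.
X_good = pd.get_dummies(df, drop_first=True).to_numpy(dtype=float)
print(X_good.shape, np.linalg.matrix_rank(X_good))  # (6, 3) 3
```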
