Real-life example on Housing price regression: advice requested

Multicollinearity occurs when two or more independent variables (i.e. columns of X) are linearly dependent – that is, you can predict (almost exactly) one of the variables from all of the others. The most typical case of this is combining an intercept with a dummy-encoded categorical variable, so you have an X matrix that looks like

\left(\begin{array}{cccc} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ \end{array}\right)

Clearly x_1 = x_2 + x_3 + x_4 is an invariant. This is a problem because if \beta_1, \beta_2, \beta_3, \beta_4 is a solution to the regression

y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4

then so is

y = 0x_1 + (\beta_2 + \beta_1)x_2 + (\beta_3 + \beta_1)x_3 + (\beta_4 + \beta_1)x_4

Basically you can replace \beta_1 with any value, and adjust \beta_2, \beta_3, \beta_4 so that the fit is unchanged. This means that there's an entire free dimension of solutions, and that causes problems for the sampler. Empirically, fixing invariances not only fixes problems with posterior convergence, but also helps the sampler run much faster.
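To make the free dimension concrete, here's a minimal numpy sketch using the design matrix from above and made-up y values and coefficients: two different coefficient vectors, related by the invariant, produce identical fitted values, so no amount of data can tell them apart.

```python
import numpy as np

# The design matrix from the post: intercept column plus a
# fully dummy-encoded 3-level categorical.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

# Two coefficient vectors related by the invariant x_1 = x_2 + x_3 + x_4:
# beta_b zeroes out beta_1 and folds it into the dummy coefficients.
beta_a = np.array([0.5, 0.6, 1.55, 2.6])
beta_b = np.array([0.0, 1.1, 2.05, 3.1])

# Identical fitted values -> the likelihood cannot distinguish them,
# and the sampler wanders along this free direction.
print(np.allclose(X @ beta_a, X @ beta_b))  # True
```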

The more such invariants there are, the worse the multicollinearity is. In particular, the nullspace of X tells you exactly what these invariants are; and the nullspace of X is exactly the span of the eigenvectors of X^TX with zero eigenvalue. If you're familiar with the concept of matrix rank, the rank of X is equal to the number of columns of X, minus the dimension of its nullspace (or equivalently, the number of nonzero eigenvalues of X^TX). If the rank of X is anything less than the number of columns, X is rank-deficient (often loosely called "ill-conditioned"); and ill-conditioned matrices are a pestilence on all forms of numerical analysis.
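Here is the same eigenvalue diagnostic run on the small example matrix above (your housing X would show ~40 near-zero eigenvalues instead of one): the zero eigenvalue of X^TX reveals the rank deficiency, and its eigenvector spells out the invariant.

```python
import numpy as np

# Same rank-deficient design matrix as above.
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
], dtype=float)

# eigh returns eigenvalues in ascending order; the smallest is ~0,
# so rank(X) = 4 columns - 1 zero eigenvalue = 3.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
print(np.round(eigvals, 6))
print(np.linalg.matrix_rank(X))  # 3

# The eigenvector for the zero eigenvalue encodes the invariant:
# proportional to (1, -1, -1, -1), i.e. x_1 - x_2 - x_3 - x_4 = 0.
null_vec = eigvecs[:, 0]
print(np.round(null_vec / null_vec[0], 6))  # [ 1. -1. -1. -1.]
```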

So what the eigenvalue calculation above does is tell you that there are about 40 linear invariants within X. The most common cause of this is dummy-encoding multiple categorical variables and forgetting to drop one category from each; then every block of dummies sums to the all-ones vector, giving invariants of the form:

C_{11} + \dots + C_{1p} = \vec{\mathbf{1}} = C_{21} + \dots + C_{2q}
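You can check for this directly with a rank computation. A sketch with hypothetical housing-style columns (the column names here are made up): keeping every dummy level makes the blocks sum to the same ones vector and drops the rank; dropping one reference level per categorical restores full column rank.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two categorical columns, as in a housing dataset.
df = pd.DataFrame({
    "zone":  ["A", "A", "B", "B", "C", "C"],
    "style": ["ranch", "colonial", "ranch", "colonial", "ranch", "colonial"],
})

# Keeping *all* dummy levels: each block of dummies sums to the ones
# vector, so the blocks are linearly dependent -> rank 4 < 5 columns.
X_bad = pd.get_dummies(df).to_numpy(dtype=float)
print(X_bad.shape, np.linalg.matrix_rank(X_bad))   # (6, 5) 4

# Dropping one reference level per categorical removes the invariant.
X_good = pd.get_dummies(df, drop_first=True).to_numpy(dtype=float)
print(X_good.shape, np.linalg.matrix_rank(X_good))  # (6, 3) 3
```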
