Lesson 4.9: Multicollinearity and the VIF

We now begin 'Act III' of our module: real-world diagnostics. This lesson tackles Multicollinearity, a common disease of regression models where predictor variables are highly correlated. We'll explore why this wrecks our inference and learn to diagnose its severity with the Variance Inflation Factor (VIF).

Part 1: The Problem of Redundant Information

Multicollinearity occurs when two or more of your predictor variables ($X_j$) are highly correlated with each other. They are essentially telling the same story.

The 'Two Tour Guides' Analogy

Imagine you're on a tour trying to understand a city's history ($Y$). You have two tour guides ($X_1$ and $X_2$). But these guides are standing right next to each other and whispering to each other before they speak. They say almost the exact same thing.

When they're done, you are asked: "Which guide was more helpful?" It's an impossible question to answer. Because their information was so redundant, you can't disentangle their individual contributions.

This is the core problem of multicollinearity. The OLS algorithm can't confidently attribute the effect on $Y$ to any single predictor because they all move together.

Perfect Multicollinearity

One predictor is a perfect linear function of another (e.g., $X_2 = 2X_1$). This violates a core Gauss-Markov assumption: the design matrix must have full column rank. The matrix $\mathbf{X}^T\mathbf{X}$ cannot be inverted, and the OLS formula $\bm{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ fails completely.
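To make this concrete, here is a minimal sketch (NumPy only, with made-up data) in which the second predictor is exactly twice the first. The design matrix loses a rank, and inverting $\mathbf{X}^T\mathbf{X}$ fails:

```python
# A perfectly collinear column (x2 = 2 * x1) makes X rank deficient,
# so (X'X) is singular and the textbook OLS formula cannot be applied.
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = 2 * x1                                    # perfect linear dependence
X = np.column_stack([np.ones(n), x1, x2])      # intercept, x1, x2

print("rank of X:", np.linalg.matrix_rank(X))  # 2 instead of 3

try:
    np.linalg.inv(X.T @ X)                     # (X'X)^{-1} does not exist
except np.linalg.LinAlgError as err:
    print("inversion failed:", err)            # typically 'Singular matrix'
```

Most statistical packages guard against this case by dropping one of the offending columns, falling back to a pseudo-inverse, or refusing to fit the model outright.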

High (Imperfect) Multicollinearity

Predictors are highly, but not perfectly, correlated. This is far more common and insidious. The model runs, but the results are unreliable. Our focus is on diagnosing and fixing this problem.

Part 2: The Mathematical Consequences

The Consequence of Multicollinearity

Under high (imperfect) multicollinearity, the OLS estimator $\bm{\hat{\beta}}$ remains **unbiased** and **consistent**. In fact, no Gauss-Markov assumption is violated, so OLS is technically still BLUE; the catch is that "best" only means lowest variance *among linear unbiased estimators*, and that variance can still be enormous.

The variance of the OLS estimates becomes hugely inflated.

This inflated variance has disastrous effects on our inference:

  • Large Standard Errors: Since $\text{se}(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}$, our standard errors become enormous.
  • Insignificant t-statistics: The t-statistic, $t = \hat{\beta}_j / \text{se}(\hat{\beta}_j)$, collapses towards zero because the denominator is so large. This leads us to incorrectly conclude that important variables are statistically insignificant (a Type II error).
  • Unstable Coefficients: The estimates become extremely sensitive to small changes in the data. Adding or removing a few data points can cause the coefficients to swing wildly in magnitude or even flip signs, making the model untrustworthy. (A small simulation illustrating these effects is sketched right after this list.)
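A small simulation makes these consequences tangible. The sketch below is illustrative only: it assumes NumPy and statsmodels are available, and the data-generating process and coefficients are invented. It fits the same model with uncorrelated and then highly correlated predictors so the reported standard errors can be compared directly.

```python
# Same true model, two levels of predictor correlation: point estimates stay
# roughly on target, but the standard errors balloon when the predictors are
# highly correlated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200

def fit_with_correlation(rho):
    """Simulate y = 1 + 2*x1 + 2*x2 + noise with corr(x1, x2) = rho, then fit OLS."""
    X = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    y = 1.0 + 2.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=2.0, size=n)
    return sm.OLS(y, sm.add_constant(X)).fit()

for rho in (0.0, 0.95):
    res = fit_with_correlation(rho)
    print(f"rho = {rho}: coefficients = {np.round(res.params, 2)}, "
          f"std errors = {np.round(res.bse, 2)}")
# Expected pattern: similar coefficients in both runs, but much larger standard
# errors (and hence smaller t-statistics) in the rho = 0.95 run.
```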

Part 3: Diagnosis with the Variance Inflation Factor (VIF)

How do we measure this variance inflation? The Variance Inflation Factor (VIF) gives us the exact answer.

Derivation of the VIF Formula

For a model with multiple predictors, the true variance of a single coefficient $\hat{\beta}_j$ can be proven to be:

$$\text{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\text{TSS}_j \cdot (1 - R_j^2)}$$

where:
  • $\text{TSS}_j = \sum_i (X_{ij} - \bar{X}_j)^2$ is the total sum of squares for predictor $X_j$.
  • $R_j^2$ is the R-squared from an **auxiliary regression** where we regress $X_j$ on all *other* predictors in the model.

The variance of that same coefficient in a *simple* regression (just $Y$ on $X_j$) would have been $\frac{\sigma^2}{\text{TSS}_j}$. The VIF is the ratio of these two variances:

$$\text{VIF}_j = \frac{\text{Var}(\hat{\beta}_j)_{\text{MLR}}}{\text{Var}(\hat{\beta}_j)_{\text{SLR}}} = \frac{1}{1 - R_j^2}$$
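Since the VIF is defined entirely through the auxiliary regression, you can compute it by hand. The sketch below is a minimal illustration (NumPy and statsmodels assumed; the data are synthetic and the helper name `vif_manual` is just for this example): regress each predictor on the others and plug the auxiliary $R_j^2$ into $1/(1 - R_j^2)$.

```python
# Compute VIF_j directly from its definition: regress X_j on the other
# predictors and plug the auxiliary R^2 into 1 / (1 - R_j^2).
import numpy as np
import statsmodels.api as sm

def vif_manual(X, j):
    """VIF of column j of a predictor matrix X (no intercept column)."""
    others = np.delete(X, j, axis=1)
    aux = sm.OLS(X[:, j], sm.add_constant(others)).fit()  # auxiliary regression
    return 1.0 / (1.0 - aux.rsquared)

# Synthetic predictors: x2 is strongly related to x1, x3 is independent noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])

print([round(vif_manual(X, j), 2) for j in range(X.shape[1])])
# x1 and x2 should show clearly inflated VIFs; x3 should sit near 1.
```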

The Variance Inflation Factor (VIF)

$$\text{VIF}_j = \frac{1}{1 - R_j^2}$$

The VIF for predictor $X_j$ tells you how many times larger its variance is than it would be if it were uncorrelated with the other predictors.

Interpreting the VIF

The logic is simple. The $R_j^2$ from the auxiliary regression measures how much of $X_j$'s variance is explained by the *other predictors*.

  • If $R_j^2 = 0$ (no collinearity), $\text{VIF}_j = 1/(1-0) = 1$. There is no variance inflation.
  • If $R_j^2 = 0.9$ (severe collinearity), $\text{VIF}_j = 1/(1-0.9) = 10$. The variance of $\hat{\beta}_j$ is **10 times larger** than it should be. Its standard error is $\sqrt{10} \approx 3.16$ times larger.

Common Rule of Thumb:

A VIF greater than 5 or 10 is a sign of problematic multicollinearity.
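In practice you rarely code the auxiliary regressions yourself; statsmodels provides a `variance_inflation_factor` helper. Here is a minimal sketch on the same kind of synthetic, illustrative data as above, flagging anything over the common threshold of 5:

```python
# The built-in helper does the same auxiliary-regression calculation for one
# column of the design matrix at a time.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.3 * rng.normal(size=300)           # strongly related to x1
x3 = rng.normal(size=300)                            # unrelated
Xc = sm.add_constant(np.column_stack([x1, x2, x3]))  # design matrix with intercept

for name, j in zip(["x1", "x2", "x3"], range(1, Xc.shape[1])):  # skip the intercept
    v = variance_inflation_factor(Xc, j)
    flag = "  <-- above the common threshold of 5" if v > 5 else ""
    print(f"{name}: VIF = {v:.2f}{flag}")
```

Note that the helper expects the full design matrix (including the intercept column) plus an integer column index; the VIF reported for the intercept itself is usually ignored.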

Part 4: Solutions and Remedies

How to Fix Multicollinearity

If you detect a high VIF for a variable, what can you do?

  1. Drop one of the variables. If two variables are highly correlated (e.g., `height` and `weight`), they are redundant. Pick the one that is more theoretically relevant and remove the other.
  2. Combine the variables. Create a new, composite variable. For example, instead of using `household_income` and `number_of_earners` separately, create a single feature like `income_per_earner`.
  3. Use a different modeling technique. This is the ML approach. Techniques like **Ridge Regression (L2 Regularization)** are specifically designed to perform well even in the presence of high multicollinearity by adding a penalty term that shrinks correlated coefficients (a minimal sketch follows this list).
  4. Do nothing. If your goal is purely **prediction** and not **inference** (interpreting coefficients), and your model is performing well on a test set, you might choose to live with the multicollinearity. The inflated variance of individual coefficients doesn't necessarily harm the model's overall predictive power.
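As a taste of option 3, here is a minimal sketch (scikit-learn assumed available; the data, the near-duplicate predictor, and the penalty strength `alpha=1.0` are all illustrative choices, not a recipe) comparing ordinary least squares with ridge regression on two nearly identical predictors:

```python
# Two nearly identical predictors: plain OLS can split their shared effect into
# large offsetting coefficients, while the ridge penalty keeps them moderate.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)            # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)   # true coefficients: 3 and 3

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)             # alpha = L2 penalty strength

print("OLS coefficients:  ", np.round(ols.coef_, 2))    # often erratic / offsetting
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # pulled toward similar values
```

The exact numbers depend on the random seed, but the qualitative pattern is the point: the L2 penalty discourages the large, offsetting coefficients that plain OLS can produce when predictors are nearly collinear.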

What's Next? When the Noise isn't Constant

We've now diagnosed and learned how to fix a critical problem with our predictors, $\mathbf{X}$. We've ensured that the matrix $\mathbf{X}^T\mathbf{X}$ is well-behaved.

But what if the problem isn't with $\mathbf{X}$, but with our error term, $\bm{\epsilon}$? The Gauss-Markov theorem required that the variance of the errors be constant for all observations.

In the next lesson, we will tackle the extremely common and important violation of this assumption: **Heteroskedasticity**.