Lesson 4.3: Upgrading to a 3D World: Multiple Linear Regression (MLR)

We now move beyond a single cause and effect. This lesson upgrades our engine from Simple to Multiple Linear Regression, allowing us to model an outcome using several predictors at once. We'll discover the core power of MLR—the ability to 'control for' other variables—and master the essential matrix algebra that makes it possible.

Part 1: The Power of "Controlling For" Variables

In the real world, outcomes rarely have a single cause. A stock's return is driven by the market, interest rates, and sector performance. An exam score is driven by study hours, attendance, and prior GPA.

Multiple Linear Regression (MLR) allows us to isolate the effect of one variable while holding others constant. This is its superpower.

The Core Intuition: The Ice Cream & Shark Attack Problem

Imagine you collect data and find a strong positive correlation between monthly ice cream sales ($X_1$) and shark attacks ($Y$). A simple regression would show $\hat{\beta}_1 > 0$ and be statistically significant.

  • Naive Conclusion: "Ice cream causes shark attacks!"
  • The Problem: There is a lurking, unobserved variable: **average monthly temperature** ($X_2$). When it's hot, more people buy ice cream, AND more people go swimming.
  • The MLR Solution: By including both variables in the model, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Ice Cream}) + \hat{\beta}_2(\text{Temp})$, the algorithm can see that once temperature is accounted for, the effect of ice cream sales on shark attacks becomes zero ($\hat{\beta}_1 \approx 0$).

MLR allows us to untangle these complex relationships by estimating the effect of each variable *ceteris paribus*—all other things being equal.
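To see this untangling in action, here is a minimal Python sketch (NumPy only) on fabricated data. The variable names, sample size, and coefficients are illustrative assumptions: temperature drives both ice cream sales and shark attacks, and ice cream has no direct effect of its own.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200

# Fabricated data: temperature drives BOTH ice cream sales and shark attacks.
temp = rng.normal(25, 5, n)                    # average monthly temperature
ice_cream = 10 * temp + rng.normal(0, 20, n)   # sales rise with temperature
sharks = 0.5 * temp + rng.normal(0, 2, n)      # attacks rise with temperature only

def ols(X, y):
    """OLS coefficients for a design matrix X that already has an intercept column."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)

# Simple regression: sharks ~ ice cream  ->  the slope looks positive and "real"
X_slr = np.column_stack([ones, ice_cream])
print("SLR ice-cream coefficient:", ols(X_slr, sharks)[1])

# Multiple regression: sharks ~ ice cream + temperature  ->  the slope collapses toward 0
X_mlr = np.column_stack([ones, ice_cream, temp])
print("MLR ice-cream coefficient:", ols(X_mlr, sharks)[1])
```

With temperature in the model, the ice-cream coefficient should land near zero, which is exactly the "controlling for" behaviour described above.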

Part 2: The Model in Matrix Form

As we add more variables, doing the calculus one scalar equation at a time becomes unmanageable. We need the language of linear algebra. The MLR model is written as:

$$\mathbf{y} = \mathbf{X}\bm{\beta} + \bm{\epsilon}$$

Let's visually deconstruct this for a model with $n$ data points and $k$ predictors.

Deconstructing the Matrices
$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}~(n \times 1)} = \underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix}}_{\mathbf{X}~(n \times (k+1))} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}}_{\bm{\beta}~((k+1) \times 1)} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\bm{\epsilon}~(n \times 1)}$$
  • $\mathbf{y}$: The vector of your outcome variable.
  • $\mathbf{X}$: The **Design Matrix**. Each row is an observation (a student, a day). Each column is a variable (a "feature"). The first column is always 1s to represent the intercept (see the sketch after this list).
  • $\bm{\beta}$: The vector of parameters we want to estimate.
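As a concrete illustration, the sketch below builds the design matrix for a hypothetical exam-score model with $n = 5$ students and $k = 2$ predictors (study hours and prior GPA); the names and numbers are made up.

```python
import numpy as np

# Hypothetical predictors for n = 5 students.
study_hours = np.array([2.0, 5.5, 1.0, 8.0, 4.0])
prior_gpa   = np.array([3.1, 3.8, 2.5, 3.9, 3.4])

n = len(study_hours)

# Design matrix X: a leading column of 1s for the intercept, then one column per predictor.
X = np.column_stack([np.ones(n), study_hours, prior_gpa])

print(X.shape)   # (5, 3)  ->  n x (k + 1), with k = 2 predictors
print(X)
```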

Part 3: The Master Formula and Its Interpretation

The beauty of the matrix form is that the OLS solution we derived in the last lesson works perfectly, unchanged. The formula doesn't care if $\mathbf{X}$ has 2 columns or 2000 columns.

The OLS Estimator for MLR

$$\hat{\bm{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
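The formula translates almost line-for-line into code. The sketch below is a minimal NumPy version on made-up exam data; note that it solves the normal equations with np.linalg.solve instead of explicitly inverting $\mathbf{X}^T\mathbf{X}$, which is the numerically safer habit.

```python
import numpy as np

def ols_beta_hat(X, y):
    """Compute beta_hat = (X^T X)^(-1) X^T y.

    Solving the normal equations (X^T X) beta = X^T y with np.linalg.solve
    avoids forming the explicit inverse.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Made-up example: exam score ~ intercept + study hours + prior GPA.
X = np.array([[1.0, 2.0, 3.1],
              [1.0, 5.5, 3.8],
              [1.0, 1.0, 2.5],
              [1.0, 8.0, 3.9],
              [1.0, 4.0, 3.4]])
y = np.array([65.0, 88.0, 52.0, 95.0, 78.0])

print(ols_beta_hat(X, y))   # [beta_0_hat, beta_1_hat, beta_2_hat]
```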
The Crucial Interpretation of a Coefficient, $\hat{\beta}_j$

In MLR, the interpretation of a coefficient gains a critical new phrase:

"β^j\hat{\beta}_j is the estimated change in YY for a one-unit increase in XjX_j, **holding all other variables in the model constant**."

This is the mathematical equivalent of "controlling for" the other factors.
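A quick sanity check of that phrase, using hypothetical fitted coefficients: hold every other column of a row fixed, tick $X_j$ up by one unit, and the prediction moves by exactly $\hat{\beta}_j$.

```python
import numpy as np

# Hypothetical fitted coefficients: [intercept, study hours, prior GPA].
beta_hat = np.array([10.0, 4.5, 12.0])

x_a = np.array([1.0, 3.0, 3.2])   # a student: 3 study hours, 3.2 prior GPA
x_b = np.array([1.0, 4.0, 3.2])   # one extra study hour, GPA held constant

# The change in the prediction equals beta_1_hat exactly.
print(x_b @ beta_hat - x_a @ beta_hat)   # 4.5
```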

Part 4: The Problem with R² in MLR

In MLR, the standard $R^2 = 1 - \text{SSR}/\text{TSS}$ is a flawed metric. Why? Adding *any* variable to the model, even a column of random garbage, will **never cause the R² to decrease**. At worst, the SSR stays the same; usually, the SSR falls a tiny bit just by chance, pushing R² up.

This encourages overfitting. We need a metric that penalizes us for adding useless variables. That metric is **Adjusted R²**.

Definition: Adjusted R-Squared

Adjusted R² modifies the regular R² by dividing the SSR and the TSS by their respective degrees of freedom.

$$\text{Adjusted } R^2 = 1 - \frac{\text{SSR}/(n-k-1)}{\text{TSS}/(n-1)}$$
  • $n$ is the number of observations.
  • $k$ is the number of predictors.

Adding a new variable increases $k$, which increases the penalty term. If the new variable doesn't reduce SSR by a large enough amount to offset the penalty, the Adjusted R² will actually go **down**. This makes it a much more honest measure of a model's explanatory power.
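The sketch below shows the contrast on simulated data: one genuine predictor, then one added column of pure noise. The setup and seed are illustrative assumptions; on a typical run R² creeps up while Adjusted R² slips down.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Simulated data: y depends on one genuine predictor plus noise.
x_real = rng.normal(size=n)
y = 2.0 + 3.0 * x_real + rng.normal(size=n)

def r2_and_adjusted(X, y):
    """Return (R^2, Adjusted R^2) for an OLS fit on X (intercept column included)."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta_hat
    ssr = residuals @ residuals               # sum of squared residuals
    tss = np.sum((y - y.mean()) ** 2)         # total sum of squares
    n_obs = len(y)
    k = X.shape[1] - 1                        # predictors, excluding the intercept
    r2 = 1 - ssr / tss
    adj_r2 = 1 - (ssr / (n_obs - k - 1)) / (tss / (n_obs - 1))
    return r2, adj_r2

ones = np.ones(n)
X_real = np.column_stack([ones, x_real])
X_plus_junk = np.column_stack([ones, x_real, rng.normal(size=n)])  # add pure noise

print("genuine predictor only:", r2_and_adjusted(X_real, y))
print("plus a junk predictor: ", r2_and_adjusted(X_plus_junk, y))
```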

What's Next? The Derivation

We've defined the MLR model and seen the "master formula" that solves it. The structure is identical to the one we proved in the last lesson.

In the next lesson, we will briefly confirm that the matrix calculus derivation for MLR is indeed the same as for SLR, solidifying our understanding of the Normal Equations and the power of the matrix approach.