Lesson 2.1: From Simple to Multiple Linear Regression

We now upgrade our engine from a single cause to multiple causes. This lesson introduces Multiple Linear Regression (MLR), allowing us to model an outcome using several predictors at once. We'll discover the core power of MLR—the ability to 'control for' other variables—and master the essential matrix algebra that makes it possible.

Part 1: The Power of 'Controlling For' Variables

In the real world, outcomes rarely have a single cause. A stock's return is driven by the market, interest rates, and sector performance. An exam score is driven by study hours, attendance, and prior GPA.

Multiple Linear Regression (MLR) allows us to isolate the effect of one variable while holding others constant. This is its superpower.

The Core Analogy: The Ice Cream & Shark Attack Problem

Imagine you collect data and find a strong positive correlation between monthly ice cream sales ($X_1$) and shark attacks ($Y$). A simple regression would show $\hat{\beta}_1 > 0$ and be statistically significant.

  • Naive Conclusion: "Ice cream causes shark attacks!"
  • The Problem: There is a lurking, unobserved variable: **average monthly temperature** ($X_2$). When it's hot, more people buy ice cream, AND more people go swimming.
  • The MLR Solution: By including both variables in the model, $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Ice Cream}) + \hat{\beta}_2(\text{Temp})$, the algorithm can see that once temperature is accounted for, the effect of ice cream sales on shark attacks becomes zero ($\hat{\beta}_1 \approx 0$).

MLR allows us to untangle these complex relationships by estimating the effect of each variable *ceteris paribus*—all other things being equal.
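
A quick simulation makes this concrete. The sketch below is a minimal illustration, not data from the lesson: the variable names (`temp`, `ice_cream`, `sharks`) and the numbers are hypothetical, and OLS is solved directly from the normal equations with NumPy. Because temperature drives both other series, the simple regression finds a positive slope for ice cream, while the multiple regression drives that slope toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: temperature is the common cause (the confounder).
temp = rng.normal(25, 5, n)                    # average monthly temperature
ice_cream = 2.0 * temp + rng.normal(0, 3, n)   # sales rise with heat
sharks = 0.5 * temp + rng.normal(0, 2, n)      # attacks rise with heat, NOT with ice cream

def ols(features, y):
    """OLS coefficients [intercept, slopes...] via the normal equations."""
    X = np.column_stack([np.ones(len(y))] + features)
    return np.linalg.solve(X.T @ X, X.T @ y)

# Simple regression: sharks ~ ice_cream  -> spurious positive slope
print(ols([ice_cream], sharks))

# Multiple regression: sharks ~ ice_cream + temp -> ice cream slope near zero
print(ols([ice_cream, temp], sharks))
```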

Part 2: The Model in Matrix Form

As we add more variables, the simple slope-intercept formula becomes unwieldy. We must use the compact and powerful language of Linear Algebra. The MLR model is written as:

$$\mathbf{y} = \mathbf{X}\bm{\beta} + \bm{\epsilon}$$

Let's visually deconstruct this for a model with $n$ data points and $k$ predictors.

Deconstructing the Matrices
$$\underbrace{\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}}_{\mathbf{y}~(n \times 1)} = \underbrace{\begin{bmatrix} 1 & x_{11} & \dots & x_{1k} \\ 1 & x_{21} & \dots & x_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix}}_{\mathbf{X}~(n \times (k+1))} \underbrace{\begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}}_{\bm{\beta}~((k+1) \times 1)} + \underbrace{\begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}}_{\bm{\epsilon}~(n \times 1)}$$
  • $\mathbf{y}$: The vector of your outcome variable.
  • $\mathbf{X}$: The **Design Matrix**. Each row is an observation (a student, a day). Each column is a variable (a "feature"). The first column is always 1s to represent the intercept.
  • $\bm{\beta}$: The vector of parameters we want to estimate.
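
To make the design matrix tangible, here is a small hand-built example (the feature names and values are hypothetical, assuming NumPy): each row is one observation, and the leading column of ones carries the intercept $\beta_0$.

```python
import numpy as np

# Hypothetical predictors for n = 4 observations and k = 2 features.
study_hours = np.array([2.0, 5.5, 1.0, 7.0])
prior_gpa   = np.array([3.1, 3.8, 2.5, 3.9])

# Design matrix X: shape (n, k + 1); the first column of ones represents the intercept.
X = np.column_stack([np.ones(len(study_hours)), study_hours, prior_gpa])
print(X.shape)  # (4, 3)
print(X)
```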

Part 3: The 'Master Formula' and Its Interpretation

The beauty of the matrix form is that the OLS solution we derived in the Linear Algebra path works perfectly, unchanged. The formula doesn't care if $\mathbf{X}$ has 2 columns or 2000 columns.

The OLS Estimator for MLR

$$\hat{\bm{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}$$

The Crucial Interpretation of a Coefficient, $\hat{\beta}_j$

In MLR, the interpretation of a coefficient gains a critical new phrase:

"β^j\hat{\beta}_j is the estimated change in YY for a one-unit increase in XjX_j, **holding all other variables in the model constant**."

This is the mathematical equivalent of "controlling for" the other factors.
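
In code, the master formula is a one-liner. The sketch below is a minimal illustration using a hypothetical design matrix and outcome vector (the exam-score numbers are made up): the first estimate mirrors the formula literally, while `np.linalg.lstsq` is the numerically safer way to get the same answer in practice.

```python
import numpy as np

# Hypothetical design matrix (intercept, study hours, prior GPA) and exam scores.
X = np.column_stack([np.ones(4),
                     [2.0, 5.5, 1.0, 7.0],    # study hours
                     [3.1, 3.8, 2.5, 3.9]])   # prior GPA
y = np.array([65.0, 88.0, 51.0, 95.0])        # exam scores

# Literal translation of beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent (avoids forming the inverse explicitly).
beta_hat_stable = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_hat)         # [intercept, slope for hours, slope for GPA]
print(beta_hat_stable)  # same values up to floating-point error
```

Each slope in the resulting vector is read with the interpretation above: the estimated change in the outcome for a one-unit increase in that predictor, holding the other predictors constant.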

Part 4: The Problem with R² in MLR

In MLR, the standard $R^2 = 1 - \text{SSR}/\text{TSS}$ is a flawed metric. Why? Adding *any* variable to the model, even a column of random garbage, will **never cause the R² to decrease**. At worst, the SSR stays the same; usually, it goes down a tiny bit just by chance, pushing R² up.

This encourages overfitting. We need a metric that penalizes us for adding useless variables. That metric is **Adjusted R²**.

Definition: Adjusted R-Squared

Adjusted R² modifies the regular R² by dividing the SSR and TSS by their respective degrees of freedom.

$$\text{Adjusted } R^2 = 1 - \frac{\text{SSR}/(n-k-1)}{\text{TSS}/(n-1)}$$
  • $n$ is the number of observations.
  • $k$ is the number of predictors.

Adding a new variable increases $k$, which shrinks the denominator $n-k-1$ and inflates the penalized residual term $\text{SSR}/(n-k-1)$. If the new variable doesn't reduce SSR by enough to offset this penalty, the Adjusted R² will actually go **down**. This makes it a much more honest measure of a model's explanatory power.
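
To see the penalty in action, here is a small sketch (hypothetical one-predictor data generated with NumPy): appending a column of pure noise nudges R² up slightly, but Adjusted R² will typically fall.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: y truly depends on a single predictor.
x_real = rng.normal(size=n)
y = 3.0 + 2.0 * x_real + rng.normal(size=n)

def r2_scores(features):
    """Return (R^2, Adjusted R^2) for an OLS fit with an intercept."""
    X = np.column_stack([np.ones(n)] + features)
    k = X.shape[1] - 1                         # number of predictors
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = np.sum((y - X @ beta) ** 2)          # sum of squared residuals
    tss = np.sum((y - y.mean()) ** 2)          # total sum of squares
    return 1 - ssr / tss, 1 - (ssr / (n - k - 1)) / (tss / (n - 1))

junk = rng.normal(size=n)                      # a column of random garbage
print(r2_scores([x_real]))                     # baseline R^2 and Adjusted R^2
print(r2_scores([x_real, junk]))               # R^2 creeps up; Adjusted R^2 typically drops
```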

What's Next? The Engine of Learning

We've upgraded our model to handle multiple inputs. The matrix formula $\hat{\bm{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ gives us the answer in one clean shot. This is called an **analytical solution**.

But for most advanced machine learning models (like neural networks), a clean analytical solution doesn't exist. We can't just solve a formula. Instead, we have to find the best parameters through an iterative search process.

In the next lesson, we will open the black box of "training" and learn the single most important algorithm in all of machine learning: **Gradient Descent**.