Lesson 4.4: The 'Master Formula' Derivation (MLR)

This lesson confirms and formalizes our 'master formula.' We will perform the 'no-skip' matrix calculus derivation of the Multiple Linear Regression (MLR) estimator, showing that the result is the general solution for any number of variables.

Part 1: The MLR Objective Function

Our goal remains identical to the simple case: we must find the vector of coefficients, $\bm{\hat{\beta}}$, that minimizes the Sum of Squared Residuals (SSR).

The power of matrix algebra is that the objective function looks the same, whether we have one predictor or a thousand.

MLR Objective Function (SSR)

Minimize the scalar value $S(\bm{\hat{\beta}})$, which is the sum of squared elements of the residual vector $\mathbf{e} = \mathbf{y} - \mathbf{X}\bm{\hat{\beta}}$:

$$\min_{\bm{\hat{\beta}}} S(\bm{\hat{\beta}}) = \mathbf{e}^T \mathbf{e} = (\mathbf{y} - \mathbf{X}\bm{\hat{\beta}})^T (\mathbf{y} - \mathbf{X}\bm{\hat{\beta}})$$
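To make the objective concrete, here is a minimal NumPy sketch; the `ssr` helper and the small simulated design matrix are illustrative assumptions, not part of the lesson's own data.

```python
import numpy as np

# Hypothetical toy data: n = 5 observations, an intercept column plus 2 predictors.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # design matrix (5 x 3)
y = rng.normal(size=5)                                      # response vector

def ssr(beta_hat, X, y):
    """Sum of squared residuals S(beta_hat) = e'e, with e = y - X @ beta_hat."""
    e = y - X @ beta_hat
    return e @ e  # scalar

print(ssr(np.zeros(X.shape[1]), X, y))  # at beta_hat = 0 the SSR is just y'y
```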

Part 2: The Matrix Calculus Derivation

The Full 'No-Skip' Derivation

Step 1: Expand the Objective Function. Using the transpose rule $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T\mathbf{A}^T$:

$$S(\bm{\hat{\beta}}) = (\mathbf{y}^T - \bm{\hat{\beta}}^T\mathbf{X}^T)(\mathbf{y} - \mathbf{X}\bm{\hat{\beta}})$$

Now, expand the product (like FOIL):

$$S(\bm{\hat{\beta}}) = \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\bm{\hat{\beta}} - \bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{y} + \bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}}$$

The two middle terms are both scalars ($1 \times 1$ matrices), and the transpose of a scalar is itself. Therefore, $\mathbf{y}^T\mathbf{X}\bm{\hat{\beta}} = (\mathbf{y}^T\mathbf{X}\bm{\hat{\beta}})^T = \bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{y}$. We can combine them:

$$S(\bm{\hat{\beta}}) = \mathbf{y}^T\mathbf{y} - 2\bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{y} + \bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}}$$
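If you want to sanity-check the algebra, a quick numerical comparison (again on hypothetical simulated data) confirms that the expanded form equals $\mathbf{e}^T\mathbf{e}$:

```python
import numpy as np

# Check numerically that y'y - 2 b'X'y + b'X'X b equals (y - Xb)'(y - Xb).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(6), rng.normal(size=(6, 2))])
y = rng.normal(size=6)
b = rng.normal(size=3)  # an arbitrary candidate coefficient vector

e = y - X @ b
lhs = e @ e                                           # compact form e'e
rhs = y @ y - 2 * b @ (X.T @ y) + b @ (X.T @ X) @ b   # expanded form
print(np.isclose(lhs, rhs))  # True, up to floating-point error
```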

Step 2: Differentiate with respect to the vector $\bm{\hat{\beta}}$. To find the minimum, we take the derivative (gradient) and set it to the zero vector.

Using the standard matrix calculus rules $\frac{\partial(\mathbf{a}^T\mathbf{x})}{\partial\mathbf{x}} = \mathbf{a}$ and $\frac{\partial(\mathbf{x}^T\mathbf{A}\mathbf{x})}{\partial\mathbf{x}} = 2\mathbf{A}\mathbf{x}$ (the latter applies because $\mathbf{X}^T\mathbf{X}$ is symmetric):

$$\frac{\partial S}{\partial \bm{\hat{\beta}}} = \mathbf{0} - 2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}}$$
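As an optional check, the sketch below compares this analytic gradient with a central finite-difference approximation on hypothetical simulated data; the two should agree up to floating-point error.

```python
import numpy as np

# Compare the analytic gradient -2 X'y + 2 X'X b with central finite differences.
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
y = rng.normal(size=8)
b = rng.normal(size=3)

def S(b):
    e = y - X @ b
    return e @ e

grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ b

h = 1e-6
grad_numeric = np.array([
    (S(b + h * np.eye(3)[j]) - S(b - h * np.eye(3)[j])) / (2 * h)
    for j in range(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```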

Step 3: Set to zero and solve. This gives us the famous **Normal Equations**.

$$2\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}} = 2\mathbf{X}^T\mathbf{y}$$
$$(\mathbf{X}^T\mathbf{X})\bm{\hat{\beta}} = \mathbf{X}^T\mathbf{y}$$

Step 4: Isolate the estimator $\bm{\hat{\beta}}$. Assuming the matrix $\mathbf{X}^T\mathbf{X}$ is invertible (i.e., no perfect multicollinearity), we pre-multiply both sides by its inverse to get the final solution.
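To see why the invertibility caveat matters, the following sketch (with a hypothetical, deliberately duplicated predictor) shows that perfect multicollinearity makes $\mathbf{X}^T\mathbf{X}$ rank-deficient, so its inverse does not exist:

```python
import numpy as np

# Hypothetical example of perfect multicollinearity: the third column copies the second.
rng = np.random.default_rng(3)
x1 = rng.normal(size=10)
X_bad = np.column_stack([np.ones(10), x1, x1])                   # duplicated predictor
X_ok  = np.column_stack([np.ones(10), x1, rng.normal(size=10)])  # distinct predictors

print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # 2 < 3: X'X is singular, no unique solution
print(np.linalg.matrix_rank(X_ok.T @ X_ok))    # 3: full rank, the inverse exists
```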

The OLS Estimator (The Master Formula)

The Multiple Linear Regression OLS estimator is:

$$\bm{\hat{\beta}}_{\text{OLS}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$
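Here is a minimal NumPy sketch applying the master formula to hypothetical simulated data. Numerically, it is generally better to solve the normal equations (or use a least-squares routine such as `np.linalg.lstsq`) than to form the inverse explicitly.

```python
import numpy as np

# Apply the master formula to simulated data and cross-check with np.linalg.lstsq.
rng = np.random.default_rng(4)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept + k predictors
beta_true = np.array([1.0, 2.0, -0.5, 0.3])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# (X'X)^{-1} X'y, computed by solving the normal equations rather than inverting.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True: both routes give the same estimate
print(beta_hat)                           # close to beta_true
```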

Part 3: The Geometric Interpretation

The Geometry of Orthogonality

The Normal Equations, $(\mathbf{X}^T\mathbf{X})\bm{\hat{\beta}} = \mathbf{X}^T\mathbf{y}$, can be rewritten as:

$$\mathbf{X}^T(\mathbf{y} - \mathbf{X}\bm{\hat{\beta}}) = \mathbf{0}$$

Since the residual vector is defined as $\mathbf{e} = \mathbf{y} - \mathbf{X}\bm{\hat{\beta}}$, this means:

$$\mathbf{X}^T\mathbf{e} = \mathbf{0}$$

This is a profound geometric statement. It says that the residual vector $\mathbf{e}$ is **orthogonal (perpendicular)** to every single column of the design matrix $\mathbf{X}$. In other words, OLS chooses the coefficients so that the resulting "noise" (the residuals) is orthogonal to all of the "signals" (the predictors); when the model includes an intercept, this also means the residuals are uncorrelated with every predictor in the sample.
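As a quick numerical illustration (on hypothetical simulated data), fitting OLS and then computing $\mathbf{X}^T\mathbf{e}$ should return a vector of zeros up to floating-point noise:

```python
import numpy as np

# Verify the orthogonality condition X'e = 0 on simulated data.
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([1.0, 0.5, -1.0]) + rng.normal(size=50)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # OLS via the normal equations
e = y - X @ beta_hat                          # residual vector

print(X.T @ e)                             # ~ zero vector (floating-point noise)
print(np.allclose(X.T @ e, 0, atol=1e-8))  # True
```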

What's Next? Proving It's the 'Best'

We have now rigorously derived the 'master formula' for the OLS coefficients for any number of variables. This formula gives us an *estimate*.

But is this estimate any good? Is it unbiased? Is it the most precise estimate we could possibly find? Under what conditions can we trust it?

In the next lesson, we will begin Act II by laying out the **Classical Linear Model Assumptions**—the five sacred rules that, if met, guarantee our OLS estimator has excellent properties.