Lesson 4.1: The Full OLS Derivation (SLR)

This is the foundational mathematical workout for the entire module. We start with our objective—to find the 'best' line—and rigorously derive the exact formulas for the OLS estimators using both calculus and matrix algebra, with no steps skipped.

Part 1: The Objective - Minimizing the Error

In the previous lesson, we established our goal. We have a cloud of data points $(X_i, Y_i)$ and we want to find the line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ that is the "best fit."

We defined "best" in a very specific way: the line that minimizes the sum of the squared differences between our actual data ($Y_i$) and our predicted values ($\hat{Y}_i$).

The OLS Objective Function

Find the values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the Sum of Squared Residuals (SSR):

$$S(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^n (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

This is a classic optimization problem. We will now solve it in two ways: the intuitive calculus way, and the powerful matrix algebra way.
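
To make the objective concrete, here is a minimal Python/NumPy sketch (with made-up toy data, purely illustrative) that evaluates the SSR for a couple of candidate lines; OLS is the procedure that finds the $(\hat{\beta}_0, \hat{\beta}_1)$ pair making this quantity as small as possible:

```python
import numpy as np

# Hypothetical toy data (purely illustrative).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def ssr(b0, b1, X, Y):
    """Sum of squared residuals for the candidate line Y-hat = b0 + b1 * X."""
    residuals = Y - (b0 + b1 * X)
    return np.sum(residuals ** 2)

# Two candidate lines: different (b0, b1) guesses give different SSR values.
print(ssr(0.0, 2.0, X, Y))
print(ssr(0.5, 1.5, X, Y))
```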

Part 2: The Calculus Derivation ('No-Skip' Version)

The Calculus Toolkit

To find the minimum of a function with two variables ($\hat{\beta}_0$, $\hat{\beta}_1$), we must take the partial derivative with respect to each variable, set both derivatives to zero, and solve the resulting system of two equations.

Step 1: Minimize with respect to β̂₀ (The Intercept)

We take the partial derivative of the SSR and set it to zero. By the chain rule, the derivative of the inside term with respect to $\hat{\beta}_0$ is $-1$.

$$\frac{\partial S}{\partial \hat{\beta}_0} = \sum 2(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) \cdot (-1) = 0$$

Divide by $-2$ and distribute the sum:

$$\sum Y_i - \sum \hat{\beta}_0 - \sum \hat{\beta}_1 X_i = 0$$

Applying the summation rules $\sum C = nC$ and $\sum CX_i = C \sum X_i$:

$$\sum Y_i - n\hat{\beta}_0 - \hat{\beta}_1 \sum X_i = 0$$

Now, we solve for $\hat{\beta}_0$:

$$n\hat{\beta}_0 = \sum Y_i - \hat{\beta}_1 \sum X_i$$

Dividing by $n$ and recognizing the definitions of the sample means, $\bar{Y} = \frac{1}{n}\sum Y_i$ and $\bar{X} = \frac{1}{n}\sum X_i$, gives our first key result:

Result 1: The OLS Intercept

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
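
One immediate consequence of Result 1 (just rearranging it, nothing new assumed):

$$\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}$$

which says the OLS line always passes through the point of means $(\bar{X}, \bar{Y})$.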

Step 2: Minimize with respect to β̂₁ (The Slope)

We take the partial derivative with respect to $\hat{\beta}_1$. The chain-rule derivative of the inside term is now $-X_i$.

$$\frac{\partial S}{\partial \hat{\beta}_1} = \sum 2(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) \cdot (-X_i) = 0$$

Divide by $-2$ and distribute the $X_i$ term:

$$\sum (X_i Y_i - \hat{\beta}_0 X_i - \hat{\beta}_1 X_i^2) = 0$$

This is our second "First-Order Condition."
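
For reference, rearranging the two first-order conditions gives the pair of simultaneous equations (the "normal equations" in scalar form) that Step 3 solves:

$$\sum Y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum X_i$$
$$\sum X_i Y_i = \hat{\beta}_0 \sum X_i + \hat{\beta}_1 \sum X_i^2$$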

Step 3: Solve the System of Equations

We now substitute our result for $\hat{\beta}_0$ from Step 1 into our second equation from Step 2:

$$\sum (X_i Y_i - (\bar{Y} - \hat{\beta}_1 \bar{X}) X_i - \hat{\beta}_1 X_i^2) = 0$$

Distribute and then group the terms that contain $\hat{\beta}_1$:

$$\sum (X_i Y_i - \bar{Y}X_i) + \sum (\hat{\beta}_1 \bar{X} X_i - \hat{\beta}_1 X_i^2) = 0$$
$$\sum (X_i Y_i - \bar{Y}X_i) + \hat{\beta}_1 \sum (\bar{X} X_i - X_i^2) = 0$$

Now, moving the $\hat{\beta}_1$ term to the other side (which flips the signs inside the denominator sum), we can solve for $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{\sum (X_i Y_i - \bar{Y}X_i)}{\sum (X_i^2 - \bar{X}X_i)}$$

This is correct, but ugly. Using the standard algebraic identities $\sum (X_i Y_i - \bar{Y}X_i) = \sum (X_i - \bar{X})(Y_i - \bar{Y})$ and $\sum (X_i^2 - \bar{X}X_i) = \sum (X_i - \bar{X})^2$, the numerator becomes the numerator of the sample covariance and the denominator becomes the numerator of the sample variance. This gives the final, beautiful result:

Result 2: The OLS Slope Estimator

$$\hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}$$
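
To see the two results in action, here is a minimal NumPy sketch on hypothetical toy data; the `np.polyfit` call at the end is just an independent least-squares cross-check:

```python
import numpy as np

# Hypothetical toy data (purely illustrative).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

X_bar, Y_bar = X.mean(), Y.mean()

# Result 2: slope = sum of cross-deviations / sum of squared X-deviations.
beta1_hat = np.sum((X - X_bar) * (Y - Y_bar)) / np.sum((X - X_bar) ** 2)

# Result 1: intercept = Y-bar minus slope times X-bar.
beta0_hat = Y_bar - beta1_hat * X_bar

print(beta0_hat, beta1_hat)

# Cross-check against NumPy's degree-1 least-squares fit (returns [slope, intercept]).
print(np.polyfit(X, Y, deg=1))
```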

Part 3: The Matrix Derivation (The General 'Master' Formula)

That calculus was intense. The matrix algebra approach is more abstract but far more powerful, giving a single formula that works for one predictor or one thousand.

Deriving the Matrix ('Normal') Equations

Step 1: Write the SSR in matrix form.

$$\text{SSR} = \mathbf{e}^T\mathbf{e} = (\mathbf{y} - \mathbf{X}\bm{\hat{\beta}})^T (\mathbf{y} - \mathbf{X}\bm{\hat{\beta}})$$

Step 2: Expand the expression. (The two cross terms, $\mathbf{y}^T\mathbf{X}\bm{\hat{\beta}}$ and $\bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{y}$, are scalars and transposes of each other, so they are equal and combine into the single $-2$ term.)

$$\text{SSR} = \mathbf{y}^T\mathbf{y} - 2\bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{y} + \bm{\hat{\beta}}^T\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}}$$

Step 3: Differentiate with respect to the vector $\bm{\hat{\beta}}$ and set the result to zero. (This uses the matrix calculus rules summarized just below.)

$$\frac{\partial S}{\partial \bm{\hat{\beta}}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}} = \mathbf{0}$$
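
The two matrix-calculus rules being used are the vector analogues of the scalar rules $\frac{d(ab)}{db} = a$ and $\frac{d(ab^2)}{db} = 2ab$. For a constant vector $\mathbf{a}$ and a symmetric matrix $\mathbf{A}$:

$$\frac{\partial (\mathbf{a}^T\bm{\hat{\beta}})}{\partial \bm{\hat{\beta}}} = \mathbf{a}, \qquad \frac{\partial (\bm{\hat{\beta}}^T\mathbf{A}\bm{\hat{\beta}})}{\partial \bm{\hat{\beta}}} = 2\mathbf{A}\bm{\hat{\beta}}$$

Applying them with $\mathbf{a} = \mathbf{X}^T\mathbf{y}$ and $\mathbf{A} = \mathbf{X}^T\mathbf{X}$ (which is always symmetric) gives the two terms above.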

Step 4: Solve for $\bm{\hat{\beta}}$.

$$\mathbf{X}^T\mathbf{X}\bm{\hat{\beta}} = \mathbf{X}^T\mathbf{y}$$

The OLS Estimator in Matrix Form

Solving the Normal Equations by pre-multiplying both sides by $(\mathbf{X}^T\mathbf{X})^{-1}$ (which exists as long as the columns of $\mathbf{X}$ are linearly independent) gives the master formula:

$$\bm{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}$$
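
As a quick illustration, here is a minimal NumPy sketch of the master formula on hypothetical toy data. The explicit inverse mirrors the formula as written; in practice, solving the Normal Equations directly (or using a dedicated least-squares routine) is numerically preferable:

```python
import numpy as np

# Hypothetical toy data (purely illustrative).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix: a column of ones (for the intercept) next to the X values.
X = np.column_stack([np.ones_like(x), x])

# Master formula, written exactly as derived (explicit inverse, for illustration only).
beta_hat = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(beta_hat)  # [beta0_hat, beta1_hat]

# Numerically safer equivalent: solve the Normal Equations directly.
print(np.linalg.solve(X.T @ X, X.T @ y))
```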

Part 4: The Grand Unification

This is the moment of truth. We must prove that the abstract matrix formula gives the exact same answer as our intuitive calculus formula for $\hat{\beta}_1$ in the simple, one-variable case. This is a tough but essential 'no-skip' proof.

Proof: The Matrix Formula Simplifies to the Calculus Formula

For SLR, we define our matrices:

$$\mathbf{X} = \begin{bmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \quad \bm{\hat{\beta}} = \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix}$$

1. Calculate $\mathbf{X}^T\mathbf{X}$:

$$\mathbf{X}^T\mathbf{X} = \begin{bmatrix} n & \sum X_i \\ \sum X_i & \sum X_i^2 \end{bmatrix}$$

2. Calculate $\mathbf{X}^T\mathbf{y}$:

$$\mathbf{X}^T\mathbf{y} = \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$$

3. Calculate the inverse $(\mathbf{X}^T\mathbf{X})^{-1}$:

The determinant is $D = n\sum X_i^2 - (\sum X_i)^2 = n \sum(X_i - \bar{X})^2$. Applying the standard $2 \times 2$ inverse formula, $\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$:

$$(\mathbf{X}^T\mathbf{X})^{-1} = \frac{1}{n \sum(X_i - \bar{X})^2} \begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix}$$

4. Assemble $\bm{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T\mathbf{y}$:

$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = \frac{1}{n \sum(X_i - \bar{X})^2} \begin{bmatrix} \sum X_i^2 & -\sum X_i \\ -\sum X_i & n \end{bmatrix} \begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix}$$

5. Solve for $\hat{\beta}_1$ (the second row):

$$\hat{\beta}_1 = \frac{1}{n \sum(X_i - \bar{X})^2} \left( -\left(\sum X_i\right)\left(\sum Y_i\right) + n\sum X_i Y_i \right)$$
$$\hat{\beta}_1 = \frac{n\sum X_i Y_i - \sum X_i \sum Y_i}{n \sum(X_i - \bar{X})^2}$$

The numerator simplifies via the identity $\sum(X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - n\bar{X}\bar{Y} = \sum X_i Y_i - \tfrac{1}{n}\left(\sum X_i\right)\left(\sum Y_i\right)$: multiplying through by $n$ shows the numerator equals $n \sum(X_i - \bar{X})(Y_i - \bar{Y})$.

$$\hat{\beta}_1 = \frac{n \sum(X_i - \bar{X})(Y_i - \bar{Y})}{n \sum(X_i - \bar{X})^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$

Q.E.D. The formulas are identical. The matrix method is confirmed.
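
The same agreement can be checked numerically. A minimal sketch (again with hypothetical toy data) that computes the estimates by both routes and confirms they match:

```python
import numpy as np

# Hypothetical toy data (purely illustrative).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Route 1: the calculus (scalar) formulas.
b1_scalar = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_scalar = y.mean() - b1_scalar * x.mean()

# Route 2: the matrix formula (via the Normal Equations).
X = np.column_stack([np.ones_like(x), x])
b0_matrix, b1_matrix = np.linalg.solve(X.T @ X, X.T @ y)

# The two routes agree to floating-point precision.
assert np.allclose([b0_scalar, b1_scalar], [b0_matrix, b1_matrix])
print(b0_scalar, b1_scalar)
```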

What's Next? Is Our Line Any Good?

We have done it. We've opened the black box and rigorously derived the exact formulas for the 'best' fitting line using two different methods.

But this only gives us the line itself. It doesn't tell us how well that line actually fits the data. Does it explain 80% of the variation in our Y variable, or only 2%?

In the next lesson, we will develop the tools to answer this, including the single most famous statistic in data analysis: **R-Squared (R²)**.

Up Next: How Good is Our Fit? R-Squared and Residuals