Lesson 4.6: The Gauss-Markov Theorem and the BLUE Property
This is the most important theoretical result in classical econometrics. We will rigorously prove that, under the Classical Linear Model assumptions, the Ordinary Least Squares (OLS) estimator is the 'Best Linear Unbiased Estimator' (BLUE). This theorem is the fundamental justification for why OLS is the default method for linear modeling.
Part 1: The OLS Assumptions (The Rules of the Game)
The Gauss-Markov Theorem only holds if a set of assumptions about the true error term ($\varepsilon$) and the data ($X$, $y$) is satisfied. These are often called the classical assumptions.
The Five Gauss-Markov Assumptions (MLR Matrix Form)
- Linearity in Parameters: The true relationship is linear in the parameters: $y = X\beta + \varepsilon$.
- Random Sample: The data $\{(x_i, y_i)\}_{i=1}^{n}$ are a random sample of size $n$ from the population.
- No Perfect Collinearity: $X$ must have full column rank, meaning the matrix $X'X$ is invertible (a quick numerical check appears below).
- Zero Conditional Mean (Exogeneity): $E[\varepsilon \mid X] = 0$. The error term is, on average, zero for any value of the predictors.
- Homoskedasticity & No Autocorrelation: The variance of the errors is constant ($\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$), and the errors are uncorrelated with each other. This is expressed in matrix form as $E[\varepsilon\varepsilon' \mid X] = \sigma^2 I_n$.
The theorem states that if assumptions 1 through 5 hold, then OLS is the Best Linear Unbiased Estimator (BLUE).
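As a quick illustration of Assumption 3, here is a minimal sketch in plain NumPy (all data values are made up): it checks whether a design matrix has full column rank, and shows how adding an exact linear combination of existing columns makes $X'X$ singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Design matrix with an intercept and two regressors.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

# Full column rank: rank equals the number of columns, so X'X is invertible.
print(np.linalg.matrix_rank(X) == X.shape[1])          # True

# Perfect collinearity: x3 is an exact linear combination of x1 and x2,
# so the rank drops below the number of columns and X'X becomes singular.
x3 = 2.0 * x1 - 0.5 * x2
X_bad = np.column_stack([np.ones(n), x1, x2, x3])
print(np.linalg.matrix_rank(X_bad) == X_bad.shape[1])   # False
```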
Part 2: Proof of Unbiasedness (The "U" in BLUE)
An estimator $\hat{\beta}$ is unbiased if its expected value is equal to the true population parameter: $E[\hat{\beta}] = \beta$.
If an estimator is unbiased, it means that if you could repeat your sampling experiment many times, the average of all the estimated $\hat{\beta}$ vectors would converge to the true vector $\beta$.
We start with the OLS master formula $\hat{\beta}_{OLS} = (X'X)^{-1}X'y$ and substitute the true model ($y = X\beta + \varepsilon$) into it.
Proof that E[β̂_OLS] = β
Step 1: Substitute the True Model
$$\hat{\beta}_{OLS} = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon)$$
Step 2: Expand and Simplify
$$\hat{\beta}_{OLS} = (X'X)^{-1}X'X\beta + (X'X)^{-1}X'\varepsilon$$
Since $(X'X)^{-1}X'X$ is the identity matrix $I$:
$$\hat{\beta}_{OLS} = \beta + (X'X)^{-1}X'\varepsilon$$
Step 3: Take the Expected Value
We take the expected value of both sides, conditional on $X$. Since $X$ is treated as fixed (non-stochastic) in this context, we can pull $(X'X)^{-1}X'$ outside the expectation operator:
$$E[\hat{\beta}_{OLS} \mid X] = E[\beta \mid X] + (X'X)^{-1}X'\,E[\varepsilon \mid X]$$
Since $\beta$ is a fixed, non-random vector of parameters, $E[\beta \mid X] = \beta$:
$$E[\hat{\beta}_{OLS} \mid X] = \beta + (X'X)^{-1}X'\,E[\varepsilon \mid X]$$
Step 4: Apply Assumption 4 (Exogeneity)
By OLS assumption 4 (Zero Conditional Mean, $E[\varepsilon \mid X] = 0$):
$$E[\hat{\beta}_{OLS} \mid X] = \beta + (X'X)^{-1}X' \cdot 0 = \beta$$
The OLS estimator is therefore unbiased.
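The unbiasedness result is easy to see numerically. Below is a minimal Monte Carlo sketch (all parameter values are made up for illustration): we hold $X$ fixed, redraw the errors many times, and check that the average of the OLS estimates lands on the true $\beta$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sims = 200, 5000
beta_true = np.array([1.0, 2.0, -0.5])   # illustrative intercept and two slopes

# Fixed design matrix, matching the conditional-on-X argument above.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

estimates = np.empty((n_sims, 3))
for s in range(n_sims):
    eps = rng.normal(scale=1.5, size=n)                 # homoskedastic, mean-zero errors
    y = X @ beta_true + eps
    estimates[s] = np.linalg.solve(X.T @ X, X.T @ y)    # OLS: (X'X)^{-1} X'y

# The average of the estimates should be very close to beta_true.
print(beta_true)
print(estimates.mean(axis=0))
```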
Part 3: Proof of Efficiency (The "B" in BLUE)
The second part of the theorem proves that OLS is the "Best" estimator. "Best" means it has the smallest variance (the tightest sampling distribution) among all linear unbiased estimators.
Let's define any other linear estimator, $\tilde{\beta}$, as a linear combination of the dependent variable vector $y$:
$$\tilde{\beta} = Cy$$
For $\tilde{\beta}$ to be linear and unbiased, the matrix $C$ must satisfy certain conditions. For this proof, we define $C$ such that it differs from the OLS weighting matrix by some non-zero matrix $D$:
$$C = (X'X)^{-1}X' + D$$
where the first term is the OLS weighting matrix $(X'X)^{-1}X'$.
We must first find the constraint that $D$ must satisfy to keep $\tilde{\beta}$ unbiased.
Constraint for Unbiasedness
We require $E[\tilde{\beta} \mid X] = \beta$.
Since $\tilde{\beta} = Cy = C(X\beta + \varepsilon)$:
$$E[\tilde{\beta} \mid X] = CX\beta + C\,E[\varepsilon \mid X] = CX\beta$$
For $E[\tilde{\beta} \mid X] = \beta$ to hold for every possible $\beta$, we must have $CX = I$.
Substituting the definition of $C$:
$$CX = (X'X)^{-1}X'X + DX = I + DX$$
Since $(X'X)^{-1}X'X = I$, requiring $CX = I$ forces $DX = 0$.
The constraint on the alternative estimator is that the difference matrix $D$ must satisfy $DX = 0$.
The variance-covariance matrix of an estimator $\hat{\beta}$ is $\mathrm{Var}(\hat{\beta} \mid X) = E\big[(\hat{\beta} - E[\hat{\beta}])(\hat{\beta} - E[\hat{\beta}])' \mid X\big]$.
Variance of OLS ($\hat{\beta}_{OLS}$)
From Step 2 of the unbiasedness proof, we know $\hat{\beta}_{OLS} - \beta = (X'X)^{-1}X'\varepsilon$, so
$$\mathrm{Var}(\hat{\beta}_{OLS} \mid X) = (X'X)^{-1}X'\,E[\varepsilon\varepsilon' \mid X]\,X(X'X)^{-1}$$
Applying Assumption 5 ($E[\varepsilon\varepsilon' \mid X] = \sigma^2 I_n$):
Variance of the OLS Estimator
$$\mathrm{Var}(\hat{\beta}_{OLS} \mid X) = \sigma^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma^2(X'X)^{-1}$$
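This formula can be verified numerically. Below is a minimal sketch (made-up numbers) comparing the analytic covariance $\sigma^2(X'X)^{-1}$ with the empirical covariance of simulated OLS estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims, sigma = 200, 10000, 1.5
beta_true = np.array([1.0, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design

# Analytic variance-covariance matrix: sigma^2 (X'X)^{-1}
analytic = sigma**2 * np.linalg.inv(X.T @ X)

draws = np.empty((n_sims, 2))
for s in range(n_sims):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    draws[s] = np.linalg.solve(X.T @ X, X.T @ y)

# Empirical covariance of the simulated estimates should match the formula.
empirical = np.cov(draws, rowvar=False)
print(analytic)
print(empirical)
```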
Variance of the Alternative Estimator ($\tilde{\beta}$)
Under the classical assumptions, $\tilde{\beta} - \beta = C\varepsilon$, so $\mathrm{Var}(\tilde{\beta} \mid X) = C\,E[\varepsilon\varepsilon' \mid X]\,C' = \sigma^2 CC'$.
Expanding the product:
$$CC' = \big[(X'X)^{-1}X' + D\big]\big[(X'X)^{-1}X' + D\big]' = (X'X)^{-1} + (X'X)^{-1}X'D' + DX(X'X)^{-1} + DD'$$
The cross-product terms are zero because $DX = 0$ (and hence $X'D' = (DX)' = 0$).
The variance equation simplifies to:
$$\mathrm{Var}(\tilde{\beta} \mid X) = \sigma^2\big[(X'X)^{-1} + DD'\big]$$
We substitute $\mathrm{Var}(\hat{\beta}_{OLS} \mid X) = \sigma^2(X'X)^{-1}$:
$$\mathrm{Var}(\tilde{\beta} \mid X) = \mathrm{Var}(\hat{\beta}_{OLS} \mid X) + \sigma^2 DD'$$
The difference between the variance of the alternative estimator and the OLS estimator is $\sigma^2 DD'$. Since $DD'$ is a positive semi-definite matrix (for any vector $a$, $a'DD'a = \lVert D'a \rVert^2 \geq 0$), the variance of the OLS estimator is smaller than or equal to the variance of any other linear unbiased estimator.
The Gauss-Markov Theorem
Under the five classical OLS assumptions, the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE).
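To see the "Best" part concretely: any other linear unbiased estimator must have at least as much variance as OLS. One simple (hypothetical) member of that class is OLS run on only the first half of the sample; it is still linear and unbiased, but it discards information, so the theorem predicts a larger variance. A minimal Monte Carlo sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_sims, sigma = 200, 10000, 1.0
beta_true = np.array([0.5, 2.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])

ols_draws = np.empty((n_sims, 2))
alt_draws = np.empty((n_sims, 2))
for s in range(n_sims):
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    # OLS uses every observation.
    ols_draws[s] = np.linalg.solve(X.T @ X, X.T @ y)
    # Alternative linear unbiased estimator: OLS on the first half only.
    Xh, yh = X[: n // 2], y[: n // 2]
    alt_draws[s] = np.linalg.solve(Xh.T @ Xh, Xh.T @ yh)

# Both are centred on beta_true, but the alternative has larger variance.
print(ols_draws.mean(axis=0), alt_draws.mean(axis=0))
print(ols_draws.var(axis=0), alt_draws.var(axis=0))
```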
Part 4: Connecting to the Real World (ML & Finance)
The Gauss-Markov theorem beautifully isolates the Unbiasedness and Variance components, which are the two central concerns in the ML Bias-Variance Trade-off.
- High Variance (Not Best): In ML terms, an OLS estimator with large variance is overfitting the training data. This is why we sometimes abandon OLS (stepping outside the class of linear unbiased estimators) for techniques like Ridge Regression or Lasso Regression.
- Introducing Bias (Lasso/Ridge): Lasso and Ridge introduce a small amount of intentional bias (meaning $E[\hat{\beta}] \neq \beta$) to dramatically reduce the estimator's variance. In practice, this trade-off often leads to better predictive performance on unseen data, showing that sometimes being BLUE is less important than being a robust estimator with low variance, as the sketch below illustrates.
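A minimal sketch of that trade-off, using the closed-form ridge estimator $(X'X + \lambda I)^{-1}X'y$ on two highly correlated, made-up regressors: ridge is visibly biased, but its variance is far lower than that of OLS.

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_sims, lam = 50, 5000, 10.0
beta_true = np.array([1.0, 1.0])

# Two highly correlated regressors -> OLS has very high variance.
base = rng.normal(size=n)
X = np.column_stack([base + 0.05 * rng.normal(size=n),
                     base + 0.05 * rng.normal(size=n)])

ols = np.empty((n_sims, 2))
ridge = np.empty((n_sims, 2))
I2 = np.eye(2)
for s in range(n_sims):
    y = X @ beta_true + rng.normal(size=n)
    ols[s] = np.linalg.solve(X.T @ X, X.T @ y)
    ridge[s] = np.linalg.solve(X.T @ X + lam * I2, X.T @ y)   # ridge: (X'X + lam I)^{-1} X'y

# Ridge is biased (its mean is pulled away from beta_true) but far less variable.
print("mean:", ols.mean(axis=0), ridge.mean(axis=0))
print("var: ", ols.var(axis=0), ridge.var(axis=0))
```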
In finance, the variance of an estimator is directly linked to estimation risk.
- Estimation Risk: In quantitative trading, when we use OLS to estimate factor exposures (betas) for hedging, we need the most stable, efficient estimate possible. The Gauss-Markov theorem assures the quant that the OLS formula provides the most precise (lowest variance) linear unbiased hedge ratios available, *provided the market is well-behaved* (i.e., satisfies the OLS assumptions).
- When Assumptions Fail: The moment a market becomes highly volatile (violating the homoskedasticity assumption, so that $\mathrm{Var}(\varepsilon_i \mid X)$ is no longer a constant $\sigma^2$), the Gauss-Markov guarantee no longer applies. The OLS estimator is no longer BLUE, forcing quants to switch to methods like Generalised Least Squares (GLS) to recover efficiency, or to heteroskedasticity-robust standard errors to keep inference valid (see the sketch below).
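A minimal sketch of that failure mode, assuming the statsmodels library is available: we generate errors whose variance grows with the regressor, then compare classical standard errors with heteroskedasticity-consistent (HC1) ones. The point estimates are identical; only the uncertainty estimates change.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 500
x = rng.normal(size=n)
X = sm.add_constant(x)

# Heteroskedastic errors: the noise grows with |x|, violating Assumption 5.
eps = rng.normal(size=n) * (0.5 + 2.0 * np.abs(x))
y = 1.0 + 2.0 * x + eps

model = sm.OLS(y, X)
classical = model.fit()                  # classical (Gauss-Markov) standard errors
robust = model.fit(cov_type="HC1")       # heteroskedasticity-consistent standard errors

print(classical.bse)   # tends to understate the slope's uncertainty here
print(robust.bse)      # typically larger and more reliable under heteroskedasticity
```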
What's Next? (Dealing with Assumptions)
We have now mastered the OLS formula, proved its properties, and seen the critical role of the five Gauss-Markov Assumptions.
In the next lesson, we move from theoretical proof to practical application: we will begin to explore the consequences and detection of the most common OLS assumption failures, starting with testing the significance of individual coefficients.