Lesson 4.5: The Classical Linear Model (CLM) Assumptions
This lesson forms the theoretical bedrock of the entire module. We will rigorously define the set of assumptions under which the Ordinary Least Squares (OLS) estimator is guaranteed to have desirable properties. These assumptions are the 'rules of the game' that, if met, allow us to prove that OLS is the 'Best Linear Unbiased Estimator' (BLUE) via the Gauss-Markov Theorem.
Part 1: The 'BLUE' Promise
We have derived the OLS estimator formula, $\hat{\beta} = (X'X)^{-1}X'y$. But why should we use this estimator over any other? The answer lies in the **Gauss-Markov Theorem**, which states that if a set of assumptions holds, OLS is **BLUE**.
Deconstructing 'BLUE'
BLUE is an acronym for the **Best Linear Unbiased Estimator**. It is the gold standard for estimators in econometrics.
- Best: 'Best' means having the minimum variance among all estimators in its class (here, all linear unbiased estimators). It is the most precise, or most efficient.
- Linear: The estimator is a linear function of the dependent variable, $y$. We can see this in the formula $\hat{\beta} = Ay$, where $A = (X'X)^{-1}X'$ depends only on $X$.
- Unbiased: The expected value of the estimator is the true parameter: $E[\hat{\beta}] = \beta$. On average, our estimates are correct.
- Estimator: It is a rule that tells us how to use data to estimate an unknown parameter.
The Gauss-Markov Theorem is a contract: if your model and data satisfy the following assumptions, then OLS is the best possible linear unbiased estimator. We will now detail these assumptions with extreme care.
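To make these properties concrete, here is a minimal NumPy sketch (the simulated data and parameter values are illustrative, not part of the lesson) that computes $\hat{\beta} = (X'X)^{-1}X'y$ directly and checks the 'unbiased' property by averaging the estimates over many simulated samples.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
beta_true = np.array([1.0, 2.0])   # intercept and slope (illustrative values)

def ols(X, y):
    # OLS via the normal equations: beta_hat = (X'X)^{-1} X'y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Monte Carlo check of unbiasedness: average the estimates over many samples
estimates = []
for _ in range(2000):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])    # design matrix with a constant column
    u = rng.normal(size=n)                  # error term satisfying E[u|X] = 0
    y = X @ beta_true + u
    estimates.append(ols(X, y))

print("true beta:      ", beta_true)
print("average of OLS: ", np.mean(estimates, axis=0))   # should be close to beta_true
```

Note that `np.linalg.solve` is used rather than explicitly forming $(X'X)^{-1}$; it is algebraically the same formula but numerically more stable.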
Part 2: The Gauss-Markov Assumptions (The Rules of the Game)
The following four assumptions are collectively known as the Gauss-Markov assumptions. They are sufficient to prove that OLS is BLUE.
Assumption 1: Linearity in Parameters
Mathematical Statement: The model can be written as $y = X\beta + u$; it is linear in the parameters $\beta$, with an additive error term $u$.
Explanation: This assumption dictates the functional form of the model. It means the parameters are simple coefficients and are not, for example, squared ($\beta_1^2$) or used as exponents ($x^{\beta_1}$). It does *not* mean the relationship must be a straight line. A model like $y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$ is still linear in parameters, even though it describes a parabola.
Consequence of Violation: If the true relationship is non-linear in parameters, the OLS estimator is fundamentally misspecified. It will be biased and inconsistent, describing a relationship that does not exist.
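As an illustration (a small simulated example with made-up coefficients), the parabola case can be estimated by ordinary OLS simply by adding a squared regressor, because each parameter still enters linearly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=300)
# The true relationship is a parabola, but the model is still linear in the parameters
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.5, size=300)

# Columns: constant, x, x^2 -- each parameter enters as a simple coefficient
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)   # approximately [1.0, 0.5, -2.0]
```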
Assumption 2: No Perfect Collinearity (Full Rank)
Mathematical Statement: The $n \times k$ design matrix $X$ has full column rank. This means $\text{rank}(X) = k$ and the matrix $X'X$ is invertible.
Explanation: This means none of the independent variables is a perfect linear combination of the others. There is no redundant information. For example, you cannot include a variable for "height in inches" and another variable for "height in centimeters" in the same model, as one is a perfect multiple of the other.
Consequence of Violation: If perfect multicollinearity exists, the inverse $(X'X)^{-1}$ cannot be computed. The OLS estimator is not uniquely defined. Statistical software will either throw an error or automatically drop one of the redundant variables.
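A quick numerical sketch of the height example (simulated data, assuming NumPy) shows how an exact linear relationship between regressors destroys the full-rank condition:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
height_in = rng.normal(68, 3, size=n)
height_cm = 2.54 * height_in                  # an exact linear function of height_in

X = np.column_stack([np.ones(n), height_in, height_cm])
print("columns:", X.shape[1], "rank:", np.linalg.matrix_rank(X))   # rank 2 < 3 columns

# Because rank(X) < k, X'X is (numerically) singular: its condition number is
# astronomically large, signalling that (X'X)^{-1} cannot be computed reliably
# and that beta is not uniquely determined.
print("condition number of X'X:", np.linalg.cond(X.T @ X))
```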
Assumption 3: Zero Conditional Mean (Strict Exogeneity)
Mathematical Statement: The conditional mean of the error term, given all values of the independent variables, is zero: $E[u \mid X] = 0$.
Explanation: This is the most important and most frequently violated assumption. It means that the unobserved factors that affect $y$ (and are contained in $u$) are, on average, uncorrelated with our observed predictors $X$. There is no systematic relationship between what we can see and what we can't see.
Consequence of Violation: If exogeneity is violated (a condition called **endogeneity**), the OLS estimator becomes **biased and inconsistent**. It will never converge to the true $\beta$, even with infinite data. The coefficients are meaningless and cannot be interpreted as causal effects. This is the most catastrophic failure of a regression model.
Common Cause: Omitted Variable Bias. If we model `wage` as a function of `education`, but omit the variable `innate ability`, this assumption is violated because `ability` is in the error term but is also correlated with `education`.
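The omitted-variable mechanism can be seen in a small simulation (all coefficient values below are illustrative, not estimates from real wage data): when `ability` is left out, it moves into the error term, which is then correlated with `education`, and the estimated return to education is biased.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
ability = rng.normal(size=n)
education = 12 + 2 * ability + rng.normal(size=n)        # education correlated with ability
wage = 5 + 1.0 * education + 3.0 * ability + rng.normal(size=n)

# Short regression: ability is omitted, so it sits in the error term
X_short = np.column_stack([np.ones(n), education])
b_short = np.linalg.solve(X_short.T @ X_short, X_short.T @ wage)

# Long regression: ability is included, restoring E[u|X] = 0
X_long = np.column_stack([np.ones(n), education, ability])
b_long = np.linalg.solve(X_long.T @ X_long, X_long.T @ wage)

print("return to education, ability omitted: ", b_short[1])   # biased well above 1.0
print("return to education, ability included:", b_long[1])    # close to the true value 1.0
```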
Assumption 4: Spherical Errors (Homoskedasticity and No Autocorrelation)
Mathematical Statement: The variance-covariance matrix of the errors, conditional on $X$, is a spherical error covariance matrix: $\text{Var}(u \mid X) = \sigma^2 I_n$.
Explanation: This assumption has two parts:
- Homoskedasticity: The diagonal elements are constant: $\text{Var}(u_i \mid X) = \sigma^2$ for all $i$. The amount of "noise" is the same for all observations.
- No Autocorrelation: The off-diagonal elements are zero: $\text{Cov}(u_i, u_j \mid X) = 0$ for $i \neq j$. The errors are uncorrelated with each other.
Consequence of Violation: If either part is violated (leading to heteroskedasticity or autocorrelation), the OLS estimator remains **unbiased and consistent**, but it is no longer **Best**. It is not the most efficient estimator. More importantly, the standard formula for the variance of the estimator, $\text{Var}(\hat{\beta} \mid X) = \sigma^2 (X'X)^{-1}$, is **incorrect**. This means all our standard errors, t-tests, and F-tests are invalid.
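As a sketch of the practical consequence (assuming the `statsmodels` package is available; the data-generating process is invented for illustration), the slope estimate below is still fine under heteroskedasticity, but the classical standard error differs noticeably from a heteroskedasticity-robust one:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1_000
x = rng.uniform(1, 5, size=n)
u = rng.normal(scale=x, size=n)        # error variance grows with x: heteroskedasticity
y = 2 + 0.5 * x + u

X = sm.add_constant(x)
classical = sm.OLS(y, X).fit()                 # uses Var(b|X) = s^2 (X'X)^{-1}
robust = sm.OLS(y, X).fit(cov_type="HC1")      # heteroskedasticity-robust (White) SEs

print("slope estimate:       ", classical.params[1])  # still unbiased and consistent
print("classical std. error: ", classical.bse[1])     # based on the wrong formula here
print("robust std. error:    ", robust.bse[1])        # valid under heteroskedasticity
```

The point estimate barely changes between the two fits; only the measure of its precision does, which is exactly the sense in which the violation "invalidates our standard errors" rather than the coefficients.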
Part 3: The Normality Assumption (For Inference)
There is one final assumption that is not required for the Gauss-Markov Theorem, but is necessary for exact finite-sample inference.
Assumption 5: Normality of the Errors
Mathematical Statement: The error terms, conditional on $X$, are normally distributed: $u \mid X \sim N(0, \sigma^2 I_n)$.
Explanation: This assumption combines the exogeneity and spherical error assumptions and adds that the distribution of the unobserved factors follows a bell curve.
Consequence: If this assumption holds, then the OLS estimator is also normally distributed in finite samples. This means our t-tests and F-tests are **exact**, not just asymptotically valid. Without this assumption, we must rely on the Central Limit Theorem and large sample sizes for our inference to be approximately correct.
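A brief Monte Carlo sketch (simulated data, assuming NumPy and SciPy; the sample size and coefficients are illustrative) of what "exact" means: with normal errors and a deliberately tiny sample, a 5% t-test of a true null hypothesis rejects roughly 5% of the time, just as the t distribution promises, with no appeal to the Central Limit Theorem.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps, alpha = 10, 5_000, 0.05                 # deliberately tiny sample
crit = stats.t.ppf(1 - alpha / 2, df=n - 2)      # two-sided critical value

rejections = 0
for _ in range(reps):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + 0.0 * x + rng.normal(size=n)       # true slope is zero, errors are normal
    b = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ b
    s2 = resid @ resid / (n - 2)                 # unbiased estimate of sigma^2
    var_b = s2 * np.linalg.inv(X.T @ X)          # classical variance formula
    t_stat = b[1] / np.sqrt(var_b[1, 1])
    rejections += abs(t_stat) > crit

print("empirical size of the 5% t-test:", rejections / reps)   # close to 0.05 even with n = 10
```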
Summary: What the Assumptions Buy Us
- If Assumptions 1-4 hold (the Gauss-Markov conditions), then OLS is the **Best Linear Unbiased Estimator (BLUE)**.
- If Assumptions 1-5 hold (the full CLM), then our t-statistics and F-statistics have exact t and F distributions, respectively, in any sample size.
- Violating Assumption 3 (Exogeneity) is the most severe failure, rendering our coefficient estimates meaningless.
- Violating Assumption 4 (Homoskedasticity/No Autocorrelation) is less severe but invalidates our standard errors and statistical tests, requiring correction.
What's Next? Proving the Promise
We have laid out the complete set of rules for the Classical Linear Model. We have stated that if these rules are followed, OLS is the "Best Linear Unbiased Estimator."
But a statement is not a proof. In the next lesson, we will undertake the formal, 'no-skip' mathematical proof of the **Gauss-Markov Theorem**, demonstrating with matrix algebra that under these assumptions, no other linear unbiased estimator can have a smaller variance than OLS.