Lesson 2.7: The Assumptions of Linear Models
We've built and evaluated our models. Now, we become detectives. This lesson covers the essential diagnostic checks for linear models, teaching you to identify and fix common problems like multicollinearity, heteroskedasticity, and non-normality of residuals.
Part 1: The 'Health Check' for Your Model
The OLS (Ordinary Least Squares) estimator is remarkably robust. However, for it to be the **Best Linear Unbiased Estimator (BLUE)** and for our t-tests and confidence intervals to be reliable, a set of assumptions must hold. Violating them doesn't necessarily mean your model is useless, but it does mean you can't trust its standard errors or p-values without correction.
We will focus on three key diagnostic checks that every quant must perform.
Part 2: Multicollinearity - The Redundancy Problem
The 'Two Tour Guides' Problem
Multicollinearity occurs when two or more of your predictor variables are highly correlated with each other. They are essentially telling the same story. If two tour guides stand next to each other and say the same thing, you can't tell which one was more helpful.
Consequence: The OLS model can't disentangle the individual effect of each variable. This leads to **unstable coefficients** that can swing wildly and have **inflated standard errors**, making them appear statistically insignificant even when they are important.
Diagnosis: Variance Inflation Factor (VIF). The VIF for predictor $X_j$ is calculated as $\text{VIF}_j = \frac{1}{1 - R_j^2}$, where $R_j^2$ is the R-squared from regressing $X_j$ on all the other predictors (a worked sketch follows the thresholds below).
- A VIF of 1 means no correlation.
- A VIF between 1 and 5 is generally considered moderate and acceptable.
- A VIF greater than 5 or 10 indicates high multicollinearity.
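To make the formula concrete, here is a minimal sketch that computes the VIF for a single predictor by running the auxiliary regression yourself. The DataFrame `X` and the column name `'feature_1'` are illustrative assumptions, not names from the course code.

```python
# Minimal sketch: VIF_j = 1 / (1 - R_j^2), computed by hand for one predictor.
# Assumes `X` is a pandas DataFrame of numeric features; 'feature_1' is a
# hypothetical column name used for illustration.
import statsmodels.api as sm

target_col = "feature_1"
others = sm.add_constant(X.drop(columns=[target_col]))  # auxiliary design matrix
aux_r2 = sm.OLS(X[target_col], others).fit().rsquared   # R_j^2 from the auxiliary regression
vif_manual = 1.0 / (1.0 - aux_r2)
print(f"VIF for {target_col}: {vif_manual:.2f}")  # e.g. R_j^2 = 0.9 gives VIF = 10
```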
Correction:
- Drop one variable: If two variables are highly correlated, remove one.
- Combine variables: Create a new interaction or ratio feature.
- Use Regularization: Ridge (L2) regression is specifically designed to handle multicollinearity (see the sketch after this list).
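As a hedged illustration of the regularization route, here is a minimal Ridge sketch using scikit-learn; `X_train` and `y_train` are assumed to exist from earlier lessons, and `alpha=1.0` is an arbitrary example value, not a recommendation.

```python
# Minimal sketch: Ridge (L2) regression as a remedy for multicollinearity.
# Assumes `X_train` / `y_train` already exist; alpha=1.0 is illustrative only.
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize first: the L2 penalty is sensitive to feature scale.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
ridge.fit(X_train, y_train)
print(ridge.named_steps["ridge"].coef_)  # shrunken, more stable coefficients
```

For the diagnosis itself, the `statsmodels` helper below computes the VIF for every column in one pass: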
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assuming 'X' is your pandas DataFrame of features
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
```

Part 3: Heteroskedasticity - The Inconsistent Noise Problem
The 'Fan-Shaped' Residuals
Heteroskedasticity occurs when the variance of the error terms is not constant across all observations. This is a common violation of the classical linear model (CLM) assumptions.
Analogy: Imagine predicting food spending based on income. The error in your prediction for low-income individuals might be small (e.g., ±$50), but the error for high-income individuals could be huge (e.g., ±$2,000). The "noise" is not consistent. A plot of residuals versus predicted values will show a "fan" or "cone" shape.
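To see the fan shape for yourself, a quick residuals-vs-fitted scatter plot is usually enough. This sketch assumes a fitted `statsmodels` results object named `model`, like the one produced by the code later in this part.

```python
# Minimal sketch: residuals-vs-fitted plot to eyeball heteroskedasticity.
# Assumes a fitted statsmodels OLS results object named `model`.
import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Fitted (look for a fan/cone shape)")
plt.show()
```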
Consequence: The OLS coefficients are still unbiased, but the standard errors are wrong. Without correction, your t-tests and p-values are invalid.
Diagnosis: Breusch-Pagan Test. This is a formal test where the null hypothesis is homoskedasticity (constant variance). A low p-value (< 0.05) means you reject the null and conclude heteroskedasticity is present.
Correction: Use Robust Standard Errors. This is the most common and practical solution. Instead of changing the model, you just use a different formula to calculate the standard errors that is consistent even in the presence of heteroskedasticity. In `statsmodels`, this is done by specifying `cov_type='HC1'` (or HC0, HC3, etc.) when fitting the model.
```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit a standard OLS model
X_with_const = sm.add_constant(X_train)
model = sm.OLS(y_train, X_with_const).fit()

# Perform the Breusch-Pagan test on the residuals
bp_test = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan p-value: {bp_test[1]}")

# If p-value is low, re-fit the model with robust standard errors
robust_model = sm.OLS(y_train, X_with_const).fit(cov_type='HC1')
print(robust_model.summary())
```

Part 4: Non-Normality of Residuals
The 'Bell Curve' of Errors
The classical linear model assumes that the error terms ($\epsilon$) are normally distributed. This assumption is not required for OLS coefficients to be unbiased, but it is technically required for the t-tests and F-tests to be valid in small samples.
Consequence: In small samples, if the residuals are heavily non-normal (e.g., very skewed), the p-values from the t-tests may not be reliable. However, thanks to the **Central Limit Theorem**, as the sample size gets larger (typically n > 30 or 50), the distribution of the coefficients will be approximately normal anyway, making this assumption less critical in practice with large datasets.
Diagnosis: Jarque-Bera Test. This test checks if the skewness and kurtosis of the residuals match those of a normal distribution. The null hypothesis is that the residuals are normally distributed. A low p-value indicates non-normality.
Correction:
- If the sample is large, you may not need to do anything.
- Try transforming the dependent variable (e.g., taking the log of Y); see the sketch after this list.
- Consider a different type of model (e.g., Generalized Linear Models) that doesn't assume normal errors.
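If you go the transformation route, a minimal sketch might look like the following; it assumes `y_train` is strictly positive and reuses `X_with_const` from the Part 3 code. Remember that the coefficients of the refit model are now on the log scale, so interpret them accordingly.

```python
# Minimal sketch: refit on a log-transformed target, then re-check the residuals.
# Assumes y_train > 0 everywhere (np.log1p is an alternative if it can contain
# zeros) and that X_with_const was built earlier with sm.add_constant.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import jarque_bera

log_model = sm.OLS(np.log(y_train), X_with_const).fit()
print(f"Jarque-Bera p-value after log transform: {jarque_bera(log_model.resid)[1]:.4f}")
```

The Jarque-Bera test on the original model's residuals is just as short: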
```python
from statsmodels.stats.stattools import jarque_bera

# Use the residuals from the fitted model
jb_test = jarque_bera(model.resid)
print(f"Jarque-Bera p-value: {jb_test[1]}")
```

What's Next? The Capstone Project
You are now a fully equipped detective for linear models. You know how to build them, evaluate them, and diagnose their most common illnesses.
It's time to put all of this knowledge together. In the final lesson of Module 2, we will tackle a realistic quantitative finance problem from start to finish: **Building a Credit Default Predictor** using Logistic Regression.