Lesson 2.3: Taming Overfitting: The Math of Regularization (Ridge & Lasso)

We've built models that learn from data. But what if they learn too well? This lesson introduces Regularization, one of the most important techniques for preventing overfitting. We'll explore how adding a 'penalty' to our loss function, specifically Ridge (L2) and Lasso (L1), forces our models to be simpler, more robust, and better at predicting the future.

Part 1: The Problem - When a Model 'Memorizes' the Data

Imagine you have a complex dataset with many predictors. A flexible model, like a high-degree polynomial regression, can find a curve that passes through *every single training data point perfectly*. Its error on the training data is zero. It seems like the perfect model.

This is the classic sign of **overfitting**. The model has not learned the true underlying signal; it has simply memorized the random noise in the training data. When you show it new, unseen data, its predictions will be wildly inaccurate. The model has high variance.
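A minimal sketch of this failure mode (the degree, sample sizes, and noise level here are arbitrary choices for illustration): a degree-11 polynomial has 12 parameters, enough to hit all 12 training points, so its training error collapses while its test error does not.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# True signal is just y = x; everything else is noise
x_train = rng.uniform(-1, 1, size=12).reshape(-1, 1)
y_train = x_train.ravel() + rng.normal(scale=0.2, size=12)
x_test = rng.uniform(-1, 1, size=100).reshape(-1, 1)
y_test = x_test.ravel() + rng.normal(scale=0.2, size=100)

# A degree-11 polynomial can pass through all 12 training points
model = make_pipeline(PolynomialFeatures(degree=11), LinearRegression())
model.fit(x_train, y_train)

print("Train MSE:", mean_squared_error(y_train, model.predict(x_train)))
print("Test MSE: ", mean_squared_error(y_test, model.predict(x_test)))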

The Core Analogy: The Overly Complex Story

Imagine explaining to a child why a specific stock went up.

  • A Simple (Good) Model: "The company announced great earnings." This captures the main signal.
  • An Overfit (Bad) Model: "The company announced great earnings, AND the CEO was wearing a blue tie, AND it was a Tuesday, AND the weather in London was cloudy..."

The overfit model is not wrong—all those things were true—but it's useless for prediction. The "blue tie" factor was just random noise, not a real signal. The model has become too complex and has assigned importance to irrelevant details.

Regularization is a technique that forces our model to be simpler. It's like telling the model, "You have a limited 'budget' for complexity. You must focus only on the most important factors."

Part 2: The 'How' - Penalizing Complexity

How do we force a model to be simpler? We change its objective. We add a **penalty term** to the loss function it's trying to minimize.

The New Objective Function

Our old OLS objective was simple: "Find the betas that minimize the error."

\text{Minimize: } \text{Loss} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \text{SSR}

Our new regularized objective is a tradeoff: "Find the betas that minimize the error, **AND** keep the beta coefficients themselves small."

The Regularized Loss Function

\text{Minimize: } \text{Loss} = \text{SSR} + \text{Penalty}

The model now has two competing goals. To make the total Loss small, it must find a delicate balance between fitting the data well (low SSR) and keeping its own coefficients simple (low Penalty).

The "Penalty" term is what defines the type of regularization. The two most famous penalties are the L2 norm (Ridge) and the L1 norm (Lasso).

Part 3: Ridge Regression (L2 Regularization) - The 'Shrinker'

Ridge Regression penalizes the model based on the **sum of the squared beta coefficients**.

The Ridge (L2) Loss Function

\text{Minimize: } \sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2 + \lambda \sum_{j=1}^k \beta_j^2

The term \sum_{j=1}^k \beta_j^2 is the squared L2-norm of the coefficient vector.

  • \lambda (lambda) is the **regularization parameter**. It's a hyperparameter we choose that controls the strength of the penalty.
  • If \lambda = 0, the penalty is zero, and Ridge Regression is just OLS.
  • If \lambda is very large, the model is forced to make the beta coefficients very small to minimize the loss, even if it means a worse fit to the data.

The Effect of Ridge Regression

Ridge Regression **shrinks** the beta coefficients towards zero. If two predictors are highly correlated, Ridge will shrink both of their coefficients, effectively giving them shared credit. It reduces the impact of multicollinearity.

Important: Ridge can shrink coefficients arbitrarily close to zero, but it will never set them *exactly* to zero. It is good for reducing model complexity but not for feature selection.
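A small sketch of this behavior (the alpha values are chosen only to show the trend; scikit-learn calls the regularization strength `alpha`): as the penalty grows, every coefficient moves toward zero, yet none lands exactly on it.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

for alpha in [0.01, 1, 100, 10000]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    # All coefficients shrink toward zero, but none is exactly zero
    print(f"alpha={alpha:>7}: smallest |coef| = {np.abs(coefs).min():.6f}")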

Part 4: Lasso Regression (L1 Regularization) - The 'Selector'

Lasso Regression takes a slightly different approach. It penalizes the model based on the **sum of the absolute values of the beta coefficients**.

The Lasso (L1) Loss Function

\text{Minimize: } \sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2 + \lambda \sum_{j=1}^k |\beta_j|

The term \sum_{j=1}^k |\beta_j| is the L1-norm of the coefficient vector.

The Effect of Lasso Regression

This small change from a squared penalty (\beta^2) to an absolute-value penalty (|\beta|) has a profound effect. The L1 penalty is able to shrink some coefficients **all the way to exactly zero**.

This means Lasso performs **automatic feature selection**. It literally removes irrelevant variables from the model, creating a simpler, more interpretable result. If two variables are highly correlated, Lasso will tend to pick one and set the other's coefficient to zero.
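A minimal sketch of this selection effect (the alpha grid is illustrative, not tuned): half of the generated features carry no signal, and as the penalty grows Lasso zeroes more of them out.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler

# Half of the 20 features carry no signal at all
X, y = make_regression(n_samples=100, n_features=20, n_informative=10,
                       noise=10, random_state=0)
X = StandardScaler().fit_transform(X)

for alpha in [0.1, 1, 10]:
    n_zero = np.sum(Lasso(alpha=alpha).fit(X, y).coef_ == 0)
    print(f"alpha={alpha:>4}: {n_zero} of 20 coefficients are exactly zero")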

Part 5: Practical Implementation and Considerations

Key Practical Steps

  1. Feature Scaling is Mandatory: Before applying regularization, you **must** standardize your features (e.g., using `StandardScaler` from Lesson 1.3). The penalty treats all coefficients equally, but a coefficient's size depends on its feature's scale: a feature measured in large units gets a tiny coefficient and is barely penalized, while the same feature in small units gets a large coefficient and is penalized heavily. Standardizing puts every feature on equal footing.
  2. Choosing Lambda (\lambda): The regularization strength \lambda is a critical hyperparameter. You cannot know the best value in advance. It must be found using **cross-validation**: you test a range of \lambda values and choose the one that gives the best performance on your validation set.
  3. Elastic Net: A popular hybrid approach called Elastic Net combines both L1 and L2 penalties, offering the best of both worlds (see the sketch after the code below).

The example below puts steps 1 and 2 together: scale the features, then use cross-validation to tune the regularization strength (called `alpha` in scikit-learn) for both Ridge and Lasso.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression

# 1. Generate some data with many features, some irrelevant
X, y = make_regression(n_samples=200, n_features=20, n_informative=10, noise=25, random_state=42)

# 2. Split and Scale Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- 3. Use GridSearchCV to find the best lambda (alpha in scikit-learn) ---

# For Ridge (L2)
ridge = Ridge()
ridge_params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
ridge_cv = GridSearchCV(ridge, ridge_params, cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best Ridge Alpha: {ridge_cv.best_params_['alpha']}")

# For Lasso (L1)
lasso = Lasso()
lasso_params = {'alpha': [0.1, 0.5, 1, 5, 10]}
lasso_cv = GridSearchCV(lasso, lasso_params, cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best Lasso Alpha: {lasso_cv.best_params_['alpha']}")


# --- 4. Analyze the resulting coefficients ---
best_lasso = lasso_cv.best_estimator_
num_zero_coeffs = np.sum(best_lasso.coef_ == 0)
print(f"\nLasso resulted in {num_zero_coeffs} out of 20 coefficients being set to zero.")

What's Next? The Geometric View

We've seen the "what" and "how" of regularization. But *why* does the L1 penalty perform feature selection while the L2 penalty only shrinks?

The answer is not algebraic; it's **geometric**. It has to do with the shape of the "penalty budget" that each method imposes on the coefficients.
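A standard way to make that budget explicit: each penalized problem is equivalent to a constrained one, where for every \lambda there is a corresponding budget t on the coefficients. In the notation used above:

\min_{\bm{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2 \quad \text{subject to} \quad \sum_{j=1}^k \beta_j^2 \le t \quad \text{(Ridge: a circular budget)}

\min_{\bm{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2 \quad \text{subject to} \quad \sum_{j=1}^k |\beta_j| \le t \quad \text{(Lasso: a diamond-shaped budget)}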

In the next lesson, we will do a deep dive into the geometry of L1 vs. L2 regularization, visualizing how the diamond shape of the Lasso penalty is what forces coefficients onto the axes (i.e., to zero).