Lesson 2.4: The Geometry of Feature Selection (L1 vs. L2)
In our last lesson, we learned the 'what': Lasso (L1) performs feature selection, while Ridge (L2) only shrinks coefficients. This lesson is about the 'why.' The reason is not algebraic; it is purely geometric. We'll visualize the optimization process to see why the 'diamond' shape of the L1 penalty is the secret to creating sparse models.
Part 1: The Puzzle - Why Does Absolute Value Create Zeros?
This is one of the most beautiful and non-obvious results in machine learning. On the surface, the difference between the Ridge and Lasso penalties seems minor:
- Ridge Penalty: $\lambda \sum_{j=1}^{p} \beta_j^2$ (the sum of the *squared* coefficients)
- Lasso Penalty: $\lambda \sum_{j=1}^{p} |\beta_j|$ (the sum of the *absolute values* of the coefficients)
Why does using an absolute value ($|\beta_j|$) instead of a square ($\beta_j^2$) give Lasso the magical ability to force coefficients to be exactly zero? The answer has nothing to do with algebra and everything to do with geometry.
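Before diving into the geometry, here is a minimal sketch of the behavioral difference using scikit-learn. The synthetic dataset, the `alpha` values, and the random seed are illustrative assumptions, not part of the lesson; the point is simply that Lasso zeroes out irrelevant coefficients while Ridge only shrinks them.

```python
# Minimal sketch (assumed synthetic data and alpha values): Lasso sets some
# coefficients to exactly zero, Ridge only shrinks them toward zero.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                        # 10 candidate features
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)    # only 3 actually matter
y = X @ true_coef + rng.normal(scale=1.0, size=200)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", np.round(ridge.coef_, 3))   # small, but non-zero
print("Lasso coefficients:", np.round(lasso.coef_, 3))   # several are exactly 0
print("Exact zeros (Ridge):", int(np.sum(ridge.coef_ == 0)))
print("Exact zeros (Lasso):", int(np.sum(lasso.coef_ == 0)))
```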
Part 2: Reframing the Problem - Constrained Optimization
To understand the geometry, we must reframe the problem. Minimizing the regularized loss function is mathematically equivalent to solving a **constrained optimization** problem.
Instead of minimizing $\text{SSR} + \text{Penalty}$, we can think of it as:
"Minimize the Sum of Squared Residuals (SSR), SUBJECT TO the constraint that the total 'size' of your coefficients is less than or equal to some budget $t$."
The "size" is defined by the penalty term. This gives us two different problems:
- **Ridge:** For two coefficients ($\beta_1, \beta_2$), the constraint $\beta_1^2 + \beta_2^2 \le t$ is bounded by a **circle**. The solution must lie inside or on this circle.
- **Lasso:** For two coefficients, the constraint $|\beta_1| + |\beta_2| \le t$ is bounded by a **diamond**. The solution must lie inside or on this diamond.
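Written out explicitly (standard notation, with $t$ denoting the coefficient budget), the two constrained problems are:

$$
\begin{aligned}
\text{Ridge:} \quad & \min_{\beta}\; \text{SSR}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t \\
\text{Lasso:} \quad & \min_{\beta}\; \text{SSR}(\beta) \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t
\end{aligned}
$$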
Part 3: The 'Aha!' Moment - Visualizing the Solution
Now, let's visualize the optimization. We have two sets of shapes:
- The Loss Contours (Ellipses): The Sum of Squared Residuals (SSR) forms a set of concentric elliptical contour lines. The center of these ellipses is the unconstrained OLS solution ($\hat{\beta}_{OLS}$). Our goal is to find the point on the lowest possible ellipse.
- The Constraint Region (The "Budget"): The circle (Ridge) or diamond (Lasso) that our solution is not allowed to leave.
The optimal solution to a constrained optimization problem is the **first point where the expanding loss contours touch the constraint region**.
Imagine two diagrams side-by-side. Both have elliptical contours for the loss function.
**Left (Ridge):** A circular constraint boundary. The ellipse touches the circle at a point where both β₁ and β₂ are non-zero.
**Right (Lasso):** A diamond constraint boundary. The ellipse is much more likely to touch one of the sharp corners of the diamond, where one of the coefficients (e.g., β₁) is exactly zero.
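The two panels described above can be sketched with matplotlib. Everything numeric here is an illustrative assumption: the quadratic loss surface, the assumed OLS solution at (2.0, 1.5), and the budget t = 1.0 are made up purely to reproduce the picture.

```python
# Rough sketch of the two diagrams: elliptical SSR contours plus a circular
# (Ridge) and a diamond (Lasso) constraint region. All values are illustrative.
import numpy as np
import matplotlib.pyplot as plt

# Grid of candidate coefficient pairs (beta1, beta2)
b1, b2 = np.meshgrid(np.linspace(-2, 4, 400), np.linspace(-2, 4, 400))

# A made-up positive-definite quadratic loss centered at an assumed OLS solution
ols = (2.0, 1.5)
loss = 2.0 * (b1 - ols[0]) ** 2 + 1.0 * (b2 - ols[1]) ** 2 \
       + 1.2 * (b1 - ols[0]) * (b2 - ols[1])

t = 1.0  # coefficient "budget"
fig, axes = plt.subplots(1, 2, figsize=(10, 5), sharex=True, sharey=True)

for ax, title in zip(axes, ["Ridge: circular constraint", "Lasso: diamond constraint"]):
    ax.contour(b1, b2, loss, levels=12, cmap="viridis")   # SSR ellipses
    ax.plot(*ols, "rx", label="OLS solution")
    ax.axhline(0, color="gray", lw=0.5)
    ax.axvline(0, color="gray", lw=0.5)
    ax.set_title(title)
    ax.set_xlabel(r"$\beta_1$")
    ax.legend(loc="upper left")

# Circle boundary: beta1^2 + beta2^2 = t  (radius sqrt(t))
theta = np.linspace(0, 2 * np.pi, 200)
axes[0].plot(np.sqrt(t) * np.cos(theta), np.sqrt(t) * np.sin(theta), "b-")

# Diamond boundary: |beta1| + |beta2| = t  (vertices on the axes)
diamond = np.array([[t, 0], [0, t], [-t, 0], [0, -t], [t, 0]])
axes[1].plot(diamond[:, 0], diamond[:, 1], "b-")

axes[0].set_ylabel(r"$\beta_2$")
plt.tight_layout()
plt.show()
```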
The Geometric Insight
- Ridge (The Circle): Because the circle has no sharp corners, the ellipses will almost always touch it at a point where *neither* coefficient is zero. The solution is on a smooth part of the boundary. Ridge shrinks both coefficients, but it's statistically very unlikely to shrink one to exactly zero.
- Lasso (The Diamond): The diamond has sharp corners located directly on the axes. As the SSR ellipses expand, they are highly likely to make contact with the constraint region at one of these corners. A point on an axis is a point where one of the coefficients is **exactly zero**. This is the geometric reason why Lasso performs feature selection!
Part 4: The Bayesian Interpretation - Priors on Our Parameters
There is another, deeper way to understand this difference, using Bayesian statistics. Regularization is mathematically equivalent to placing a **prior probability distribution** on our coefficients. This "prior" represents our belief about the coefficients *before* we've even seen the data.
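Concretely, this equivalence comes from maximum a posteriori (MAP) estimation: maximizing the posterior is the same as minimizing the negative log-posterior, and the negative log of the prior becomes the penalty term. Sketched in generic notation (with $\tau$ and $b$ standing in for the prior scale parameters, an assumed notation rather than the lesson's):

$$
\begin{aligned}
\hat{\beta}_{\text{MAP}} &= \arg\max_{\beta}\; p(\beta \mid \text{data})
  = \arg\max_{\beta}\; p(\text{data} \mid \beta)\, p(\beta) \\
  &= \arg\min_{\beta}\; \Big[ \underbrace{-\log p(\text{data} \mid \beta)}_{\propto\,\text{SSR for Gaussian noise}} \;+\; \underbrace{\big(-\log p(\beta)\big)}_{\text{penalty}} \Big] \\
\text{Gaussian prior:}\quad & -\log p(\beta) = \frac{1}{2\tau^2}\sum_{j} \beta_j^2 + \text{const} \;\Longrightarrow\; \text{Ridge penalty} \\
\text{Laplace prior:}\quad & -\log p(\beta) = \frac{1}{b}\sum_{j} |\beta_j| + \text{const} \;\Longrightarrow\; \text{Lasso penalty}
\end{aligned}
$$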
Using Ridge regression is equivalent to assuming a **Gaussian (Normal) prior** on each coefficient, centered at zero.
A bell curve peaks at zero, but its peak is smooth and rounded. It believes coefficients are probably small, but it doesn't place any special belief on a coefficient being *exactly* zero.
Using Lasso regression is equivalent to assuming a **Laplacian prior** on each coefficient.
A Laplace distribution has a very sharp, pointy peak at zero. It expresses a much stronger prior belief that many coefficients are likely to be *exactly* zero. The model then needs very strong evidence from the data to "pull" a coefficient away from this zero-point peak.
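A quick way to see the difference in shape is to plot the two priors side by side. The scale parameters below are chosen only for illustration.

```python
# Plot sketch (illustrative scales): the Gaussian prior has a smooth, rounded
# peak at zero, while the Laplace prior has a sharp point at exactly zero.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, laplace

beta = np.linspace(-4, 4, 500)

plt.plot(beta, norm.pdf(beta, loc=0, scale=1.0), label="Gaussian prior (Ridge)")
plt.plot(beta, laplace.pdf(beta, loc=0, scale=1.0), label="Laplace prior (Lasso)")
plt.xlabel(r"coefficient value $\beta_j$")
plt.ylabel("prior density")
plt.legend()
plt.title("The Laplace prior concentrates belief sharply at zero")
plt.show()
```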
What's Next? Moving from Regression to Classification
We have now completed our deep dive into linear regression and the methods for making it more robust and powerful. We understand how to fit a line, interpret its coefficients, and prevent it from overfitting.
But all of this has been for predicting a *continuous* numerical value (like a stock price or a salary). What if our goal is to predict a *category*? (e.g., "Will this loan default: Yes or No?", "Is this email Spam or Not Spam?").
In the next lesson, we will adapt our linear model for this new task. We will introduce the **Sigmoid function** and build our first classification model: **Logistic Regression**.