Lesson 3.5: Polynomial Regression: Bending the Line

Linear models are powerful but have a major limitation: they can only find straight lines. This lesson introduces a clever trick to make our simple linear model fit complex, curved data by creating new 'polynomial' features. It's the bridge between linear and non-linear modeling.

Part 1: The Problem of Non-Linearity

Imagine you have data on a company's marketing spend and its resulting sales. At first, more spending leads to more sales, but eventually, the effect levels off (diminishing returns). The relationship is not a straight line; it's a curve.

If we try to fit a Simple Linear Regression model to this data, it will underfit badly. The straight line cuts through the curve, sitting too high in some places and too low in others. The model has high bias because its assumption of linearity is wrong.

Imagine a scatter plot of points that rise steeply and then level off, like the marketing example above. A straight regression line drawn through them clearly misses the curvature.

Part 2: The Solution - Creating New Features

The core idea of polynomial regression is surprisingly simple. Instead of changing our model, we change our **data**. We can create new features by taking our original feature and raising it to different powers.

The Core Idea: Feature Engineering

If our original model with one feature, X, is:

\hat{Y} = \beta_0 + \beta_1 X

We can create a new feature, X^2. Now, we can fit a **Multiple Linear Regression** model that looks like this:

\hat{Y} = \beta_0 + \beta_1 X + \beta_2 X^2

This is a quadratic equation, which describes a parabola. Even though the *relationship* between X and Y is now non-linear, the model is still **linear in its parameters** (\beta_0, \beta_1, \beta_2). This means we can still use all our familiar OLS machinery to estimate the coefficients!
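To make "linear in the parameters" concrete, here is a minimal sketch (using made-up synthetic data, not the marketing example) that builds the X^2 column by hand and estimates the three coefficients with plain ordinary least squares in NumPy:

import numpy as np

# Made-up 1-D data: y is quadratic in x plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 3 * x**2 + x + 1 + 0.1 * rng.standard_normal(50)

# Design matrix [1, X, X^2]: the squared column is just another feature
# as far as OLS is concerned
X_design = np.column_stack([np.ones_like(x), x, x**2])

# Solve for (beta_0, beta_1, beta_2) directly with ordinary least squares
betas, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(betas)  # should land close to [1, 1, 3] for this synthetic data

The fitted coefficients recover the curve even though the solver is the same one we would use for any multiple linear regression.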

Part 3: Python Implementation

Scikit-learn makes this process incredibly easy with the `PolynomialFeatures` transformer.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# 1. Create some non-linear sample data
np.random.seed(42)
n_samples = 100
X = 2 * np.random.rand(n_samples, 1) - 1 # X from -1 to 1
y = 3 * X**2 + X + 1 + np.random.randn(n_samples, 1) # y = 3x^2 + x + 1 + noise

# 2. Fit a simple linear model (it will fail)
lin_reg = LinearRegression()
lin_reg.fit(X, y)

# 3. Fit a polynomial regression model
# We'll create a 'pipeline' that first creates polynomial features,
# then scales them, then fits the linear model.
poly_reg = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LinearRegression()
)
poly_reg.fit(X, y)

# 4. Visualize the results
X_new = np.linspace(-1, 1, 100).reshape(-1, 1)
y_lin_pred = lin_reg.predict(X_new)
y_poly_pred = poly_reg.predict(X_new)

plt.figure(figsize=(10, 6))
plt.scatter(X, y, label='Data')
plt.plot(X_new, y_lin_pred, 'r-', label='Linear Fit')
plt.plot(X_new, y_poly_pred, 'g-', label='Polynomial (degree=2) Fit')
plt.title('Linear vs. Polynomial Regression')
plt.legend()
plt.show()
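One small addition (assuming the variables from the listing above are still in scope) is to compare the two models' in-sample R² scores, which quantifies what the plot shows:

# The linear model misses the curvature, so its R^2 should be
# noticeably lower than the quadratic pipeline's
print(f"Linear R^2:     {lin_reg.score(X, y):.3f}")
print(f"Polynomial R^2: {poly_reg.score(X, y):.3f}")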

Part 4: The Danger - Overfitting with High Degrees

Polynomial regression is powerful, but it's easy to get carried away. What happens if we choose a very high degree, like `degree=20`?

The model will have so much flexibility that it will "wiggle" to try and pass through every single data point. It will have very low bias but extremely high variance. It will perfectly fit the training data but fail catastrophically on new, unseen data.
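As a rough sketch of that failure mode (reusing X and y from the listing in Part 3), you can hold out a test set, fit a degree-20 pipeline, and compare training and test scores; the exact numbers depend on the random split, but the training score is typically much closer to perfect than the test score:

from sklearn.model_selection import train_test_split

# Hold out 30% of the data so the train/test gap is visible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

high_degree = make_pipeline(
    PolynomialFeatures(degree=20, include_bias=False),
    StandardScaler(),
    LinearRegression()
)
high_degree.fit(X_train, y_train)

# A large gap between these two scores is the signature of overfitting
print(f"Train R^2: {high_degree.score(X_train, y_train):.3f}")
print(f"Test  R^2: {high_degree.score(X_test, y_test):.3f}")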

The Bias-Variance Tradeoff in Action
  • Low Degree (e.g., 1): High Bias, Low Variance (Underfitting).
  • High Degree (e.g., 20): Low Bias, High Variance (Overfitting).
  • "Just Right" Degree (e.g., 2): The sweet spot with the best performance on the test set.

The optimal degree is a hyperparameter that must be chosen using a validation set or cross-validation.
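As a sketch of that selection process (again reusing X and y from Part 3), you could compare cross-validated R² across a handful of candidate degrees and keep the one with the best mean score:

from sklearn.model_selection import cross_val_score

# Score one pipeline per candidate degree with 5-fold cross-validation
for degree in [1, 2, 3, 5, 10, 20]:
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        LinearRegression()
    )
    scores = cross_val_score(model, X, y.ravel(), cv=5, scoring="r2")
    print(f"degree={degree:2d}  mean CV R^2 = {scores.mean():.3f}")

With the quadratic data generated above, degree 2 should score at or near the top, while the highest degrees tend to do worse on held-out folds despite fitting the training data more closely.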

What's Next? Finding the Optimal Boundary

We've now seen how to "bend" our linear models to fit curved data. This is a powerful form of feature engineering.

But this raises a new question for classification. Logistic Regression draws a straight-line (linear) decision boundary. What if the true boundary between our two classes is a circle or a complex curve? We could add polynomial features, but there's a more elegant, geometric approach.

In the next lesson, we will introduce **Support Vector Machines (SVMs)**, a powerful algorithm that thinks about classification not as fitting a line, but as finding the "widest possible street" between two classes of data.