Lesson 1.5: Our First Continuous Model - The Anatomy of Simple Linear Regression
We now move from classification to regression. In this deep dive, we will build our first model to predict a continuous value. We'll explore the elegant math of fitting a line—using both calculus and linear algebra—and implement it in Python to see how a model truly 'learns' the line of best fit.
Part 1: The Goal - Finding the Trend
Imagine we have a simple dataset from our quant fund: the years of experience for several analysts and their corresponding annual salary. We plot this data on a scatter plot.
Imagine a scatter plot. As 'Years of Experience' (x-axis) increases, 'Salary' (y-axis) also tends to increase. The points form a rough, upward-sloping cloud.
Our eyes can naturally see a trend in this data. The goal of **Simple Linear Regression** is to formally capture this trend by finding the **single straight line** that best represents the relationship between experience and salary.
Part 2: The Language - The Mathematics of a Line
A straight line is defined by a simple, elegant equation that should be familiar from high school algebra. For a single feature $x$, our model's prediction, denoted $\hat{y}$ ("y-hat"), is given by:

$$\hat{y} = wx + b$$
The "learning" process for this model is simply the process of finding the optimal values for two numbers, called **parameters**:
The Two Parameters of a Line
- $w$ (the **Weight** or **Slope**): This number represents the strength and direction of the relationship. In our example, it answers the question: "For one additional year of experience, how much does the salary increase, on average?"
- $b$ (the **Bias** or **Intercept**): This is the starting point of the line. It's the predicted value of $\hat{y}$ when $x$ is zero. In our case, it would be the predicted salary for an analyst with zero years of experience.
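To make these two numbers concrete, suppose (purely as an illustration) the model has learned $w = 5{,}000$ and $b = 50{,}000$, the same values we will later use to simulate our data. The predicted salary for an analyst with 3 years of experience is then:

$$\hat{y} = 5{,}000 \times 3 + 50{,}000 = 65{,}000$$

Every prediction the model makes is just this one multiplication and one addition.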
Part 3: How a Model 'Learns' - Minimizing Error
How do we find the "best" values for $w$ and $b$? We first need a way to measure how "wrong" a particular line is. We do this by calculating the **residuals**.
A residual is the vertical distance between an actual data point ($y_i$) and the model's prediction for that point ($\hat{y}_i$) on the line.
You can think of each residual as the "error" for a single data point. Some will be positive (the line was too low) and some will be negative (the line was too high).
To get the total error for the entire dataset, we combine all the individual residuals into a single number called a **loss function**. The most common loss function for regression is the **Mean Squared Error (MSE)**, also called the cost function $J(w, b)$:

$$J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
We square the residuals so that positive and negative errors don't cancel each other out, and so that larger errors are penalized more heavily.
The "best" line is the one whose parameters $w$ and $b$ result in the **lowest possible MSE**. Our learning problem has now become an **optimization problem**.
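Here is a minimal sketch of that idea in code. The dataset and the two candidate lines below are made up purely for illustration; they are not the model we fit later in this lesson.

```python
import numpy as np

# Tiny illustrative dataset: years of experience and salaries
x = np.array([1.0, 3.0, 5.0, 8.0])
y = np.array([54000.0, 66000.0, 74000.0, 91000.0])

def mse(w, b, x, y):
    """Mean Squared Error of the candidate line y_hat = w*x + b."""
    y_hat = w * x + b          # predictions of the candidate line
    residuals = y - y_hat      # vertical distances from each point to the line
    return np.mean(residuals ** 2)

# A poor guess gives a large error; a better guess gives a much smaller one
print(mse(w=1000, b=40000, x=x, y=y))
print(mse(w=5000, b=50000, x=x, y=y))
```

Learning is the search for the $(w, b)$ pair that drives this single number as low as possible.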
Part 4: The Beauty of the Solution - Calculus and Linear Algebra
How do we find the bottom of the MSE curve? Both calculus and linear algebra offer elegant solutions.
The Calculus Approach
The MSE loss function, $J(w, b)$, forms a smooth, bowl-shaped surface (a convex paraboloid). From calculus, we know that the absolute minimum of a convex function is the point where its derivatives are equal to zero.
To find the optimal parameters, we take the partial derivatives of the loss function with respect to each parameter, set them to zero, and solve the resulting system of equations:

$$\frac{\partial J}{\partial w} = 0, \qquad \frac{\partial J}{\partial b} = 0$$
This process, while tedious by hand, gives a precise, closed-form solution for the best $w$ and $b$. This is what the computer does instantly when you call `.fit()`, using a method called Ordinary Least Squares (OLS).
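For simple linear regression, solving that system yields the standard textbook formulas

$$w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b = \bar{y} - w\bar{x}$$

and a minimal sketch of them in code, on the same tiny illustrative dataset as above, looks like this:

```python
import numpy as np

# Same tiny illustrative dataset as before
x = np.array([1.0, 3.0, 5.0, 8.0])
y = np.array([54000.0, 66000.0, 74000.0, 91000.0])

x_bar, y_bar = x.mean(), y.mean()

# Closed-form OLS solution for simple linear regression
w = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - w * x_bar

print(f"w = {w:.2f}, b = {b:.2f}")  # the slope and intercept with the lowest MSE
```

In essence, this is the arithmetic hiding behind scikit-learn's `.fit()` for a single feature.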
The Linear Algebra Approach
Linear algebra views the problem not as a sum, but as an operation on vectors and matrices. We can represent our entire dataset in a single matrix equation:

$$\mathbf{y} \approx \mathbf{X}\boldsymbol{\theta}$$

Here, $\mathbf{y}$ is the vector of all salaries, $\mathbf{X}$ is a matrix containing a column of ones (for the intercept $b$) and a column of experience values, and $\boldsymbol{\theta} = (b, w)$ is the vector containing our parameters.
Linear algebra provides a direct, beautiful formula called the **Normal Equation** to solve for the optimal parameter vector $\boldsymbol{\theta}$ all at once:

$$\boldsymbol{\theta} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
This single line of math is the foundation of all linear models and is one of the most famous equations in statistics. It solves for all parameters simultaneously.
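Here is a minimal NumPy sketch of the Normal Equation in action, again on the small illustrative dataset (in practice, `np.linalg.lstsq` or scikit-learn are preferred over an explicit matrix inverse for numerical stability):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 8.0])
y = np.array([54000.0, 66000.0, 74000.0, 91000.0])

# Design matrix X: a column of ones (for the intercept) next to the feature column
X = np.column_stack([np.ones_like(x), x])

# Normal Equation: theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
b, w = theta  # first entry is the intercept, second is the slope

print(f"b (intercept) = {b:.2f}, w (slope) = {w:.2f}")
```

Both routes, calculus and linear algebra, arrive at exactly the same line.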
Part 5: SLR in Action - A Python Implementation
Now let's translate this theory into code. We'll build a model to predict salary from experience.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# 1. Create Realistic Dummy Data
np.random.seed(42) # for reproducibility
# Feature: Years of Experience (0-12 years)
X = np.random.rand(100, 1) * 12
# Label: Salary = 50k (base) + 5k/year + some random noise
y = 50000 + 5000 * X.flatten() + np.random.randn(100) * 8000
# 2. Split the Data (The Golden Rule)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Create and Train the Model
# Scikit-learn handles all the calculus/linear algebra for us!
model = LinearRegression()
model.fit(X_train, y_train)
# 4. Examine the Learned Parameters
w_learned = model.coef_[0]
b_learned = model.intercept_
print(f"The model learned the following equation:")
print(f"Salary = {w_learned:.2f} * Experience + {b_learned:.2f}")
print("---")
print("This is very close to our true underlying relationship of y = 5000x + 50000!")
# 5. Make Predictions on the Test Set
y_pred = model.predict(X_test)
# 6. Visualize the Results
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color='blue', label='Actual Salaries')
plt.plot(X_test, y_pred, color='red', linewidth=3, label='Predicted Line of Best Fit')
plt.title('Salary vs. Experience (Test Set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary ($)')
plt.legend()
plt.grid(True)
plt.show()
```

What's Next? How Do We Grade Our Model?
We have successfully built and visualized our first regression model. We can see that the red line fits the blue dots quite well. But in a professional setting, "looks good" is not enough. We need to quantify its performance with precise numbers.
How do we measure the average error of our salary predictions? How much of the difference in salaries can be explained by experience alone?
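As a quick preview, scikit-learn ships metrics that answer exactly these questions; continuing from the script above, they would be computed like this (we will unpack what these numbers actually mean in the coming lessons):

```python
from sklearn.metrics import mean_absolute_error, r2_score

# How far off are our salary predictions on average,
# and how much of the variation in salary does experience explain?
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: ${mae:,.2f}")
print(f"R-squared: {r2:.3f}")
```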
Before we can move on to more complex models, we must first master the art of evaluation. The next two lessons are a deep dive into the scoring systems for both classification and regression, starting with the all-important **Confusion Matrix**.