Lesson 4.0: The Quest for the 'Best' Line: Simple Linear Regression (SLR)
Welcome to Module 4. We now put our theory into practice with the most important model in all of quantitative analysis. This lesson introduces the fundamental problem: how do we find the single best straight line that describes a cloud of data points? This is the foundation of econometrics and predictive modeling.
Part 1: Separating Signal from Noise
Imagine you have data on students' study hours ($X$) and their final exam scores ($Y$). You plot them and see a rough, upward-trending cloud of points. Your brain immediately sees a pattern: more studying is associated with higher scores. This pattern is the **signal**.
But the relationship isn't perfect. Some students who studied a lot did poorly, and some who studied a little did well. This randomness is the **noise**. The goal of linear regression is to separate the signal from the noise.
We formalize this by assuming a **"true" underlying model**:
The True Model: Signal + Noise

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

- $\beta_0$ & $\beta_1$: The **true parameters** (intercept and slope) of the signal. They are fixed, unknown constants we want to estimate.
- $\varepsilon_i$: The **error term**. This represents all the unobserved factors that affect $Y_i$ besides $X_i$ (luck, innate talent, etc.). It's a random variable we can never see.
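To make the signal-versus-noise split concrete, here is a minimal simulation sketch in Python. The specific numbers (intercept 50, slope 5, noise standard deviation 8, 100 students) are made-up illustration values, not estimates from real data; the point is only that each observed score is the straight-line signal plus random noise.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "true" parameters -- in real data these are unknown constants.
beta_0 = 50.0   # true intercept: expected score with zero study hours
beta_1 = 5.0    # true slope: expected extra points per study hour
n = 100         # number of students

hours = rng.uniform(0, 10, size=n)        # X: study hours
noise = rng.normal(0, 8, size=n)          # epsilon: luck, talent, everything else
scores = beta_0 + beta_1 * hours + noise  # Y: observed score = signal + noise
```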
Part 2: Anatomy of a Regression
Since we can't see the true line, we must use our data to create an **estimated line**, called the **fitted line**. We denote our estimates with "hats" ($\hat{\beta}_0$, $\hat{\beta}_1$).
Imagine a scatter plot of data points. A straight line runs through it. For one point $(X_i, Y_i)$, a vertical line drops to the regression line (at $\hat{Y}_i$). The length of this vertical line is the residual ($e_i$).
- The Fitted Line ($\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$): Our best guess for the signal. It's the line we actually calculate.
  $\hat{Y}_i$ is our **predicted value** of $Y$ for a given $X$.
- The Residual ($e_i = Y_i - \hat{Y}_i$): The observable error of our fitted line. It's the difference between the actual data point and our prediction for that point.
  The residual $e_i$ is our sample-based estimate of the unobservable true error $\varepsilon_i$.
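As a sketch of how these pieces fit together, the snippet below takes a tiny made-up dataset and a pair of candidate estimates (deliberately arbitrary guesses, not the OLS solution we have yet to derive) and computes the fitted values and residuals.

```python
import numpy as np

# Tiny made-up dataset: study hours (X) and exam scores (Y).
hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([55.0, 62.0, 61.0, 70.0, 74.0])

# Candidate estimates (illustrative guesses, not the OLS formulas).
beta_0_hat, beta_1_hat = 50.0, 4.5

y_hat = beta_0_hat + beta_1_hat * hours   # predicted values, Y-hat_i
residuals = scores - y_hat                # residuals, e_i = Y_i - Y-hat_i

print(residuals)   # 0.5, 3.0, -2.5, 2.0, 1.5 -- one per data point
```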
Part 3: Defining the 'Best' Line
How do we find the "best" estimates $\hat{\beta}_0$ and $\hat{\beta}_1$? We need a criterion. The goal is to find the line that makes our observable errors (the residuals, $e_i$) as small as possible.
The 'Aha!' Moment: Minimizing Squared Errors
How do we measure the total size of all our errors?
- Sum them ($\sum e_i$)? Bad idea. Large positive and negative errors would cancel out, making a terrible line look perfect (see the numeric check after this list).
- Sum their absolute values ($\sum |e_i|$)? Better, but absolute values are difficult to work with in calculus.
- The Genius Idea: Sum the **squares** of the errors ($\sum e_i^2$).
- This makes all errors positive.
- It heavily penalizes large errors (one error of 10 contributes $10^2 = 100$ to the sum, while ten errors of 1 contribute only 10).
- It results in a smooth, differentiable function that is easy to minimize with calculus.
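A quick numeric check, using a small set of made-up residuals, shows why the three criteria behave so differently: the plain sum can be zero even when every prediction is wrong, while absolute values and squares both avoid cancellation, and squares punish the big misses hardest.

```python
import numpy as np

residuals = np.array([4.0, -4.0, 1.0, -1.0])   # every prediction is wrong

print(residuals.sum())           # 0.0  -> cancellation makes a bad line look perfect
print(np.abs(residuals).sum())   # 10.0 -> no cancellation, but awkward for calculus
print((residuals ** 2).sum())    # 34.0 -> no cancellation, large errors dominate
```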
The Ordinary Least Squares (OLS) Objective Function
The goal of OLS is to find the specific values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that **minimize** the **Sum of Squared Residuals (SSR)**:

$$SSR(\hat{\beta}_0, \hat{\beta}_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2$$
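Before doing any calculus, it can help to see the SSR as an ordinary function of the two unknowns. The sketch below reuses the made-up five-point dataset from earlier and an arbitrary search grid; it simply evaluates SSR over many candidate lines and reports the smallest value found. This brute force is only for intuition; the next lesson's derivation replaces it with exact formulas.

```python
import numpy as np

hours  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
scores = np.array([55.0, 62.0, 61.0, 70.0, 74.0])

def ssr(b0, b1):
    """Sum of squared residuals for a candidate intercept b0 and slope b1."""
    residuals = scores - (b0 + b1 * hours)
    return float((residuals ** 2).sum())

# Brute-force grid search over candidate lines (arbitrary grid for illustration).
b0_grid = np.linspace(40, 60, 201)   # candidate intercepts
b1_grid = np.linspace(0, 10, 201)    # candidate slopes
best = min((ssr(b0, b1), b0, b1) for b0 in b0_grid for b1 in b1_grid)

print(best)   # (smallest SSR found, its b0, its b1)
```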
What's Next? The Derivation
We have defined our objective. We have a clear mountain to climb (or in this case, a valley to find the bottom of). We want to find the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize this SSR function.
This is a classic optimization problem that can be solved with basic calculus. In the next lesson, we will perform the full, "no-skip" mathematical derivation to find the famous formulas for the OLS estimators.