The Geometry of "Best Fit": Projections

Building the mathematical machinery to find the closest possible solution.

In our last lesson, we faced a hard truth: most real-world systems `Ax=b` have no solution. We redefined our goal: instead of trying to hit the unreachable target `b`, we will aim for the closest point to it that we *can* hit.

This closest point, we said, is the **orthogonal projection** of `b` onto the Column Space of `A`. Today, we build the geometric machinery to find that projection. We will start with the simplest case imaginable and build up to the general, powerful formula.

Part 1: Projection onto a Line
The simplest case is projecting one vector onto another. This forms the basis for everything else.

Imagine a single vector `a` in space, which defines a line. Now, imagine another vector `b` that is not on this line. The closest point on the line to `b` is the **orthogonal projection of `b` onto `a`**, which we'll call `p`.

The key feature is that the error vector, `e = b - p`, must be **orthogonal** to the vector `a` that defines the line. This means their dot product is zero:

$$a \cdot (b - p) = 0$$

We also know `p` must be a scaled version of `a`, so `p = x̂a` for some scalar `x̂`. Substituting this in:

$$a \cdot (b - \hat{x}a) = 0 \implies a \cdot b - \hat{x}(a \cdot a) = 0$$

Solving for our unknown scalar `x̂` gives:

$$\hat{x} = \frac{a \cdot b}{a \cdot a} = \frac{a^T b}{a^T a}$$
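
As a concrete check, take the (arbitrarily chosen) vectors `a = (1, 2)` and `b = (3, 4)`:

$$\hat{x} = \frac{a^T b}{a^T a} = \frac{1 \cdot 3 + 2 \cdot 4}{1^2 + 2^2} = \frac{11}{5}$$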

And the projection vector `p` itself is:

$$p = \hat{x}a = \left( \frac{a^T b}{a^T a} \right) a$$
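
Here's a minimal NumPy sketch (an illustration, reusing the same arbitrary vectors as above) that carries the example through to `p` and confirms the error is orthogonal to `a`:

```python
import numpy as np

# Arbitrary example vectors (same ones used in the worked example above).
a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

# Scalar coefficient: x_hat = (a^T b) / (a^T a)
x_hat = (a @ b) / (a @ a)   # 11/5 = 2.2

# Projection of b onto the line through a: p = x_hat * a
p = x_hat * a               # [2.2, 4.4]

# The error e = b - p must be orthogonal to a.
e = b - p
print(x_hat, p, a @ e)      # a @ e is 0 (up to floating-point rounding)
```
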
Part 2: Projection onto a Subspace
Now we generalize this to project a vector `b` onto an entire subspace, like the Column Space of a matrix `A`.

Let the columns of `A` be the linearly independent vectors `a₁, a₂, ..., aₙ`; they form a basis for its Column Space. The projection `p` lies in this space, so it must be a linear combination of these basis vectors:

$$p = \hat{x}_1 a_1 + \hat{x}_2 a_2 + \dots + \hat{x}_n a_n = A\hat{x}$$

Here, `x̂` is the vector of coefficients we need to find. The error `e = b - p` must be orthogonal to the *entire* subspace, meaning it's orthogonal to every basis vector `aᵢ`.

$$\begin{cases} a_1^T(b - A\hat{x}) = 0 \\ a_2^T(b - A\hat{x}) = 0 \\ \vdots \\ a_n^T(b - A\hat{x}) = 0 \end{cases}$$

Since the rows of `Aᵀ` are exactly `a₁ᵀ, a₂ᵀ, ..., aₙᵀ`, this entire system of equations can be written in a single, compact matrix form:

$$A^T(b - A\hat{x}) = 0$$

Rearranging this gives us the magnificent **Normal Equations**:

$$A^T A \hat{x} = A^T b$$

We have converted our original, unsolvable system `Ax=b` into a new, smaller `n × n` square system that is **always solvable**. Because the columns of `A` are linearly independent, `AᵀA` is invertible, so the Normal Equations give a unique best approximate solution `x̂`.
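
Here is a small NumPy sketch of that idea; the matrix and right-hand side are made-up examples, and `np.linalg.lstsq` is included only as an independent cross-check:

```python
import numpy as np

# Made-up overdetermined system: 4 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0, 4.0])

# Normal Equations: (A^T A) x_hat = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Projection of b onto the Column Space of A, and the error vector.
p = A @ x_hat
e = b - p

print("x_hat:", x_hat)
print("A^T e:", A.T @ e)  # ~0: the error is orthogonal to every column of A
print("check:", np.linalg.lstsq(A, b, rcond=None)[0])  # matches x_hat
```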

**Up Next:** We will take the Normal Equations we just derived and use them as our primary tool to solve a real-world linear regression problem from start to finish.