Lesson 5.4: The Problem with the Normal Equations

Exploring the numerical instability of AᵀA and the concept of ill-conditioning.

In our last lesson, we successfully used the Normal Equations, ATAx^=ATbA^TA\hat{x} = A^Tb, as a powerful recipe to find the line of best fit. It felt clean, simple, and definitive. It is the textbook method and the foundation of our theoretical understanding of least squares.

However, when we move from small, neat textbook examples to the large, messy datasets of the real world, a dark side of the Normal Equations can emerge. Direct computation of ATAA^TA can sometimes lead to serious numerical problems.

Today, we will play the role of a numerical analyst and investigate the primary weakness of this method: the problem of ill-conditioning.

What is 'Conditioning'?

In numerical linear algebra, the condition number of a matrix is a measure of how sensitive the output of a problem is to small changes in the input data.

  • A matrix with a low condition number is well-conditioned. Small errors or noise in your input data `b` will only lead to small errors in your solution `x̂`. This is a stable, reliable situation.
  • A matrix with a very high condition number is ill-conditioned. Even tiny, unavoidable rounding errors in your input data can be magnified into massive, catastrophic errors in your final solution. This is a dangerous, unstable situation.

The Vicious Multiplier: `cond(AᵀA) = (cond(A))²`

The matrix we actually have to solve our system with is not `A`, but the new, square matrix `K = AᵀA`. It is a fundamental property of linear algebra that the condition number of `K` is related to the condition number of `A` in a very dramatic way.

condition_number(ATA)=(condition_number(A))2\text{condition\_number}(A^TA) = (\text{condition\_number}(A))^2

The process of forming ATAA^TA squares the condition number. Why is this so dangerous?

  • If `A` is slightly ill-conditioned, say `cond(A) = 100`, then `cond(AᵀA) = 100² = 10,000`.
  • If `A` is significantly ill-conditioned, say `cond(A) = 10⁷`, then `cond(AᵀA) = (10⁷)² = 10¹⁴`.

The Need for a Better Way

The Normal Equations are a perfect theoretical tool for understanding the geometry of least squares. However, for serious numerical work, directly computing ATAA^TA is often avoided.

We need a more sophisticated algorithm that can solve the least squares problem without ever explicitly forming the `AᵀA` matrix. This need for a stable, high-precision method is the entire motivation for our next topics: the Gram-Schmidt process and the beautiful QR Decomposition.