Lesson 5.6: The Mathematics of PCA

We have the intuition for PCA: find the directions of maximum variance. Now we prove why this problem is mathematically identical to finding the eigenvectors of the covariance matrix. This lesson is a masterclass in applying the linear algebra of eigendecomposition to a real-world statistical problem.

Part 1: Framing the Problem - Maximizing Variance

Let $\mathbf{X}$ be our $n \times p$ data matrix, centered and scaled. A single observation (row) is $\mathbf{x}_i$. The sample covariance matrix is $\mathbf{\Sigma} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$.

A Principal Component is a linear combination of the original features. We can define the first Principal Component, $\mathbf{w}_1$, as a $p \times 1$ vector of weights.

$$\mathbf{w}_1 = [w_{11}, w_{12}, \dots, w_{1p}]^T$$

The "score" of the i-th observation on this component is the projection of that observation onto the weight vector:

$$z_{i1} = \mathbf{x}_i \cdot \mathbf{w}_1 = w_{11}x_{i1} + w_{12}x_{i2} + \dots + w_{1p}x_{ip}$$

The vector of all scores is $\mathbf{z}_1 = \mathbf{X}\mathbf{w}_1$. PCA's goal is to find the weights $\mathbf{w}_1$ that **maximize the variance** of these scores.

The Optimization Problem for PC1

Maximize the variance of the scores, subject to the constraint that the weight vector has unit length (without this constraint, the variance could be made arbitrarily large simply by scaling up $\mathbf{w}_1$; with it, the solution is unique up to sign).

$$\max_{\mathbf{w}_1} \text{Var}(\mathbf{X}\mathbf{w}_1) \quad \text{subject to} \quad \|\mathbf{w}_1\|^2 = \mathbf{w}_1^T\mathbf{w}_1 = 1$$
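
To make the objective concrete, here is a minimal numerical sketch (the data, dimensions, and random seed are illustrative assumptions, not part of the lesson): draw a unit-length weight vector, compute the scores $\mathbf{z}_1 = \mathbf{X}\mathbf{w}_1$, and measure their variance. PCA asks which unit vector makes this number as large as possible.

```python
# A minimal sketch of the objective with made-up data: pick a unit-length
# weight vector w, form the scores z = Xw, and measure their variance.
# The data, dimensions, and seed below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated features
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)               # center and scale

w = rng.standard_normal(p)
w /= np.linalg.norm(w)            # enforce the constraint w^T w = 1

z = X @ w                         # one score per observation
print("score variance for this w:", z.var(ddof=1))
```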

Part 2: The Eigendecomposition Solution

Let's simplify the objective function. The variance of a linear combination is given by $\text{Var}(\mathbf{Xw}) = \mathbf{w}^T \text{Var}(\mathbf{X}) \mathbf{w} = \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$ (from Module 2).

So our problem is now:

$$\max_{\mathbf{w}_1} \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 \quad \text{subject to} \quad \mathbf{w}_1^T\mathbf{w}_1 = 1$$
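
This identity is easy to sanity-check numerically. The short sketch below, again with synthetic data, confirms that the variance of the scores $\mathbf{Xw}$ matches the quadratic form $\mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$.

```python
# A quick numerical check (synthetic data) that Var(Xw) = w^T Σ w,
# where Σ is the sample covariance matrix of the centered, scaled X.
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)          # sample covariance matrix

w = rng.standard_normal(p)
w /= np.linalg.norm(w)             # unit-length weights

lhs = (X @ w).var(ddof=1)          # Var(Xw), computed from the scores
rhs = w @ Sigma @ w                # the quadratic form w^T Σ w
print(np.isclose(lhs, rhs))        # True (up to floating-point error)
```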

The Lagrangian and the Eigenvector Connection

This is a constrained optimization problem, which we solve using a Lagrange multiplier, $\lambda_1$.

$$\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 - \lambda_1 (\mathbf{w}_1^T\mathbf{w}_1 - 1)$$

To find the maximum, we take the derivative with respect to $\mathbf{w}_1$ and set it to zero, using the matrix-calculus identity $\frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \mathbf{A} \mathbf{w} = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_1} = 2\mathbf{\Sigma}\mathbf{w}_1 - 2\lambda_1\mathbf{w}_1 = \mathbf{0}$$

This simplifies to:

$$\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$$

This is the fundamental eigenvector equation, $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$!

This stunning result shows that the vector $\mathbf{w}_1$ that maximizes the variance must be an **eigenvector** of the covariance matrix $\mathbf{\Sigma}$.

Which eigenvector? Let's pre-multiply by $\mathbf{w}_1^T$:

$$\mathbf{w}_1^T\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1^T\mathbf{w}_1$$

Since $\mathbf{w}_1^T\mathbf{w}_1 = 1$, this becomes $\text{Var}(\mathbf{z}_1) = \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 = \lambda_1$. To maximize the variance, we must choose the eigenvector corresponding to the **largest eigenvalue**. This is PC1.
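
The sketch below checks this result numerically on synthetic data: the leading eigenvector returned by `np.linalg.eigh` satisfies $\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$, its score variance equals $\lambda_1$, and none of a thousand random unit vectors produces a higher score variance.

```python
# A sketch (synthetic data) showing that the leading eigenvector of Σ solves
# the maximization: Σ w1 = λ1 w1, Var(X w1) = λ1, and no random unit vector
# achieves a larger score variance than λ1.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals, evecs = np.linalg.eigh(Sigma)      # eigh: ascending eigenvalues for symmetric Σ
lam1 = evals[-1]                          # largest eigenvalue
w1 = evecs[:, -1]                         # its (unit-length) eigenvector

print(np.allclose(Sigma @ w1, lam1 * w1))        # the eigenvector equation holds
print(np.isclose((X @ w1).var(ddof=1), lam1))    # Var(z1) = λ1

trials = rng.standard_normal((1000, p))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)   # 1000 random unit vectors
best_random = max((X @ w).var(ddof=1) for w in trials)
print(best_random <= lam1 + 1e-12)                        # True: none beats λ1
```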

The Principal Components are the Eigenvectors

  • Principal Component 1 ($\mathbf{w}_1$): The eigenvector of $\mathbf{\Sigma}$ corresponding to the largest eigenvalue, $\lambda_1$.
  • Principal Component 2 ($\mathbf{w}_2$): The eigenvector corresponding to the second-largest eigenvalue, $\lambda_2$, and so on.
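
In practice a library does this work for you. The sketch below compares the eigenvectors and eigenvalues of $\mathbf{\Sigma}$ with the output of scikit-learn's `PCA` on the same kind of synthetic data; it assumes scikit-learn is installed, and it compares absolute values because an eigenvector is only determined up to sign.

```python
# A sketch comparing a hand-rolled eigendecomposition of Σ with scikit-learn's
# PCA on synthetic data. Assumes scikit-learn is installed; components are
# compared in absolute value because -w is as valid an eigenvector as w.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]                 # sort largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

pca = PCA(n_components=p).fit(X)
print(np.allclose(np.abs(pca.components_), np.abs(evecs.T)))   # same directions
print(np.allclose(pca.explained_variance_, evals))             # same eigenvalues
```

scikit-learn's `explained_variance_` uses the same $n-1$ denominator as our sample covariance, which is why the eigenvalues match here.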

Part 3: Explained Variance - The Role of Eigenvalues

We've shown that the variance of the scores on a principal component is equal to the component's eigenvalue:

$$\text{Var}(\mathbf{z}_j) = \lambda_j$$

This gives us a natural way to measure the "importance" of each component. The total variance in the dataset is the sum of the variances of all its features, which is also equal to the sum of all the eigenvalues (the trace of $\mathbf{\Sigma}$):

$$\text{Total Variance} = \sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p \lambda_j$$

Proportion of Variance Explained

The proportion of total variance explained by the j-th principal component is:

$$\frac{\lambda_j}{\sum_{i=1}^p \lambda_i}$$

To find the cumulative variance explained by the first $k$ components, we just sum their individual proportions. This is what we look at to decide how many components to keep.
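
A short sketch of that calculation on synthetic data: compute the eigenvalues, convert them to proportions of the total, take the cumulative sum, and find the smallest number of components that crosses a threshold (the 90% cutoff below is just an illustrative choice, not a rule from the lesson).

```python
# A sketch (synthetic data) of the explained-variance bookkeeping: proportions,
# cumulative proportions, and the smallest k reaching an (illustrative) 90% cutoff.
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 6
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # eigenvalues, largest first

print(np.isclose(np.trace(Sigma), evals.sum()))    # total variance = sum of eigenvalues

explained_ratio = evals / evals.sum()              # proportion explained by each PC
cumulative = np.cumsum(explained_ratio)            # cumulative proportion explained
k = int(np.searchsorted(cumulative, 0.90)) + 1     # smallest k with >= 90% explained
print(explained_ratio.round(3))
print(cumulative.round(3), "-> keep", k, "components")
```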

What's Next? Putting PCA to Work

We have now forged the complete theoretical link between the statistical goal of maximizing variance and the linear algebra tool of eigendecomposition.

It's time to see how this powerful technique is actually used in quantitative finance. In the next lesson, we will explore practical applications, such as using PCA to build custom market indices, create statistical risk factors, and denoise correlation matrices for more robust portfolio optimization.