Lesson 5.6: The Mathematics of PCA

We have the intuition for PCA: find the directions of maximum variance. Now we prove why this problem is mathematically identical to finding the eigenvectors of the covariance matrix. This lesson is a masterclass in applying the linear algebra of eigendecomposition to a real-world statistical problem.

Part 1: Framing the Problem - Maximizing Variance

Let $\mathbf{X}$ be our $n \times p$ data matrix, centered and scaled. A single observation (row) is $\mathbf{x}_i$. The sample covariance matrix is $\mathbf{\Sigma} = \frac{1}{n-1}\mathbf{X}^T\mathbf{X}$.

A Principal Component is a linear combination of the original features. We can define the first Principal Component, $\mathbf{w}_1$, as a $p \times 1$ vector of weights.

$$\mathbf{w}_1 = [w_{11}, w_{12}, \dots, w_{1p}]^T$$

The "score" of the i-th observation on this component is the projection of that observation onto the weight vector:

$$z_{i1} = \mathbf{x}_i \cdot \mathbf{w}_1 = w_{11}x_{i1} + w_{12}x_{i2} + \dots + w_{1p}x_{ip}$$

The vector of all scores is $\mathbf{z}_1 = \mathbf{X}\mathbf{w}_1$. PCA's goal is to find the weights $\mathbf{w}_1$ that **maximize the variance** of these scores.

The Optimization Problem for PC1

Maximize the variance of the scores, subject to the constraint that the weight vector has unit length (without this constraint, the variance could be made arbitrarily large simply by scaling up $\mathbf{w}_1$; with it, the solution is unique up to sign).

$$\max_{\mathbf{w}_1} \text{Var}(\mathbf{X}\mathbf{w}_1) \quad \text{subject to} \quad \|\mathbf{w}_1\|^2 = \mathbf{w}_1^T\mathbf{w}_1 = 1$$
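
To make the objective concrete, here is a minimal numerical sketch (the data, dimensions, and random seed are illustrative assumptions, not part of the lesson): draw a unit-length weight vector, compute the scores $\mathbf{z}_1 = \mathbf{X}\mathbf{w}_1$, and measure their variance. PCA asks which unit vector makes this number as large as possible.

```python
# A minimal sketch of the objective with made-up data: pick a unit-length
# weight vector w, form the scores z = Xw, and measure their variance.
# The data, dimensions, and seed below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated features
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)               # center and scale

w = rng.standard_normal(p)
w /= np.linalg.norm(w)            # enforce the constraint w^T w = 1

z = X @ w                         # one score per observation
print("score variance for this w:", z.var(ddof=1))
```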

Part 2: The Eigendecomposition Solution

Let's simplify the objective function. The variance of a linear combination is given by $\text{Var}(\mathbf{Xw}) = \mathbf{w}^T \text{Var}(\mathbf{X}) \mathbf{w} = \mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$ (from Module 2).

So our problem is now:

$$\max_{\mathbf{w}_1} \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 \quad \text{subject to} \quad \mathbf{w}_1^T\mathbf{w}_1 = 1$$
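
This identity is easy to sanity-check numerically. The short sketch below, again with synthetic data, confirms that the variance of the scores $\mathbf{Xw}$ matches the quadratic form $\mathbf{w}^T \mathbf{\Sigma} \mathbf{w}$.

```python
# A quick numerical check (synthetic data) that Var(Xw) = w^T Σ w,
# where Σ is the sample covariance matrix of the centered, scaled X.
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)          # sample covariance matrix

w = rng.standard_normal(p)
w /= np.linalg.norm(w)             # unit-length weights

lhs = (X @ w).var(ddof=1)          # Var(Xw), computed from the scores
rhs = w @ Sigma @ w                # the quadratic form w^T Σ w
print(np.isclose(lhs, rhs))        # True (up to floating-point error)
```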

The Lagrangian and the Eigenvector Connection

This is a constrained optimization problem, which we solve using a Lagrange multiplier, $\lambda_1$.

$$\mathcal{L}(\mathbf{w}_1, \lambda_1) = \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 - \lambda_1 (\mathbf{w}_1^T\mathbf{w}_1 - 1)$$

To find the maximum, we take the derivative with respect to $\mathbf{w}_1$ and set it to zero, using the matrix-calculus identity $\frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \mathbf{A} \mathbf{w} = 2\mathbf{A}\mathbf{w}$ for symmetric $\mathbf{A}$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_1} = 2\mathbf{\Sigma}\mathbf{w}_1 - 2\lambda_1\mathbf{w}_1 = \mathbf{0}$$

This simplifies to:

$$\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$$

This is the fundamental eigenvector equation, $\mathbf{A}\mathbf{v} = \lambda\mathbf{v}$!

This stunning result shows that the vector $\mathbf{w}_1$ that maximizes the variance must be an **eigenvector** of the covariance matrix $\mathbf{\Sigma}$.

Which eigenvector? Let's pre-multiply by $\mathbf{w}_1^T$:

$$\mathbf{w}_1^T\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1 \mathbf{w}_1^T\mathbf{w}_1$$

Since $\mathbf{w}_1^T\mathbf{w}_1 = 1$, this becomes $\text{Var}(\mathbf{z}_1) = \mathbf{w}_1^T \mathbf{\Sigma} \mathbf{w}_1 = \lambda_1$. To maximize the variance, we must choose the eigenvector corresponding to the **largest eigenvalue**. This is PC1.
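
The sketch below checks this result numerically on synthetic data: the leading eigenvector returned by `np.linalg.eigh` satisfies $\mathbf{\Sigma}\mathbf{w}_1 = \lambda_1\mathbf{w}_1$, its score variance equals $\lambda_1$, and none of a thousand random unit vectors produces a higher score variance.

```python
# A sketch (synthetic data) showing that the leading eigenvector of Σ solves
# the maximization: Σ w1 = λ1 w1, Var(X w1) = λ1, and no random unit vector
# achieves a larger score variance than λ1.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals, evecs = np.linalg.eigh(Sigma)      # eigh: ascending eigenvalues for symmetric Σ
lam1 = evals[-1]                          # largest eigenvalue
w1 = evecs[:, -1]                         # its (unit-length) eigenvector

print(np.allclose(Sigma @ w1, lam1 * w1))        # the eigenvector equation holds
print(np.isclose((X @ w1).var(ddof=1), lam1))    # Var(z1) = λ1

trials = rng.standard_normal((1000, p))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)   # 1000 random unit vectors
best_random = max((X @ w).var(ddof=1) for w in trials)
print(best_random <= lam1 + 1e-12)                        # True: none beats λ1
```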

The Principal Components are the Eigenvectors

  • Principal Component 1 ($\mathbf{w}_1$): The eigenvector of $\mathbf{\Sigma}$ corresponding to the largest eigenvalue, $\lambda_1$.
  • Principal Component 2 ($\mathbf{w}_2$): The eigenvector corresponding to the second-largest eigenvalue, $\lambda_2$, and so on.
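
In practice a library does this work for you. The sketch below compares the eigenvectors and eigenvalues of $\mathbf{\Sigma}$ with the output of scikit-learn's `PCA` on the same kind of synthetic data; it assumes scikit-learn is installed, and it compares absolute values because an eigenvector is only determined up to sign.

```python
# A sketch comparing a hand-rolled eigendecomposition of Σ with scikit-learn's
# PCA on synthetic data. Assumes scikit-learn is installed; components are
# compared in absolute value because -w is as valid an eigenvector as w.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]                 # sort largest eigenvalue first
evals, evecs = evals[order], evecs[:, order]

pca = PCA(n_components=p).fit(X)
print(np.allclose(np.abs(pca.components_), np.abs(evecs.T)))   # same directions
print(np.allclose(pca.explained_variance_, evals))             # same eigenvalues
```

scikit-learn's `explained_variance_` uses the same $n-1$ denominator as our sample covariance, which is why the eigenvalues match here.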

Part 3: Explained Variance - The Role of Eigenvalues

We've shown that the variance of the scores on a principal component is equal to the component's eigenvalue:

$$\text{Var}(\mathbf{z}_j) = \lambda_j$$

This gives us a natural way to measure the "importance" of each component. The total variance in the dataset is the sum of the variances of all its features, which is also equal to the sum of all the eigenvalues (the trace of $\mathbf{\Sigma}$):

$$\text{Total Variance} = \sum_{j=1}^p \text{Var}(X_j) = \sum_{j=1}^p \lambda_j$$

Proportion of Variance Explained

The proportion of total variance explained by the j-th principal component is:

$$\frac{\lambda_j}{\sum_{i=1}^p \lambda_i}$$

To find the cumulative variance explained by the first $k$ components, we just sum their individual proportions. This is what we look at to decide how many components to keep.
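
A short sketch of that calculation on synthetic data: compute the eigenvalues, convert them to proportions of the total, take the cumulative sum, and find the smallest number of components that crosses a threshold (the 90% cutoff below is just an illustrative choice, not a rule from the lesson).

```python
# A sketch (synthetic data) of the explained-variance bookkeeping: proportions,
# cumulative proportions, and the smallest k reaching an (illustrative) 90% cutoff.
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 6
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
Sigma = X.T @ X / (n - 1)

evals = np.sort(np.linalg.eigvalsh(Sigma))[::-1]   # eigenvalues, largest first

print(np.isclose(np.trace(Sigma), evals.sum()))    # total variance = sum of eigenvalues

explained_ratio = evals / evals.sum()              # proportion explained by each PC
cumulative = np.cumsum(explained_ratio)            # cumulative proportion explained
k = int(np.searchsorted(cumulative, 0.90)) + 1     # smallest k with >= 90% explained
print(explained_ratio.round(3))
print(cumulative.round(3), "-> keep", k, "components")
```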

What's Next? Putting PCA to Work

We have now forged the complete theoretical link between the statistical goal of maximizing variance and the linear algebra tool of eigendecomposition.

It's time to see how this powerful technique is actually used in quantitative finance. In the next lesson, we will explore practical applications, such as using PCA to build custom market indices, create statistical risk factors, and denoise correlation matrices for more robust portfolio optimization.