Principal Component Analysis (PCA)
The premier algorithm for dimensionality reduction, powered by SVD.
Imagine you are an astronomer who has just collected data on the positions of a million stars in a newly discovered galaxy. Your dataset is a massive table with three columns: `x`, `y`, and `z` coordinates. This is a 3-dimensional dataset. But as you plot the data, you notice something remarkable: the galaxy is almost completely flat, like our own Milky Way. It forms a thin, rotating disk.
While your data lives in 3D, the "interesting" information—the structure of the galaxy—is almost entirely 2-dimensional. The third dimension, the "thickness" of the disk, is mostly just small, random noise. Wouldn't it be great if you could find the perfect 2D plane that best represents your galaxy? This is the exact problem that **Principal Component Analysis (PCA)** solves.
The Goal of PCA: Finding the Directions of Maximum Variance
PCA is a technique for finding a new, optimal coordinate system for your data. The axes of this new system are called **Principal Components**.
- The first principal component (PC1) is the direction in the data with the **largest possible variance**.
- The second principal component (PC2) is the direction with the next largest variance, with the crucial constraint that it must be **orthogonal** to PC1.
- PC3 is the next most important direction orthogonal to both PC1 and PC2, and so on.
- These new axes are **uncorrelated**.
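To make the "maximum variance" idea concrete, here is a minimal NumPy sketch (the toy dataset and variable names are ours): it builds a stretched 2D cloud, finds PC1 via the SVD route described later in this section, and checks that no random direction captures more variance than PC1 does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data, stretched far more along one axis than the other.
X = rng.normal(size=(2000, 2)) * np.array([3.0, 0.5])
B = X - X.mean(axis=0)                       # centered data

# PC1 via the SVD of the centered data (the route described below).
_, _, Vt = np.linalg.svd(B, full_matrices=False)
pc1 = Vt[0]

# Compare the variance of projections onto PC1 with projections onto
# 1,000 random unit directions: none of them should beat PC1.
var_pc1 = (B @ pc1).var()
angles = rng.uniform(0, np.pi, size=1000)
random_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
var_random = (B @ random_dirs.T).var(axis=0)

print(var_pc1 >= var_random.max())           # True: no direction beats PC1
```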
The Crucial First Step: Centering the Data
- Take your original data matrix `X` (where each row is a sample, each column is a feature).
- For each column (feature), calculate its mean (average).
- Subtract the corresponding mean from every entry in that column.
The resulting matrix, let's call it `B`, is your **centered data matrix**. Every column now has a mean of zero.
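A minimal sketch of the centering step, assuming NumPy and a small made-up data matrix `X`:

```python
import numpy as np

# Hypothetical data matrix: one row per sample, one column per feature.
X = np.array([[2.0, 30.0],
              [4.0, 10.0],
              [6.0, 20.0]])

feature_means = X.mean(axis=0)   # mean of each column: [4.0, 20.0]
B = X - feature_means            # centered data matrix

print(B)                         # [[-2. 10.] [ 0. -10.] [ 2. 0.]]
print(B.mean(axis=0))            # [0. 0.] -- every column now has mean zero
```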
Path 1: Eigendecomposition of the Covariance Matrix
The traditional statistical approach.
The covariance matrix `S` is computed from the centered data `B`:
`S = BᵀB / (m − 1)`, where `m` is the number of samples. PCA's goal is equivalent to finding an orthogonal matrix `P` that **diagonalizes the covariance matrix `S`**, i.e. `PᵀSP = D` with `D` diagonal.
The connection is: the **eigenvectors** of the covariance matrix `S` are the **Principal Components**.
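A sketch of Path 1 in NumPy, on made-up correlated data: build `S` from the centered matrix, then take its eigendecomposition with `np.linalg.eigh` (appropriate here because `S` is symmetric):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # toy correlated features
B = X - X.mean(axis=0)                                    # centered data
m = B.shape[0]

S = (B.T @ B) / (m - 1)                  # covariance matrix of the features

# Eigendecomposition of the symmetric matrix S.
eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]        # reorder by descending variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(eigvals)        # variance explained by each principal component
print(eigvecs)        # columns are the principal components
```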
Path 2: The SVD of the Data Matrix
The modern approach: more direct and numerically stable.
We compute the SVD of our centered data matrix `B`: `B = UΣVᵀ`. Let's substitute this into the covariance formula:
`S = (UΣVᵀ)ᵀ(UΣVᵀ) / (m − 1) = VΣᵀUᵀUΣVᵀ / (m − 1) = V (ΣᵀΣ / (m − 1)) Vᵀ`
Because `UᵀU = I`, the middle factors cancel, and what remains is an eigendecomposition of `S`: an orthogonal matrix `V`, a diagonal matrix `ΣᵀΣ / (m − 1)`, and `Vᵀ`. We found it without ever computing `S` directly.
The direct connection: the **right singular vectors of the data matrix `B` (the columns of `V`)** are the **Principal Components**.
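And a sketch of Path 2, again on made-up data, checking that the SVD route and the covariance route agree (up to the sign of each axis, which is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))   # toy correlated features
B = X - X.mean(axis=0)                                    # centered data
m = B.shape[0]

# Path 2: SVD of the centered data matrix, no covariance matrix needed.
U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
pcs_svd = Vt.T                                  # columns of V = principal components
var_svd = sigma**2 / (m - 1)                    # variance explained per component

# Path 1, for comparison.
eigvals, eigvecs = np.linalg.eigh((B.T @ B) / (m - 1))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.allclose(var_svd, eigvals))                     # same variances
print(np.allclose(np.abs(pcs_svd), np.abs(eigvecs)))     # same axes, up to sign
```

In practice the SVD route is preferred because it never forms `BᵀB`, which squares the condition number of the problem and can amplify rounding error.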
The Complete PCA Algorithm (The Recipe)
- Center the Data: Compute the mean of each feature and subtract it to get matrix `B`.
- Compute the SVD: Compute the SVD of `B`: `B = UΣVᵀ`.
- Identify Principal Components: The principal components are the columns of `V`.
- Measure Variance: The variance explained by each component is proportional to the square of its singular value (`σᵢ²`).
- Reduce Dimensionality: Select the first `k` columns of `V` to create `V_k`.
- Project the Data: The new, lower-dimensional dataset is `Transformed_Data = B * V_k`.
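Putting the six steps together, here is a minimal end-to-end sketch (the `pca` helper and the toy data are ours, not a reference implementation):

```python
import numpy as np

def pca(X, k):
    """Reduce X (samples x features) to k dimensions, following the recipe above."""
    # 1. Center the data.
    mean = X.mean(axis=0)
    B = X - mean
    # 2. Compute the SVD of the centered data.
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    # 3. & 4. Principal components and the variance each one explains.
    components = Vt.T
    explained_variance = sigma**2 / (X.shape[0] - 1)
    # 5. Keep the first k components.
    V_k = components[:, :k]
    # 6. Project the centered data onto them.
    transformed = B @ V_k
    return transformed, V_k, explained_variance

# Example: project 5-dimensional toy data down to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, V_k, var = pca(X, k=2)
print(Z.shape)                    # (200, 2)
print(var[:2] / var.sum())        # fraction of total variance kept
```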
**Up Next:** We will explore other powerful applications of the SVD, including **low-rank approximation**.