Lesson 6.2: Principal Component Analysis (PCA)

The premier algorithm for dimensionality reduction, powered by SVD.

Imagine you are an astronomer who has just collected data on the positions of a million stars in a newly discovered galaxy. Your dataset is a massive table with three columns: `x`, `y`, and `z` coordinates. This is a 3-dimensional dataset. But as you plot the data, you notice something remarkable: the galaxy is almost completely flat, like our own Milky Way. It forms a thin, rotating disk.

While your data lives in 3D, the "interesting" information—the structure of the galaxy—is almost entirely 2-dimensional. The third dimension, the "thickness" of the disk, is mostly just small, random noise. Wouldn't it be great if you could find the 2D plane that best represents your galaxy? This is the exact problem that Principal Component Analysis (PCA) solves.

The Goal of PCA: Finding the Directions of Maximum Variance

The Core Idea

PCA is a technique for finding a new, optimal coordinate system for your data. The axes of this new system are called Principal Components.

  • The first principal component (PC1) is the direction in the data with the largest possible variance.
  • The second principal component (PC2) is the direction with the next largest variance, with the crucial constraint that it must be orthogonal to PC1.
  • PC3 is the next most important direction orthogonal to both PC1 and PC2, and so on.
  • These new axes are uncorrelated: once the data is re-expressed in them, the covariance between any two components is zero. (The sketch after this list makes the idea of a "direction of maximum variance" concrete.)

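To make "direction of maximum variance" concrete, here is a minimal NumPy sketch (the toy dataset, the 30° angle, and all variable names are illustrative assumptions, not part of the lesson): it brute-forces over candidate unit vectors, measures the variance of the data projected onto each one, and compares the winner with the first right singular vector of the centered data.

```python
import numpy as np

# Illustrative sketch: find the max-variance direction of a toy 2D point
# cloud by brute force, then compare with the SVD's first right singular vector.
rng = np.random.default_rng(0)
t = rng.normal(size=500)                 # large spread along the main direction
noise = 0.2 * rng.normal(size=500)       # small spread across it
angle = np.deg2rad(30)                   # the "true" main direction
X = np.column_stack([t * np.cos(angle) - noise * np.sin(angle),
                     t * np.sin(angle) + noise * np.cos(angle)])

B = X - X.mean(axis=0)                   # always center before PCA

# Measure the variance of the projection onto many candidate unit vectors.
thetas = np.linspace(0.0, np.pi, 1800)
directions = np.column_stack([np.cos(thetas), np.sin(thetas)])
projected_variance = ((B @ directions.T) ** 2).mean(axis=0)
best = directions[np.argmax(projected_variance)]

pc1 = np.linalg.svd(B, full_matrices=False)[2][0]   # first right singular vector
print("Max-variance direction (brute force):", best)
print("PC1 from SVD (same up to sign):      ", pc1)
```
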
The Complete PCA Algorithm

The Recipe
  1. Center the Data: For each feature (column) in your data matrix `X`, calculate its mean and subtract it from every entry in that column. This creates a new matrix `B` where every column has a mean of zero.
  2. Compute the SVD: Compute the Singular Value Decomposition of the centered data matrix: `B = UΣVᵀ`.
  3. Identify Principal Components: The principal components (the new axes of your data) are the columns of the matrix `V`.
  4. Measure Variance Explained: The variance explained by each principal component is proportional to the square of its corresponding singular value (`σᵢ²`); concretely, the variance along component `i` is `σᵢ² / (m − 1)`, where `m` is the number of data points (rows of `B`). This tells you how important each new axis is.
  5. Reduce Dimensionality: To reduce your data from `n` dimensions to `k` dimensions, select the first `k` columns of `V` to create a new matrix `V_k`.
  6. Project the Data: Your new, lower-dimensional dataset is the result of projecting the centered data `B` onto these new axes: `Transformed_Data = B * V_k`. (A code sketch of the whole recipe follows.)
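One way to turn the six steps above into code is the following NumPy sketch (the function name, the toy "flat galaxy" data, and the choice `k=2` are my own illustrations; the only library routine assumed is `numpy.linalg.svd`):

```python
import numpy as np

def pca(X, k):
    """Reduce X (samples x features) to k dimensions via SVD-based PCA."""
    # 1. Center the data: subtract each column's mean.
    B = X - X.mean(axis=0)

    # 2. Compute the (thin) SVD of the centered matrix: B = U Σ Vᵀ.
    U, S, Vt = np.linalg.svd(B, full_matrices=False)

    # 3. The principal components are the columns of V (rows of Vᵀ).
    V = Vt.T

    # 4. Variance explained is proportional to the squared singular values.
    explained_ratio = S**2 / np.sum(S**2)

    # 5. Keep the first k principal components.
    V_k = V[:, :k]

    # 6. Project the centered data onto the new axes.
    transformed = B @ V_k
    return transformed, V_k, explained_ratio

# Example: a nearly flat "galaxy" in 3D collapses well to 2D.
rng = np.random.default_rng(1)
disk = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0, 0.0],
                                              [0.0, 2.0, 0.0]])
galaxy = disk + 0.05 * rng.normal(size=(1000, 3))   # thin third dimension
coords_2d, axes, ratio = pca(galaxy, k=2)
print("Variance explained by the first two components:", ratio[:2].sum())
```

Note that `B @ V_k` equals `U_k Σ_k` (the first `k` columns of `U` scaled by the first `k` singular values), so the projected coordinates can also be read directly off the SVD factors.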

Summary: PCA in a Nutshell
  • PCA finds the directions of maximum variance in a dataset.
  • This is achieved by performing an SVD on the mean-centered data matrix.
  • The right singular vectors (`V`) are the principal components (the new, optimal axes).
  • The singular values (`Σ`) tell you the importance of each component; the check below verifies this numerically.
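
As a sanity check on the last two bullets, here is a small numerical verification on illustrative random data (nothing here comes from the lesson itself): the variance along each principal component equals `σᵢ² / (m − 1)`, and the covariance matrix of the transformed data is diagonal, i.e. the new axes are uncorrelated.

```python
import numpy as np

# Illustrative sanity check of the summary's claims on random toy data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated features
B = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(B, full_matrices=False)
scores = B @ Vt.T                                # the data in the new axes

# Variance along PC i equals sigma_i^2 / (m - 1) ...
print(np.allclose(scores.var(axis=0, ddof=1), S**2 / (B.shape[0] - 1)))   # True
# ... and the new axes are uncorrelated: off-diagonal covariances vanish.
cov = np.cov(scores, rowvar=False)
print(np.allclose(cov - np.diag(np.diag(cov)), 0.0))                      # True
```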