Lesson 1.14: Measuring Relationships: Covariance & Correlation
We've learned to describe relationships with distributions, but how can we summarize them with a single number? This lesson introduces Covariance and Correlation, the two most important metrics for quantifying how two variables move together. This is the cornerstone of portfolio theory, risk management, and regression analysis.
Part 1: The Intuition - The Four Quadrants of a Scatter Plot
Imagine we plot the daily returns of two stocks, Apple ($X$) and Microsoft ($Y$), on a scatter plot. We want to answer a simple question: "When Apple is above its average return, is Microsoft also likely to be above its average return?"
The Core Idea: We can divide the scatter plot into four quadrants by drawing a vertical line at the mean of X ($\mu_X$) and a horizontal line at the mean of Y ($\mu_Y$).
Imagine a scatter plot with a positive trend. A crosshair is drawn at the mean(X) and mean(Y). Most points are in the top-right and bottom-left quadrants.
Let's analyze the product of the deviations from the mean, $(X - \mu_X)(Y - \mu_Y)$, in each quadrant:
- Top-Right Quadrant: $X > \mu_X$ (positive deviation) and $Y > \mu_Y$ (positive deviation). The product is POSITIVE.
- Bottom-Left Quadrant: $X < \mu_X$ (negative deviation) and $Y < \mu_Y$ (negative deviation). The product is POSITIVE.
- Top-Left & Bottom-Right: One deviation is positive and one is negative. The product is NEGATIVE.
If the stocks tend to move together, most points will be in the top-right and bottom-left quadrants. The average of all these products will be positive. This "average product of deviations" is the **Covariance**.
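The quadrant argument can be checked by hand on a tiny dataset. The sketch below uses made-up daily returns (the numbers are purely illustrative, not real Apple or Microsoft data) and averages the products of deviations directly.

```python
# Hypothetical daily returns in percent; illustrative numbers only.
x = [1.0, -0.5, 2.0, -1.5, 0.5]   # stand-in for stock X
y = [0.8, -0.2, 1.5, -1.0, 0.4]   # stand-in for stock Y

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Product of deviations for each day: positive when both returns sit on
# the same side of their means (top-right or bottom-left quadrant).
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]

same_side_days = sum(1 for p in products if p > 0)
avg_product = sum(products) / len(products)

print("days in top-right/bottom-left quadrants:", same_side_days)
print("average product of deviations:", avg_product)
```

In this toy sample every day falls in the top-right or bottom-left quadrant, so the average product is positive.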
Part 2: Covariance - A Measure of Joint Variability
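Definition: Covariance
The covariance of two random variables $X$ and $Y$ is the expected product of their deviations from their means:
$$\text{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)\right]$$
In practice, we use the "shortcut" formula for calculations:
$$\text{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$$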
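As a quick numerical check, the definition $E[(X - \mu_X)(Y - \mu_Y)]$ and the shortcut $E[XY] - E[X]E[Y]$ should give the same number on any sample. The sketch below assumes NumPy and uses simulated data; the seed and the 0.5 slope are arbitrary choices for illustration.

```python
import numpy as np

# Simulated data (assumption: any paired sample would do here).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

# Definition: average product of deviations from the means.
cov_definition = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut: mean of the product minus the product of the means.
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

print(cov_definition, cov_shortcut)
```

Both values here divide by n (the population form). `np.cov` divides by n - 1 by default, so its result differs slightly unless `bias=True` is passed.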
The Problem with Covariance
Covariance gives us the *direction* of the relationship (positive or negative), but the *magnitude* is hard to interpret. If we measure returns in dollars, the covariance is in dollars-squared. If we measure in percentage, the number is completely different. It's not standardized.
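To see the scale problem concretely, the sketch below computes the covariance of the same hypothetical returns twice: once as decimals and once as percentages. The data is made up, and NumPy is assumed.

```python
import numpy as np

# Hypothetical daily returns expressed as decimals (0.010 = 1%).
returns_a = np.array([0.010, -0.005, 0.020, -0.015, 0.005])
returns_b = np.array([0.008, -0.002, 0.015, -0.010, 0.004])

cov_decimal = np.cov(returns_a, returns_b)[0, 1]

# Same data, expressed in percent (multiply both series by 100).
cov_percent = np.cov(returns_a * 100, returns_b * 100)[0, 1]

# Rescaling both variables by 100 rescales the covariance by 100 * 100.
print(cov_decimal, cov_percent, cov_percent / cov_decimal)
```

The relationship between the two series has not changed at all, yet the covariance differs by a factor of 10,000. Only the sign is directly interpretable.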
Part 3: Correlation - The Standardized Solution
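Definition: Pearson Correlation Coefficient
The Pearson correlation coefficient divides the covariance by the product of the two standard deviations:
$$\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$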
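A minimal sketch of the standardization step, assuming the usual definition of Pearson correlation as covariance divided by the product of the standard deviations. The returns are hypothetical, and the result is compared against NumPy's built-in `np.corrcoef`.

```python
import numpy as np

# Hypothetical daily returns (decimals); illustrative numbers only.
x = np.array([0.010, -0.005, 0.020, -0.015, 0.005])
y = np.array([0.008, -0.002, 0.015, -0.010, 0.004])

# Sample covariance and sample standard deviations (both use n - 1).
cov_xy = np.cov(x, y)[0, 1]
rho_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# Library result for comparison.
rho_numpy = np.corrcoef(x, y)[0, 1]

print(rho_manual, rho_numpy)
```

The n - 1 factors cancel in the ratio, so using the population versions of both the covariance and the standard deviations would give the same correlation.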
The Superpowers of Correlation
- Bounded Range: Correlation is always between -1 and +1: $-1 \le \rho_{X,Y} \le 1$.
  - $\rho_{X,Y} = +1$: Perfect positive linear relationship.
  - $\rho_{X,Y} = -1$: Perfect negative linear relationship.
- Unitless: Correlation has no units, allowing us to compare the strength of relationships between completely different pairs of variables.
- Measures LINEAR Relationships Only: This is a crucial limitation. A correlation of 0 does NOT mean "no relationship." It only means no *linear* relationship. A perfect U-shaped relationship could have $\rho_{X,Y} = 0$.
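The linear-only limitation is easy to demonstrate. In the sketch below (NumPy assumed), `y` is completely determined by `x` through a U-shaped curve, yet the correlation comes out as essentially zero because `x` is symmetric around 0. The choice of `y = x**2` is just one convenient example.

```python
import numpy as np

# x is evenly spaced and symmetric around zero.
x = np.linspace(-1, 1, 201)

# y depends on x exactly, but through a U-shaped (non-linear) curve.
y = x ** 2

rho = np.corrcoef(x, y)[0, 1]
print(rho)  # essentially zero, up to floating-point noise
```

Knowing `x` tells you `y` exactly here, so the variables are far from unrelated. The correlation is blind to that dependence because the upward and downward halves of the U cancel out.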
In Python
In the real world, you'll almost always compute a correlation matrix using a library like Pandas.
import pandas as pd
# df is a DataFrame with columns 'Apple_Return' and 'Microsoft_Return'
correlation_matrix = df.corr()
The concept of covariance/correlation is the mathematical engine behind diversification, the only "free lunch" in investing.
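The variance (risk) of a two-asset portfolio with weights $w_A$ and $w_B$ is given by:
$$\sigma_p^2 = w_A^2 \sigma_A^2 + w_B^2 \sigma_B^2 + 2\, w_A w_B \, \text{Cov}(R_A, R_B)$$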
If the covariance is negative, the third term (the covariance term) actively **reduces** the portfolio's total risk. By combining assets that don't move together, you can lower your risk without sacrificing expected return. This is the central idea of Modern Portfolio Theory and underpins much of professional asset allocation.
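A small worked example, assuming the standard two-asset formula: portfolio variance equals the two weighted variance terms plus twice the product of the weights and the covariance. The function name `portfolio_variance` and all input numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def portfolio_variance(w_a, sigma_a, sigma_b, cov_ab):
    """Variance of a two-asset portfolio; the second weight is 1 - w_a."""
    w_b = 1.0 - w_a
    return (
        w_a ** 2 * sigma_a ** 2
        + w_b ** 2 * sigma_b ** 2
        + 2 * w_a * w_b * cov_ab   # the covariance (third) term
    )

# Two assets, each with 20% volatility, held 50/50 (illustrative numbers).
sigma_a = sigma_b = 0.20

# Covariance = correlation * sigma_a * sigma_b.
var_positive = portfolio_variance(0.5, sigma_a, sigma_b, 0.5 * sigma_a * sigma_b)
var_negative = portfolio_variance(0.5, sigma_a, sigma_b, -0.5 * sigma_a * sigma_b)

print("correlation +0.5:", var_positive)
print("correlation -0.5:", var_negative)
```

Each asset alone has variance 0.04. With these inputs the portfolio variance is about 0.03 at a correlation of +0.5 and about 0.01 at a correlation of -0.5. Only the covariance term changes between the two cases.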
In Machine Learning, calculating the correlation matrix is the first step of feature analysis to detect multicollinearity, which can make linear regression models unstable.
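One simple way to screen for multicollinearity is to flag feature pairs whose absolute correlation exceeds some cutoff. The sketch below uses NumPy with simulated features. The feature names, the seed, and the 0.9 threshold are all assumptions for illustration, not a standard rule.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Simulated features: f2 is almost a copy of f1, f3 is unrelated.
f1 = rng.normal(size=n)
f2 = f1 + 0.05 * rng.normal(size=n)
f3 = rng.normal(size=n)

names = ["f1", "f2", "f3"]
features = np.column_stack([f1, f2, f3])

# rowvar=False treats each column as one variable.
corr = np.corrcoef(features, rowvar=False)

threshold = 0.9  # illustrative cutoff, not a universal rule
flagged = [
    (names[i], names[j], corr[i, j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
    if abs(corr[i, j]) > threshold
]

for a, b, r in flagged:
    print(f"{a} and {b} are highly correlated: {r:.3f}")
```

Pairwise correlation only catches redundancy between two features at a time. A feature that is a linear combination of several others can slip through this kind of check.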
What's Next? The Strongest Form of Separation
A correlation of zero means no linear relationship. But what if two variables have *no relationship whatsoever*, linear or non-linear?
This stronger condition is called **Statistical Independence**. The final lesson of our foundational module will formally define independence and show how it relates to (and differs from) having zero correlation.