Lesson 2.5: The Chi-Squared (χ²) Distribution
We now meet the first and most fundamental of the 'sampling distributions.' The Chi-Squared distribution is built from the Normal distribution and its primary job is to model the behavior of sample variance. It is the essential building block for constructing both the t-distribution and the F-distribution.
Part 1: Constructing the Chi-Squared Distribution
Imagine we take a standard normal variable, $Z \sim N(0, 1)$, and we square it. What is the distribution of $Z^2$? It's certainly not Normal anymore (it can't be negative!). This simple question leads us directly to the definition of the Chi-Squared distribution.
The Core Idea: A Chi-Squared distribution is the distribution of a sum of squared independent standard normal variables. It is the fundamental distribution for anything involving variance.
Definition: The χ² (Chi-Squared) Distribution
Let $Z_1, Z_2, \dots, Z_k$ be independent random variables, each distributed Standard Normal ($Z_i \sim N(0, 1)$). Then the sum of their squares

$$Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2$$

follows a Chi-Squared distribution with $k$ degrees of freedom ($Q \sim \chi^2_k$).
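To see the definition in action, here is a minimal simulation sketch (assuming NumPy and SciPy are available; the seed and choice of $k$ are arbitrary) that square-sums independent standard normals and compares the result against a $\chi^2_k$ reference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
k = 5                                   # degrees of freedom (arbitrary choice)
n_sims = 100_000

# Draw n_sims rows of k independent standard normals; square-sum each row.
Z = rng.standard_normal(size=(n_sims, k))
Q = (Z ** 2).sum(axis=1)                # each entry should follow chi2(k)

# Compare empirical moments against the chi-squared theory (mean k, variance 2k).
print(f"sample mean     = {Q.mean():.3f}   (theory: {k})")
print(f"sample variance = {Q.var():.3f}   (theory: {2 * k})")

# Kolmogorov-Smirnov test against chi2(k): a large p-value means the
# simulated sums are consistent with the Chi-Squared distribution.
print(f"KS p-value      = {stats.kstest(Q, 'chi2', args=(k,)).pvalue:.3f}")
```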
Part 2: Properties of the Chi-Squared Distribution
Imagine a plot showing several Chi-Squared curves. For low $k$ (like 2), it's highly skewed right. As $k$ increases to 10 or 20, the curve becomes more symmetric and bell-shaped.
- Shape: The distribution is always skewed to the right and is only defined for positive values (since it's a sum of squares).
- Symmetry: It becomes less skewed and more symmetric as the degrees of freedom ($k$) increase.
The mean and variance have a beautifully simple relationship with the degrees of freedom.
Chi-Squared Moments
If $X \sim \chi^2_k$:
Expected Value: $E[X] = k$
Variance: $\operatorname{Var}(X) = 2k$
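These moments follow directly from two standard normal facts, $E[Z^2] = \operatorname{Var}(Z) = 1$ and $E[Z^4] = 3$:

$$
E[X] = \sum_{i=1}^{k} E[Z_i^2] = k, \qquad
\operatorname{Var}(X) = \sum_{i=1}^{k} \left( E[Z_i^4] - E[Z_i^2]^2 \right) = k(3 - 1) = 2k.
$$

(The variance adds across terms because the $Z_i$ are independent.) So a $\chi^2_{10}$ variable has mean 10 and variance 20.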
Part 3: The Critical Connection to Sample Variance
The abstract definition is powerful, but the reason the Chi-Squared distribution is a cornerstone of statistics is its relationship to the sample variance ($S^2$).
Theorem: The Distribution of Sample Variance
Let $X_1, X_2, \dots, X_n$ be an i.i.d. sample from a $N(\mu, \sigma^2)$ population.
Let $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$ be the sample variance.
Then the following quantity has a Chi-Squared distribution with $n - 1$ degrees of freedom:

$$\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$$
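The theorem is easy to check by simulation. Here is a sketch (again assuming NumPy and SciPy; the population parameters $\mu = 50$, $\sigma = 4$, and $n = 10$ are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, mu, sigma = 10, 50.0, 4.0            # illustrative population parameters
n_sims = 100_000

# For each simulated sample of size n, compute the pivot (n-1) * S^2 / sigma^2.
samples = rng.normal(mu, sigma, size=(n_sims, n))
s2 = samples.var(axis=1, ddof=1)        # sample variance S^2 (divides by n-1)
pivot = (n - 1) * s2 / sigma ** 2

# The pivot should follow chi2 with n - 1 = 9 degrees of freedom.
print(f"mean     = {pivot.mean():.3f}   (theory: {n - 1})")
print(f"variance = {pivot.var():.3f}   (theory: {2 * (n - 1)})")
print(f"KS p     = {stats.kstest(pivot, 'chi2', args=(n - 1,)).pvalue:.3f}")
```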
Intuition: Why n-1 Degrees of Freedom?
We start with $n$ independent pieces of information. However, to calculate the sample variance $S^2$, we first have to calculate the sample mean $\bar{X}$ from the same data. The sample mean acts as one constraint on the data. For example, once you know $\bar{X}$ and the first $n - 1$ data points, the last data point is no longer free to vary; its value is fixed. Therefore, only $n - 1$ pieces of information are "free" to determine the sample variance.
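A tiny NumPy sketch (the sample values are arbitrary) makes the constraint concrete: deviations from the sample mean always sum to zero, so the last deviation is fully determined by the others:

```python
import numpy as np

x = np.array([4.0, 7.0, 1.0, 8.0])      # any sample works; the mean here is 5.0
deviations = x - x.mean()

print(deviations)        # [-1.  2. -4.  3.]
print(deviations.sum())  # 0.0 (up to floating-point rounding): only n-1 are free
```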
Part 4: Why the Chi-Squared Distribution Matters
- Foundation for Other Tests: This is its most important role. The Chi-Squared distribution is a prerequisite for understanding the next two distributions in our toolbox.
- The **t-distribution** is formed by a ratio involving a Normal and a Chi-Squared variable.
- The **F-distribution** is formed by a ratio of two Chi-Squared variables.
- Hypothesis Testing for Variance: This theorem allows us to construct confidence intervals and perform hypothesis tests on a population variance $\sigma^2$, a key task in quality control and financial risk assessment (e.g., "Is the volatility of our new trading strategy significantly lower than the old one?"). A sketch of this appears after this list.
- Goodness-of-Fit Tests: In machine learning and data analysis, the famous Pearson's Chi-Squared test uses this distribution to check if the observed counts in different categories match the expected counts from a theory. It's a fundamental tool for A/B testing and analyzing categorical data.
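To make the last two applications concrete, here is an illustrative sketch (assuming SciPy; the returns data and category counts are fabricated for demonstration). The first half inverts the Part 3 pivot to build a confidence interval for $\sigma^2$ using scipy.stats.chi2.ppf; the second half runs Pearson's goodness-of-fit test with scipy.stats.chisquare:

```python
import numpy as np
from scipy import stats

# --- Confidence interval for a population variance (uses the Part 3 theorem) ---
# Hypothetical data: 30 daily returns of a trading strategy (true sigma = 0.02).
rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0, scale=0.02, size=30)

n = len(returns)
s2 = np.var(returns, ddof=1)            # sample variance S^2 (divides by n-1)
alpha = 0.05

# Since (n-1) * S^2 / sigma^2 ~ chi2(n-1), inverting the pivot gives a 95% CI.
# Note: the lower CI bound uses the *upper* chi-squared quantile, and vice versa.
ci_lo = (n - 1) * s2 / stats.chi2.ppf(1 - alpha / 2, df=n - 1)
ci_hi = (n - 1) * s2 / stats.chi2.ppf(alpha / 2, df=n - 1)
print(f"95% CI for sigma^2: ({ci_lo:.6f}, {ci_hi:.6f})")

# --- Pearson goodness-of-fit test on categorical counts ---
# Hypothetical A/B/C test: did 600 users split evenly across 3 variants?
observed = np.array([215, 183, 202])
expected = np.array([200, 200, 200])    # counts must sum to the same total

chi2_stat, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2_stat:.2f}, p = {p_value:.3f}")   # df = 3 - 1 = 2
```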
What's Next? Building the t-Distribution
We've now mastered the distribution that governs sample variance. But in the real world, we almost never know the true population variance $\sigma^2$.
So what happens when we try to standardize our sample mean ($\bar{X}$) using our *estimate* of the standard deviation ($S$) instead of the true value ($\sigma$)? The result is no longer a perfect Z-distribution. It follows a new, slightly wider distribution designed to account for this extra uncertainty.
In the next lesson, we will combine the Normal and the Chi-Squared to derive the workhorse of all statistical inference: the Student's t-Distribution.