Lesson 3.5: The Master Recipe: Maximum Likelihood Estimation (MLE)

We now learn the most powerful and principled method for creating estimators. MLE flips the question of probability on its head to find the parameter values that make our observed data most probable. This single idea is the engine behind OLS, logistic regression, and most of modern machine learning.

Part 1: A Revolutionary Way of Thinking

The Method of Moments was intuitive but flawed. Maximum Likelihood Estimation (MLE), developed by the legendary statistician R.A. Fisher, provides a more powerful and robust approach by asking a fundamentally different question.

The Detective Analogy: A detective finds a size 11 footprint at a crime scene. She has two suspects: Suspect A wears size 8, Suspect B wears size 11.

  • The Old Question (Probability): If Suspect A is the culprit, what's the probability of a size 11 print? (Very low).
  • The New Question (Likelihood): Given the evidence (the size 11 print), which suspect is more *likely* to be the culprit? (Suspect B).

MLE works like the detective. It looks at the data we *actually collected* and asks: what value of our parameter makes this data the *most plausible, most likely, least surprising* outcome?

The Principle of Maximum Likelihood

The Maximum Likelihood Estimate (MLE), $\hat{\theta}_{MLE}$, is the specific value of the parameter $\theta$ (e.g., $\mu$, $p$, $\beta$) that maximizes the probability (the "likelihood") of observing the sample data $(X_1, \dots, X_n)$ that was actually collected.

Part 2: The Mechanics of MLE

Step 1: The Likelihood Function $L(\theta|\mathbf{X})$

The Likelihood Function is the joint probability (or density) of observing our i.i.d. sample, $\mathbf{X}=(X_1, \dots, X_n)$. Since the observations are independent, it is the product of the individual probability density/mass functions (PDF/PMF):

$$L(\theta \mid \mathbf{X}) = \prod_{i=1}^n f(X_i ; \theta)$$

Crucially, we treat this as a function of the parameter $\theta$, holding our data $\mathbf{X}$ fixed.
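
To make this concrete, here is a minimal sketch in Python (using a small made-up sample and a Normal model with a known scale of 1) that evaluates the likelihood as a product of densities, treating the data as fixed and $\mu$ as the variable:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical i.i.d. sample -- the data are held fixed
data = np.array([2.1, 1.8, 2.5, 2.0, 2.3])

def likelihood(mu, sigma=1.0):
    """L(mu | data): product of Normal densities evaluated at the sample."""
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# Evaluate the likelihood at a few candidate values of mu;
# the sample mean (2.14) gives the largest value
for mu in [1.0, 2.0, 2.14, 3.0]:
    print(f"mu = {mu:.2f}  ->  L = {likelihood(mu):.6f}")
```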

Step 2: The Log-Likelihood Function $\ell(\theta|\mathbf{X})$

Maximizing a long product of functions is a calculus nightmare. So, we use a brilliant trick: we maximize the natural logarithm of the likelihood instead.

The Logarithm Trick

The natural log, $\ln(x)$, is a strictly increasing function. This means the value of $\theta$ that maximizes $L(\theta)$ is the *exact same* value that maximizes $\ln(L(\theta))$. The log turns our difficult product into a manageable sum:

$$\ell(\theta|\mathbf{X}) = \ln(L(\theta|\mathbf{X})) = \sum_{i=1}^n \ln(f(X_i ; \theta))$$
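
A quick numerical check of the trick (a sketch, reusing the hypothetical sample above with a known scale of 1): the product form and the log-sum form are maximized at the same value of $\mu$.

```python
import numpy as np
from scipy.stats import norm

data = np.array([2.1, 1.8, 2.5, 2.0, 2.3])  # same hypothetical sample
mu_grid = np.linspace(0, 4, 401)

# Likelihood (product) and log-likelihood (sum) over a grid of candidate mu values
L = np.array([np.prod(norm.pdf(data, loc=m, scale=1.0)) for m in mu_grid])
ll = np.array([np.sum(norm.logpdf(data, loc=m, scale=1.0)) for m in mu_grid])

# Both are maximized at the same grid point -- the sample mean
print(mu_grid[np.argmax(L)], mu_grid[np.argmax(ll)], data.mean())
```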

The MLE Procedure

  1. Write down the log-likelihood function $\ell(\theta|\mathbf{X})$.
  2. Take the partial derivative of $\ell$ with respect to each parameter $\theta$. This is called the **score function**.
  3. Set the score function(s) to zero.
  4. Solve for the parameter(s). The solution is the MLE, $\hat{\theta}_{MLE}$.
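
As a brief worked illustration of these four steps (not part of the lesson's main thread), consider an i.i.d. Bernoulli($p$) sample with $X_i \in \{0, 1\}$:

$$\ell(p \mid \mathbf{X}) = \sum_{i=1}^n \ln\left( p^{X_i}(1-p)^{1-X_i} \right) = \left(\sum_i X_i\right)\ln p + \left(n - \sum_i X_i\right)\ln(1-p)$$

$$\frac{\partial \ell}{\partial p} = \frac{\sum_i X_i}{p} - \frac{n - \sum_i X_i}{1-p} = 0 \quad\Longrightarrow\quad \hat{p}_{MLE} = \frac{1}{n}\sum_{i=1}^n X_i = \bar{X}$$

The score is written down, set to zero, and solved, yielding the sample proportion as the MLE.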

Part 3: The Grand Unification: OLS is MLE

This is one of the most beautiful results in statistics. We can prove that the Ordinary Least Squares (OLS) method is simply a special case of the more general Maximum Likelihood principle, under one key assumption.

Proof: OLS is the MLE for a Normal Linear Model

Assumption: We have a linear model $y_i = \mathbf{x}_i^T\bm{\beta} + \epsilon_i$, where the errors are i.i.d. Normal: $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

This implies that the conditional distribution of $y_i$ is $y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T\bm{\beta}, \sigma^2)$.

Step 1: Write the log-likelihood for the entire sample.

The log of the Normal PDF for a single observation $y_i$ is:

$$\ln(f(y_i)) = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y_i - \mathbf{x}_i^T\bm{\beta})^2}{2\sigma^2}$$

The log-likelihood for all $n$ observations is the sum:

$$\ell(\bm{\beta}, \sigma^2) = \sum_{i=1}^n \left( -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(y_i - \mathbf{x}_i^T\bm{\beta})^2}{2\sigma^2} \right)$$
$$\ell(\bm{\beta}, \sigma^2) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2$$
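
A small verification sketch (with simulated data and made-up values for $\bm{\beta}$ and $\sigma$): summing the per-observation Normal log-densities gives the same number as the collected expression above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, sigma = 50, 0.7
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, -1.2])                       # hypothetical coefficients
y = X @ beta + rng.normal(scale=sigma, size=n)

# Sum of per-observation Normal log-densities ...
ll_sum = norm.logpdf(y, loc=X @ beta, scale=sigma).sum()

# ... equals the closed-form log-likelihood derived above
resid = y - X @ beta
ll_closed = -n / 2 * np.log(2 * np.pi * sigma**2) - (resid**2).sum() / (2 * sigma**2)
print(np.allclose(ll_sum, ll_closed))  # True
```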

Step 2: Find the $\bm{\beta}$ that maximizes the log-likelihood.

$$\hat{\bm{\beta}}_{MLE} = \arg\max_{\bm{\beta}} \left( -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2 \right)$$

The first term is a constant with respect to $\bm{\beta}$. Because the second term enters with a negative sign, maximizing the log-likelihood is equivalent to *minimizing* the sum of squares it contains.

$$\hat{\bm{\beta}}_{MLE} = \arg\min_{\bm{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^T\bm{\beta})^2$$

Conclusion: This is precisely the objective function for Ordinary Least Squares—minimizing the Sum of Squared Residuals (SSR)!

The OLS ≡ MLE Identity

Under the assumption of Normally distributed errors, the OLS estimator is identical to the Maximum Likelihood Estimator.

$$\hat{\bm{\beta}}_{OLS} = \hat{\bm{\beta}}_{MLE}$$

This gives OLS a powerful theoretical justification. It's not just an arbitrary method that minimizes squared errors; it's the method that finds the most plausible coefficients assuming a Normal error structure.
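
The identity can be illustrated numerically. In this sketch (simulated data, made-up true coefficients, and $\sigma$ treated as known), the least-squares solution attains a higher Normal log-likelihood than nearby perturbations of the coefficients:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate a linear model with Normal errors (hypothetical true values)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# OLS: the coefficients that minimize the sum of squared residuals
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def loglik(beta, sigma=0.5):
    """Normal log-likelihood of y given X @ beta, with sigma held fixed."""
    return norm.logpdf(y, loc=X @ beta, scale=sigma).sum()

# The OLS solution beats nearby coefficient values on likelihood
print(loglik(beta_ols))
print(loglik(beta_ols + np.array([0.05, 0.0])))
print(loglik(beta_ols + np.array([0.0, -0.05])))
```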

Report Card: Why MLE is the Champion

    MLE is the gold standard because its estimators have phenomenal large-sample properties:

    • Consistent: YES. MLEs converge in probability to the true parameter values.
    • Asymptotically Efficient: YES. For large samples, MLEs achieve the Cramér-Rao Lower Bound. No other consistent estimator has a lower variance. They are the "best in class."
    • Asymptotically Normal: YES. The distribution of an MLE approaches a Normal distribution as $n$ grows, which allows us to use t-tests and confidence intervals for inference.
    • Unbiased: NOT ALWAYS. MLEs can be biased in small samples. For example, the MLE for variance, $\hat{\sigma}^2_{MLE}$, divides by $n$ rather than $n-1$, making it biased downward. However, this bias disappears as the sample grows (see the simulation sketch below).
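
The small-sample bias of $\hat{\sigma}^2_{MLE}$ is easy to see by simulation. This is a sketch with made-up values (Normal data with true variance 4, samples of size 10): dividing by $n$ underestimates the variance on average, while dividing by $n-1$ does not.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0   # hypothetical true variance (sigma = 2)
n = 10           # small sample, where the bias is visible
reps = 100_000

samples = rng.normal(loc=0.0, scale=2.0, size=(reps, n))
mle_var = samples.var(axis=1, ddof=0)       # divides by n   (the MLE)
unbiased_var = samples.var(axis=1, ddof=1)  # divides by n-1 (bias-corrected)

print("MLE (divide by n):       ", mle_var.mean())       # about 3.6, below 4.0
print("Unbiased (divide by n-1):", unbiased_var.mean())  # about 4.0
```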

What's Next? When Calculus Fails

We've seen how to find the MLE for the Normal distribution by taking a derivative and setting it to zero. This works beautifully when we have a simple, closed-form solution.

But what happens when we use a more complex distribution (like the Student's t-distribution for fat-tailed financial data) or a complex model (like a GARCH model for volatility)? The math becomes intractable. We can't solve for $\hat{\theta}$ with pen and paper.

In the next lesson, we'll learn how computers handle this problem using **Numerical Optimization** to find the MLE.

Up Next: Finding the MLE with Numerical Optimization