Lesson 3.6: Finding MLE Estimates via Optimization

We've defined MLE as finding the peak of the log-likelihood function. For simple models, we can use calculus. But for most real-world problems, the math is too hard. This lesson explains how computers 'train' a model by using numerical optimization algorithms to iteratively search for that peak.

Part 1: When Pen and Paper Fail

Finding an MLE involves solving the equation $\frac{\partial \ell(\theta)}{\partial \theta} = 0$. This leads to two distinct worlds of solutions.

Analytical Solution

This is a "closed-form" solution—a direct formula. It's what we found for OLS.

$$\hat{\bm{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$$

This is fast and exact. We prefer it whenever it exists.
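
To make this concrete, here is a minimal sketch of the closed-form computation in NumPy. The simulated data, variable names, and true coefficients are purely illustrative:

```python
import numpy as np

# Simulated regression data (illustrative): n observations, intercept + 2 predictors
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is numerically safer than explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```

No search is required: one linear-algebra step produces the exact answer.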

Numerical Solution

For most complex models (like GARCH in finance or Neural Networks in ML), the derivative equation is impossible to solve with algebra. There is no formula.

Instead, we must use an iterative algorithm to **search** for the best $\hat{\theta}$. This search is called **numerical optimization**.
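
As a first taste of what that search looks like in practice, here is a minimal sketch using SciPy's general-purpose optimizer. The Cauchy location parameter is a classic case with no closed-form MLE; the specific distribution, sample size, and starting value are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import cauchy

# The Cauchy location MLE has no closed form, so we must search numerically.
data = cauchy.rvs(loc=3.0, size=500, random_state=1)

def neg_log_lik(theta):
    # Optimizers minimize by convention, so we hand them the NEGATIVE
    # log-likelihood; minimizing -l(theta) maximizes l(theta).
    return -np.sum(cauchy.logpdf(data, loc=theta))

# Start from the sample median, a sensible guess for a location parameter.
result = minimize(neg_log_lik, x0=np.median(data))
print(result.x)  # numerical MLE, near the true location of 3.0
```

Under the hood, `minimize` runs exactly the kind of iterative search loop we formalize in Part 2.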

Part 2: The Logic of Hill Climbing

The Analogy: Climbing a Mountain in a Fog

Imagine the log-likelihood function is a mountain range, and you want to find the highest peak. The problem is, you're in a thick fog and can only see the ground right at your feet.

The logical strategy is to:

  1. Feel the ground to find the direction of steepest ascent (the **gradient**).
  2. Take a step in that direction.
  3. Repeat until every direction is downhill (you're at a peak).

This simple, iterative process is the heart of almost all modern model training.

The Simple Climber: Gradient Ascent

The simplest algorithm uses only the first derivative (the slope).

Gradient Ascent Algorithm

  1. Initialize: Start with a random guess, $\hat{\theta}_0$.
  2. Iterate: Repeatedly update your guess using the rule:
    $$\hat{\theta}_{\text{new}} = \hat{\theta}_{\text{old}} + \eta \cdot \nabla \ell(\hat{\theta}_{\text{old}})$$
    • $\eta$ is the **learning rate** (your step size).
    • $\nabla \ell(\theta)$ is the **gradient** (the direction of steepest ascent).
  3. Converge: Stop when the gradient is nearly zero ($\nabla \ell \approx 0$), meaning you've reached a peak. A runnable sketch of this loop follows below.

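Here is a minimal sketch of that loop, assuming a Normal model with known variance so the MLE (the sample mean) is easy to verify. The data, learning rate, and stopping tolerance are all illustrative choices:

```python
import numpy as np

# Gradient ascent for the mean of a Normal(mu, sigma^2) with sigma^2 known.
# The MLE here is the sample mean, so the answer is easy to check.
rng = np.random.default_rng(2)
data = rng.normal(loc=5.0, scale=2.0, size=200)
sigma2 = 4.0  # treated as known

def grad_log_lik(mu):
    # d/dmu of sum_i log N(x_i | mu, sigma2) = sum_i (x_i - mu) / sigma2
    return np.sum(data - mu) / sigma2

mu_hat = 0.0           # 1. Initialize with an arbitrary guess
eta = 0.01             # learning rate (step size)
for _ in range(1000):  # 2. Iterate
    g = grad_log_lik(mu_hat)
    mu_hat = mu_hat + eta * g
    if abs(g) < 1e-8:  # 3. Converge when the gradient is nearly zero
        break

print(mu_hat, data.mean())  # the two should agree
```
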
The Expert Climber: Newton-Raphson

A smarter climber uses not just the slope, but also the *curvature* of the mountain to take more intelligent steps.

  • On a gentle slope (low curvature), take a big step.
  • Near a sharp peak (high curvature), take a small, careful step.

The **Newton-Raphson** algorithm incorporates the second derivative (the Hessian, related to Fisher Information $I(\theta)$) to automatically adjust the step size.

Newton-Raphson Update Rule

$$\hat{\theta}_{\text{new}} = \hat{\theta}_{\text{old}} + [I(\hat{\theta}_{\text{old}})]^{-1} \cdot \nabla \ell(\hat{\theta}_{\text{old}})$$

This method converges much faster and is the standard for most statistical software (R, Python's `statsmodels`).
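
A minimal sketch of the update rule, using a Poisson rate parameter so the answer (again the sample mean) is easy to check; the data and starting value are illustrative:

```python
import numpy as np

# Newton-Raphson for the Poisson rate lambda. The MLE is the sample mean,
# which lets us verify the result.
rng = np.random.default_rng(3)
data = rng.poisson(lam=4.0, size=300)
n, S = len(data), data.sum()

lam = 1.0  # initial guess
for _ in range(20):
    score = S / lam - n   # gradient (first derivative) of the log-likelihood
    info = S / lam**2     # observed information (negative second derivative)
    step = score / info   # [I(lam)]^{-1} * gradient
    lam += step
    if abs(step) < 1e-10:
        break

print(lam, data.mean())  # the two should agree
```

Notice that this converges in a handful of iterations, versus the hundreds a fixed-step gradient method might need: the curvature information automatically scales each step.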

The Payoff: 'Training' is Just Optimization

This entire process is what machine learning practitioners mean when they say they are **"training a model."**

  • The "Loss Function": In ML, we usually *minimize* a loss function. Minimizing the negative log-likelihood is mathematically identical to *maximizing* the log-likelihood. The "Cross-Entropy Loss" used in classification is just another name for the negative log-likelihood of a Bernoulli/Multinomial distribution (see the numerical check just after this list).
  • Gradient Descent: When an ML engineer trains a neural network, they are running **Gradient Descent** (the minimization version of our algorithm) to find the weights ($\hat{\bm{\beta}}$) that minimize the loss.
  • SGD: For massive datasets, they use **Stochastic Gradient Descent (SGD)**, which approximates the gradient at each step using small "mini-batches" of data, making each update far cheaper.
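
The loss/likelihood identity is easy to verify numerically. A tiny sketch, with made-up labels and predicted probabilities:

```python
import numpy as np

# Binary cross-entropy IS the negative Bernoulli log-likelihood
# (here averaged over observations).
y = np.array([1, 0, 1, 1, 0])             # observed binary labels (illustrative)
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # a model's predicted probabilities

cross_entropy = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
neg_log_lik = -np.mean(np.log(np.where(y == 1, p, 1 - p)))

print(cross_entropy, neg_log_lik)  # identical numbers
```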

What's Next? Act III - Putting Our Estimates to Work

Congratulations! You have now completed "Act II" of Module 3. We have learned the properties of good estimators (Unbiased, Efficient, Consistent) and mastered the two main recipes for finding them: the simple Method of Moments and the powerful Maximum Likelihood Estimation.

We are now ready for "Act III": **Inference**. We have a good estimate, $\hat{\theta}$. Now what? How do we quantify our uncertainty and use our estimate to make decisions?

In the next lesson, we will learn how to build a range of plausible values around our estimate by constructing **Confidence Intervals**.

Up Next: Act III: General Construction of Confidence Intervals