Lesson 8.3: The Vanishing/Exploding Gradient Problem
We have our RNN, a network with memory. But this memory is tragically short-term. This lesson explores the fundamental mathematical flaw that prevents simple RNNs from learning long-range dependencies. Understanding the 'vanishing gradient' problem is the key to appreciating why more complex architectures like LSTMs were invented.
Part 1: The Problem of Deep Networks
Let's revisit the concept of backpropagation. To update a weight in an early layer of a network, we use the chain rule, multiplying together the gradients of all the later layers as we work backwards.
For a very deep network with hidden layers $h_1, \dots, h_N$ and loss $L$, the gradient for a weight $W_1$ in the first layer is a long product of per-layer terms:

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_N}\cdot\frac{\partial h_N}{\partial h_{N-1}}\cdot\frac{\partial h_{N-1}}{\partial h_{N-2}}\cdots\frac{\partial h_2}{\partial h_1}\cdot\frac{\partial h_1}{\partial W_1}$$
Now, think about our "unrolled" RNN from the last lesson. It looks like a feed-forward network that is as deep as our sequence is long. To update the shared weights based on an error that happened 50 time steps ago, we have to backpropagate the error signal "back in time" through 50 layers.
This long chain of multiplications is the source of the problem.
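To make this concrete, here is a minimal sketch (using PyTorch autograd, with a toy hidden size and weight scales chosen purely for illustration) that unrolls a vanilla tanh RNN for 50 steps and measures how much gradient actually reaches the very first input:

```python
import torch

torch.manual_seed(0)
H, T = 16, 50                              # toy hidden size and sequence length (illustrative)
W_xh = torch.randn(H, H) * 0.1             # input-to-hidden weights
W_hh = torch.randn(H, H) * 0.2             # recurrent weights, deliberately small here

x = torch.randn(T, H)
x0 = x[0].clone().requires_grad_(True)     # track gradients flowing back to the first input

h = torch.zeros(H)
for t in range(T):
    x_t = x0 if t == 0 else x[t]
    h = torch.tanh(W_xh @ x_t + W_hh @ h)  # standard vanilla-RNN update

loss = h.sum()                             # pretend the error signal appears at step 50
grad = torch.autograd.grad(loss, x0)[0]    # backpropagate "through time" to step 1
print(f"gradient norm at step 1: {grad.norm().item():.2e}")
```

With small recurrent weights the printed norm is typically tiny: almost nothing of the error at step 50 survives the trip back to step 1. That is the vanishing behaviour we dissect next.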
Part 2: The Two Sides of the Same Coin
The Core Analogy: The 'Repeated Multiplication' Effect
Imagine you have a number and you multiply it by another number over and over again.
- **The Vanishing Gradient:** If you start with 1 and repeatedly multiply it by a number **less than 1** (e.g., 0.9), the result shrinks exponentially towards zero.

  $0.9^1 = 0.9,\quad 0.9^2 = 0.81,\quad 0.9^{10} \approx 0.35,\quad 0.9^{50} \approx 0.005$

  After 50 steps, the original signal is almost completely gone.
- **The Exploding Gradient:** If you start with 1 and repeatedly multiply it by a number **greater than 1** (e.g., 1.1), the result grows exponentially towards infinity.

  $1.1^1 = 1.1,\quad 1.1^2 = 1.21,\quad 1.1^{10} \approx 2.6,\quad 1.1^{50} \approx 117$

  After 50 steps, the signal has exploded (both runs are reproduced in the snippet below).
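You can reproduce these numbers with a few lines of plain Python (nothing here is specific to any library):

```python
for factor in (0.9, 1.1):
    value = 1.0
    for _ in range(50):          # multiply by the same factor 50 times
        value *= factor
    print(f"{factor} ** 50 = {value:.4f}")
# 0.9 ** 50 ≈ 0.0052  -> the signal has all but vanished
# 1.1 ** 50 ≈ 117.39  -> the signal has exploded
```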
The gradients in our RNN backpropagation are a long product of Jacobian matrices (matrices of partial derivatives). If the eigenvalues of these matrices are consistently less than 1, the gradient signal from the past **vanishes**. If they are consistently greater than 1, the gradient **explodes**.
In practice, with activation functions like tanh, whose derivative is at most 1 (and usually well below it), the **vanishing gradient problem is far more common and troublesome**.
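The same effect shows up with actual Jacobians. The sketch below (NumPy, with a toy hidden size and a deliberately small weight scale of my own choosing) multiplies the per-step Jacobians of a tanh RNN, diag(1 − tanh²(aₜ)) · W_hh, and watches the norm of the running product collapse:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16                                      # toy hidden size (illustrative)
W_hh = rng.normal(scale=0.1, size=(H, H))   # small recurrent weights -> vanishing regime

h = rng.normal(size=H)
product = np.eye(H)                         # running product of Jacobians, newest on the left
for t in range(1, 51):
    a = W_hh @ h + rng.normal(size=H)       # pre-activation with some random input
    h = np.tanh(a)
    J = np.diag(1.0 - h ** 2) @ W_hh        # Jacobian dh_t/dh_{t-1}; tanh' = 1 - tanh^2 <= 1
    product = J @ product
    if t % 10 == 0:
        print(f"step {t:2d}: ||product|| = {np.linalg.norm(product):.2e}")
```

Increasing the scale of `W_hh` pushes the same product toward the exploding regime instead.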
Part 3: The Consequences for Learning
The vanishing gradient problem has a devastating effect on an RNN's ability to learn.
Consequence 1: Loss of Long-Term Memory
The gradient is the "learning signal." If the gradient from 50 steps ago has vanished to zero by the time it reaches the first time step, then the model **cannot learn** the connection between the input at step 1 and the error at step 50.
For our language example, "The cat sat ... on the mat," if the sentence is very long, the gradient from the word "mat" will vanish before it gets back to the word "cat." The model can't learn that "cat" was the subject that influenced the final word.
Consequence 2: Unstable Training
The exploding gradient problem is easier to detect but just as bad. The weight updates become enormous, causing the model's parameters to oscillate wildly and diverge. The training process becomes completely unstable.
A common solution for exploding gradients is **gradient clipping**: if the norm of the gradient exceeds a certain threshold, you simply scale it back down. This is a practical but crude fix.
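Here is a minimal sketch of clipping by norm (the threshold of 5.0 is just an illustrative choice, not a fixed rule):

```python
import numpy as np

def clip_by_norm(grad, max_norm=5.0):
    """Rescale grad so its L2 norm never exceeds max_norm; leave it untouched otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])       # an "exploded" gradient with norm 50
print(clip_by_norm(g))           # rescaled to norm 5 -> [3. 4.]
```

Deep learning frameworks ship the same idea as a utility; in PyTorch, for example, it is `torch.nn.utils.clip_grad_norm_`.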
What's Next? A Brain with Gates
The simple RNN is fundamentally flawed. Because it repeatedly multiplies its hidden state by the same weight matrix, the gradient signal is mathematically doomed to either vanish or explode over long sequences.
We need a more sophisticated way to manage the flow of information through time. What if, instead of just a simple update rule, the neuron had "gates" that could make intelligent decisions?
- A **"Forget Gate"** that decides which pieces of old information are no longer relevant and should be discarded.
- An **"Input Gate"** that decides which pieces of new information are important enough to store in memory.
- An **"Output Gate"** that decides which pieces of the current memory should be used to make the prediction.
This system of gates is the core idea behind the two architectures that solved the vanishing gradient problem and powered the deep learning revolution for sequences: the **Long Short-Term Memory (LSTM)** network and its simpler cousin, the **Gated Recurrent Unit (GRU)**. In our next lesson, we will explore the architecture of these gated RNNs.