Lesson 8.4: The Solution: LSTM and GRU
We've diagnosed the disease: vanishing gradients prevent RNNs from learning long-term patterns. This lesson introduces the cure. We'll explore the ingenious 'gated' architectures of Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), which explicitly control the flow of information, allowing them to remember or forget information over long sequences.
Part 1: The Core Idea - A Separate 'Memory Lane'
The problem with a simple RNN is that its hidden state, $h_t$, tries to do two jobs at once: it has to store long-term context while also being transformed at every step to make the next prediction. This constant transformation is what mangles the gradient.
The solution, pioneered by Sepp Hochreiter and Jürgen Schmidhuber, was to create a **separate, protected "memory lane"** that could carry information through time with minimal disruption. This is the **cell state**, $C_t$, in an LSTM.
The Core Analogy: The 'Conveyor Belt' of Memory
Imagine the LSTM cell state, $C_t$, is a conveyor belt running through time.
- By default, information on the conveyor belt just travels along unchanged. This makes it easy for gradients to flow backward through time without vanishing.
- The LSTM has three "gatekeepers" (neural network layers with sigmoid activations) who can interact with the conveyor belt at each time step.
Part 2: The Architecture of an LSTM Cell
An LSTM cell has three "gates" that control its memory. Each gate is a small neural network that reads the current input ($x_t$) and the previous hidden state ($h_{t-1}$) and outputs a number between 0 and 1, acting like a valve; the gate equations themselves are written out after the list below.
Imagine a complex diagram of an LSTM cell showing the input, previous hidden state, previous cell state, and the three gates (Forget, Input, Output) interacting to produce the new cell state and new hidden state.
The Three Gates of an LSTM
- The Forget Gate ($f_t$):
Decides what information to throw away from the old cell state, $C_{t-1}$. It looks at the new input and previous state and outputs a number between 0 ("forget this completely") and 1 ("keep this completely") for each piece of information in the cell state.
- The Input Gate ($i_t$):
Decides what new information to store in the cell state. This is a two-part process:
- A sigmoid layer ($i_t$) decides *which* values to update.
- A tanh layer creates a vector of new candidate values, $\tilde{C}_t$, that could be added.
- The Output Gate ($o_t$):
Decides what part of the (now updated) cell state to use for the output hidden state, $h_t$. A sigmoid layer decides which parts of the cell state to output, and its output is multiplied by the (tanh-squashed) cell state.
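For reference, the three gate activations and the candidate values are commonly written as follows, where $\sigma$ is the sigmoid function, $[h_{t-1}, x_t]$ denotes concatenation of the previous hidden state and the current input, and each gate has its own learned weights $W$ and bias $b$ (exact notation varies slightly between sources):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$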
Putting It All Together: The LSTM State Updates
The final update equations are:
1. Update the Cell State (The Conveyor Belt):
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
This is the key step. We "forget" some old information ($f_t \odot C_{t-1}$) and "add" some new information ($i_t \odot \tilde{C}_t$). The $\odot$ denotes element-wise multiplication.
2. Create the New Hidden State:
$$h_t = o_t \odot \tanh(C_t)$$
The output is a filtered version of the updated cell state.
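To make the moving parts concrete, here is a minimal NumPy sketch of a single LSTM step that follows the equations above. The parameter names, shapes, and the toy usage loop are illustrative assumptions, not a production implementation; in practice you would reach for a framework's built-in LSTM layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell_step(x_t, h_prev, c_prev, params):
    """One LSTM time step. `params` is an illustrative dict holding, per gate,
    a weight matrix of shape (hidden, hidden + input) and a bias of shape (hidden,)."""
    # Concatenate previous hidden state and current input: [h_{t-1}, x_t]
    z = np.concatenate([h_prev, x_t])

    f_t = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])    # candidate values
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate

    # Conveyor belt update: forget some old memory, add some new memory
    c_t = f_t * c_prev + i_t * c_tilde
    # Hidden state is a filtered view of the (tanh-squashed) cell state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Tiny usage example with random parameters (input size 3, hidden size 4)
rng = np.random.default_rng(0)
params = {name: rng.normal(scale=0.1, size=(4, 7)) for name in ["W_f", "W_i", "W_c", "W_o"]}
params.update({name: np.zeros(4) for name in ["b_f", "b_i", "b_c", "b_o"]})
h, c = np.zeros(4), np.zeros(4)
for x in rng.normal(size=(5, 3)):   # a sequence of 5 inputs
    h, c = lstm_cell_step(x, h, c, params)
```

Notice that the cell state update is purely element-wise: nothing repeatedly squashes $C_{t-1}$ before it is combined with the new information, which is exactly what keeps the gradient path along the conveyor belt well behaved.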
Part 3: Gated Recurrent Unit (GRU) - The Simpler Cousin
A Gated Recurrent Unit (GRU), introduced by Kyunghyun Cho et al., is a slightly simpler and more computationally efficient variation of the LSTM. It achieves similar performance on many tasks with a less complex architecture.
The main changes in a GRU are:
- Combined Cell and Hidden State: It merges the cell state and hidden state into a single state, $h_t$.
- Two Gates instead of Three: It has an **Update Gate** ($z_t$), which decides how much of the past state to keep, and a **Reset Gate** ($r_t$), which decides how much of the past state to forget when proposing a new state.
GRUs have fewer parameters than LSTMs and can train slightly faster, but LSTMs are often more powerful on datasets with very long-range dependencies.
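To see the simplification side by side with the LSTM sketch above, here is a minimal NumPy sketch of a single GRU step. The parameter names are again illustrative assumptions, and the update-gate convention shown (where $z_t$ weights the *old* state, matching the description above) is flipped in some papers and libraries.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell_step(x_t, h_prev, params):
    """One GRU time step: two gates and a single state vector h."""
    z_in = np.concatenate([h_prev, x_t])

    # Update gate: how much of the past state to keep
    z_t = sigmoid(params["W_z"] @ z_in + params["b_z"])
    # Reset gate: how much of the past state to use when proposing a new state
    r_t = sigmoid(params["W_r"] @ z_in + params["b_r"])

    # Candidate state, built from the *reset* previous state and the current input
    h_tilde = np.tanh(params["W_h"] @ np.concatenate([r_t * h_prev, x_t]) + params["b_h"])

    # Blend: keep a z_t-weighted share of the old state, fill the rest with the candidate
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde
    return h_t
```

With three weight matrices per step instead of the LSTM's four, this is where the GRU's smaller parameter count comes from.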
What's Next? Beyond Fixed Context
LSTMs and GRUs were the undisputed kings of sequence modeling for many years. They solved the vanishing gradient problem and enabled deep learning models to understand language and time series in ways that were previously impossible.
However, they still have a limitation. The entire "memory" of a long sequence must be compressed into a single, fixed-size hidden state vector, $h_t$. This can create a bottleneck. For very long sequences, like translating a long paragraph, is it reasonable to expect a single vector to remember every detail?
What if, when making a prediction at the end of a sentence, the model could "look back" and pay specific **attention** to the most relevant words from the past, rather than relying on a compressed summary?
This is the idea behind the **Attention Mechanism**, the subject of our next lesson and the final step on our journey to the Transformer architecture.