Lesson 8.2: Recurrent Neural Networks (RNNs): A Network with Memory
We've identified the 'amnesia' problem of standard networks. This lesson introduces our first solution: the Recurrent Neural Network (RNN). We will explore how adding a 'loop' to the architecture allows information to persist, creating a 'hidden state' that acts as the network's memory. We'll also unpack the concept of 'backpropagation through time' to understand how these models learn.
Part 1: Introducing the Recurrent Loop
The core innovation of an RNN is breathtakingly simple. In an MLP, information flows strictly one way, from input to output. An RNN breaks this rule by adding a **feedback loop**: a layer's output is fed back to that same layer as part of its input at the next time step.
This loop gives rise to the **hidden state**, h_t. The hidden state is a vector that acts as the network's memory, carrying a summary of the entire sequence seen so far.
The Core Analogy: The Reader with a Running Summary
Imagine a person reading a sentence one word at a time, keeping a running summary in their head.
- t=1 (Input="The"): The reader sees "The." Their hidden state becomes a summary like: {subject: ?}.
- t=2 (Input="cat"): The reader sees "cat" AND their previous summary. They update their state: {subject: 'cat'}.
- t=3 (Input="sat"): They see "sat" and the state {subject: 'cat'}. They update: {subject: 'cat', verb: 'sat'}.
- t=4 (Input="on"): They see "on" and the state {subject: 'cat', verb: 'sat'}. They update: {subject: 'cat', verb: 'sat', preposition: 'on'}.
The hidden state at each step is a function of two things: the **new information** (the current word) and the **old memory** (the previous hidden state). This is exactly how an RNN works.
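Here is a toy (non-neural) sketch of that running-summary idea, just to show the shape of the computation: at each step, the new state is built from the previous state plus the current word. The role labels are made up purely for illustration.
roles = {"The": "article", "cat": "subject", "sat": "verb", "on": "preposition"}
def update_state(prev_state, word):
    new_state = dict(prev_state)                # carry the old memory forward
    new_state[roles.get(word, "other")] = word  # fold in the new information
    return new_state
state = {}  # empty memory before reading starts
for word in ["The", "cat", "sat", "on"]:
    state = update_state(state, word)
    print(word, "->", state)
An RNN does exactly this, except the state is a vector of numbers and the update rule is learned rather than hand-written.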
Part 2: The Mathematics of an RNN Cell
A simple RNN cell has two key equations that are repeated at every single time step.
The RNN Update Equations
- 1. The Hidden State Update: The new hidden state, h_t, is a function of the current input, x_t, and the previous hidden state, h_{t-1}: h_t = tanh(W_xh · x_t + W_hh · h_{t-1})
- 2. The Output Calculation: The output for the current time step, y_t, is a function of the new hidden state: y_t = W_hy · h_t
- W_xh, W_hh, and W_hy are the **weight matrices** that the network learns. Crucially, these weights are **shared** across all time steps. The network learns one set of rules for how to update its memory and applies it over and over.
- tanh is the hyperbolic tangent activation function, a rescaled sigmoid that squashes its input into the range (-1, 1).
The "unrolled" view of an RNN shows how this works. It's like having a chain of identical neural network blocks, where each block passes its memory on to the next one in the sequence.
Part 3: Training an RNN - Backpropagation Through Time (BPTT)
How do we train a network with a loop? We use a clever trick: we "unroll" the network through time and treat it as one very deep feed-forward network, where each time step is a layer. We can then apply our standard backpropagation algorithm.
This process is called **Backpropagation Through Time (BPTT)**.
However, there is a critical complication. Because the same weight matrices (W_xh and W_hh) are used at every single step, the gradient calculation for those weights at the end of the sequence involves summing up their influence from every single time step. This leads to a famous and devastating problem.
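The sketch below makes this visible on a tiny scalar RNN: it unrolls the forward pass by hand and then runs the backward pass, assuming for simplicity that the loss depends only on the final hidden state. The line to watch is the one where the gradient for the shared weight w_h accumulates a contribution from every time step.
import numpy as np

# Toy scalar RNN: h_t = tanh(w_h * h_{t-1} + w_x * x_t)
w_h, w_x = 0.8, 0.5
xs = np.array([1.0, -0.5, 0.3, 0.9])
T = len(xs)

# Forward pass: unroll through time, remembering every hidden state
h = np.zeros(T + 1)  # h[0] is the initial (empty) memory
for t in range(T):
    h[t + 1] = np.tanh(w_h * h[t] + w_x * xs[t])

target = 0.2
loss = 0.5 * (h[T] - target) ** 2

# Backward pass (BPTT): walk the unrolled chain in reverse
dL_dh = h[T] - target       # gradient flowing into the final hidden state
dL_dwh, dL_dwx = 0.0, 0.0
for t in reversed(range(T)):
    da = dL_dh * (1.0 - h[t + 1] ** 2)  # back through the tanh
    dL_dwh += da * h[t]     # shared weight: contributions are SUMMED over all time steps
    dL_dwx += da * xs[t]
    dL_dh = da * w_h        # hand the gradient back to the previous time step

print(f"loss={loss:.4f}, dL/dw_h={dL_dwh:.4f}, dL/dw_x={dL_dwx:.4f}")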
Part 4: Python Implementation
A Simple RNN in Keras/TensorFlow
Here is how you would build a simple RNN to classify movie reviews as positive or negative.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense
vocab_size = 10000
max_length = 500
embedding_dim = 32
model = Sequential([
# 1. Embedding Layer: Turns word IDs into dense vectors
Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
# 2. SimpleRNN Layer: The recurrent 'memory' part
# It processes the sequence of word vectors and outputs a single vector summary.
SimpleRNN(units=32),
# 3. Dense Output Layer: For final classification
# A single neuron with a sigmoid for binary classification (positive/negative)
Dense(units=1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()
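To actually train it, you might feed it the IMDB movie-review dataset that ships with Keras, roughly as sketched below; the epoch count, batch size, and validation split are arbitrary illustrative values.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Reviews arrive already encoded as integer word IDs; keep the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad or truncate every review to the fixed length the model expects
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)

model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.2)
model.evaluate(x_test, y_test)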
What's Next? The Tragic Flaw of Simple RNNs
The RNN is a brilliant idea, but the simple version has a fatal flaw that makes it nearly unusable in practice for long sequences. Backpropagating through many time steps means multiplying many per-step factors together, so the gradient signal from the distant past either shrinks toward zero (vanishes) or grows uncontrollably (explodes).
This means a simple RNN effectively has only short-term memory: it cannot learn dependencies between words that are far apart in a sentence.
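For a rough numeric intuition, consider that product of per-step factors; the 0.9 and 1.1 below are arbitrary stand-ins for a per-step gradient factor slightly below or above 1.
print(0.9 ** 100)  # ~2.7e-05  -> over 100 steps the signal vanishes
print(1.1 ** 100)  # ~13780.6  -> over 100 steps the signal explodes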
In our next lesson, we will dive deep into the mathematics of the **Vanishing and Exploding Gradient Problem**, understanding exactly why it happens and why it motivated the invention of more sophisticated recurrent architectures.