Lesson 8.2: Recurrent Neural Networks (RNNs): A Network with Memory

We've identified the 'amnesia' problem of standard networks. This lesson introduces our first solution: the Recurrent Neural Network (RNN). We will explore how adding a 'loop' to the architecture allows information to persist, creating a 'hidden state' that acts as the network's memory. We'll also unpack the concept of 'backpropagation through time' to understand how these models learn.

Part 1: Introducing the Recurrent Loop

The core innovation of an RNN is breathtakingly simple. In an MLP, information flows strictly one way, from input to output. An RNN breaks this rule by adding a **feedback loop**: the layer's output (its hidden state) is fed back into the layer as an extra input at the next time step.

This loop creates the concept of a **hidden state** ($h_t$). This hidden state is a vector that acts as the network's memory, carrying a summary of the entire sequence seen so far.

The Core Analogy: The Reader with a Running Summary

Imagine a person reading a sentence one word at a time, keeping a running summary in their head.

  • t=1 (Input="The"): The reader sees "The." Their hidden state becomes a summary like: {subject: ?}.
  • t=2 (Input="cat"): The reader sees "cat" AND their previous summary. They update their state: {subject: 'cat'}.
  • t=3 (Input="sat"): They see "sat" and the state {subject: 'cat'}. They update: {subject: 'cat', verb: 'sat'}.
  • t=4 (Input="on"): They see "on" and the state {subject: 'cat', verb: 'sat'}. They update: {subject: 'cat', verb: 'sat', preposition: 'on'}.

The hidden state at each step is a function of two things: the **new information** (the current word) and the **old memory** (the previous hidden state). This is exactly how an RNN works.
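
In code, this loop amounts to carrying one variable forward through the sequence. The sketch below is purely conceptual; update_summary is a hypothetical stand-in for the learned update rule:

def update_summary(previous_summary, current_word):
    # Hypothetical: combine the old memory with the new information
    return previous_summary + [current_word]

summary = []                                   # the "memory" starts empty
for word in ["The", "cat", "sat", "on"]:
    summary = update_summary(summary, word)    # new memory = f(old memory, new word)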

Part 2: The Mathematics of an RNN Cell

A simple RNN cell has two key equations that are repeated at every single time step.

The RNN Update Equations

  1. The Hidden State Update: The new hidden state, $h_t$, is a function of the current input, $x_t$, and the previous hidden state, $h_{t-1}$.
    $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$
  2. The Output Calculation: The output for the current time step, $y_t$, is a function of the new hidden state.
    $y_t = W_{hy} h_t + b_y$
  • $W_{hh}$, $W_{xh}$, $W_{hy}$ are the **weight matrices** that the network learns. Crucially, these weights are **shared** across all time steps: the network learns one set of rules for how to update its memory and applies it over and over (see the NumPy sketch below).
  • $\tanh$ is the hyperbolic tangent activation function, a rescaled sigmoid that squashes its input into the range (-1, 1).
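
To make the equations concrete, here is a minimal NumPy sketch of a single RNN step. The sizes (hidden size 4, input size 3, output size 2) are arbitrary choices for illustration, and the weights are initialised randomly rather than learned:

import numpy as np

hidden_size, input_size, output_size = 4, 3, 2

# The learned parameters (random here, just for illustration)
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1
W_xh = np.random.randn(hidden_size, input_size) * 0.1
W_hy = np.random.randn(output_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(h_prev, x_t):
    # Equation 1: update the memory from the old memory and the new input
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # Equation 2: read an output off the new memory
    y_t = W_hy @ h_t + b_y
    return h_t, y_t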

The "unrolled" view of an RNN shows how this works: the loop becomes a chain of identical neural network blocks, one per time step, where each block passes its hidden state on to the next one in the sequence.
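
Continuing the sketch above, "unrolling" simply means applying the same rnn_step, with the same weights, once per time step, threading the hidden state from one block to the next:

sequence = [np.random.randn(input_size) for _ in range(5)]   # five dummy input vectors

h = np.zeros(hidden_size)          # h_0: the memory starts empty
outputs = []
for x_t in sequence:               # one block of the unrolled chain per time step
    h, y_t = rnn_step(h, x_t)      # the same weights are reused at every step
    outputs.append(y_t)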

Part 3: Training an RNN - Backpropagation Through Time (BPTT)

How do we train a network with a loop? We use a clever trick: we "unroll" the network through time and treat it as one very deep feed-forward network, where each time step is a layer. We can then apply our standard backpropagation algorithm.

This process is called **Backpropagation Through Time (BPTT)**.

However, there is a critical complication. Because the same weight matrix ($W_{hh}$) is used at every single step, the gradient calculation for that weight at the end of the sequence involves summing up its influence from every single time step. This leads to a famous and devastating problem.
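
The sketch below shows this idea with TensorFlow's automatic differentiation on dummy data and a toy loss; the point is that a single gradient is returned for W_hh even though it is used at every step of the unrolled loop:

import tensorflow as tf

hidden_size, input_size, seq_len = 4, 3, 5

W_hh = tf.Variable(tf.random.normal([hidden_size, hidden_size], stddev=0.1))
W_xh = tf.Variable(tf.random.normal([hidden_size, input_size], stddev=0.1))
b_h = tf.Variable(tf.zeros([hidden_size]))

xs = tf.random.normal([seq_len, input_size])   # a dummy input sequence
h = tf.zeros([hidden_size])

with tf.GradientTape() as tape:
    for t in range(seq_len):                   # the unrolled forward pass
        h = tf.tanh(tf.linalg.matvec(W_hh, h) + tf.linalg.matvec(W_xh, xs[t]) + b_h)
    loss = tf.reduce_sum(h ** 2)               # a toy loss on the final hidden state

# BPTT: one gradient per variable, accumulated over every time step it was used in
grads = tape.gradient(loss, [W_hh, W_xh, b_h])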

Part 4: Python Implementation

A Simple RNN in Keras/TensorFlow

Here is how you would build a simple RNN to classify movie reviews as positive or negative.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, SimpleRNN, Dense

vocab_size = 10000
max_length = 500
embedding_dim = 32

model = Sequential([
    # Input: each review arrives as a sequence of max_length word IDs
    Input(shape=(max_length,)),

    # 1. Embedding Layer: Turns word IDs into dense vectors
    Embedding(input_dim=vocab_size, output_dim=embedding_dim),

    # 2. SimpleRNN Layer: The recurrent 'memory' part
    # It processes the sequence of word vectors and outputs a single vector summary.
    SimpleRNN(units=32),

    # 3. Dense Output Layer: For final classification
    # A single neuron with a sigmoid for binary classification (positive/negative)
    Dense(units=1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.summary()
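
As a usage sketch, assuming the IMDB review dataset that ships with Keras (which matches vocab_size above), you could then train and evaluate the model like this:

from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load reviews as sequences of word IDs, keeping only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=vocab_size)

# Pad/truncate every review to the fixed length the model expects
x_train = pad_sequences(x_train, maxlen=max_length)
x_test = pad_sequences(x_test, maxlen=max_length)

model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.2)
model.evaluate(x_test, y_test)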

What's Next? The Tragic Flaw of Simple RNNs

The RNN is a brilliant idea, but the simple version has a fatal flaw that makes it almost unusable in practice for long sequences. When we backpropagate through many time steps, the gradient signal from the distant past either shrinks toward zero (vanishes) or grows without bound (explodes).

This means a simple RNN has a very short-term memory. It can't learn dependencies between words that are far apart in a sentence.
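
A quick back-of-the-envelope illustration: if each backward step scales the gradient by some factor, then over 50 steps that factor is raised to the 50th power (0.5 and 1.5 are arbitrary stand-ins here):

print(0.5 ** 50)   # ~8.9e-16: the signal from 50 steps ago has effectively vanished
print(1.5 ** 50)   # ~6.4e+08: or it explodes and destabilises training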

In our next lesson, we will dive deep into the mathematics of the **Vanishing and Exploding Gradient Problem**, understanding exactly why it happens and why it motivated the invention of more sophisticated recurrent architectures.