Lesson 9.3: The Rise of Embeddings: The Intuition of Word2Vec
The Bag-of-Words model is useful but 'dumb'—it has no concept of meaning. This lesson introduces the revolutionary idea of word embeddings. We will explore the intuition behind the Word2Vec algorithm, which learns dense, meaningful vector representations for words by training a neural network to predict a word from its context.
Part 1: The Problem with 'One-Hot' Vectors
The Bag-of-Words and TF-IDF models we have already covered create vectors that are very **high-dimensional** (the size of the vocabulary) and **sparse** (mostly zeros). But their biggest flaw is that every word vector is perfectly **orthogonal** to every other.
In a vocabulary of 10,000 words, the vector for "good" might be `[0,0,...,1,...,0]` and the vector for "great" might be `[0,...,1,...,0,0]`. The dot product of these two vectors is zero. Mathematically, they are as unrelated as "good" and "chainsaw."
We need a new representation—a **dense embedding**—where similar words are close to each other in vector space.
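To make this concrete, here is a minimal sketch (a toy five-word vocabulary, with dense values that are simply made up for illustration) showing that one-hot vectors of different words always have a dot product of zero, while dense vectors allow graded similarity:

```python
import numpy as np

# Toy vocabulary of 5 words; in practice it would be tens of thousands.
vocab = ["good", "great", "chainsaw", "stock", "market"]

# One-hot vectors: a 1 in the word's slot, zeros everywhere else.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(np.dot(one_hot["good"], one_hot["great"]))     # 0.0 -- "unrelated"
print(np.dot(one_hot["good"], one_hot["chainsaw"]))  # 0.0 -- also "unrelated"

# Hypothetical dense embeddings (made-up numbers purely for illustration):
# similar words get nearby vectors, so their cosine similarity is high.
dense = {
    "good":     np.array([0.9, 0.1, 0.3]),
    "great":    np.array([0.8, 0.2, 0.4]),
    "chainsaw": np.array([-0.5, 0.9, -0.7]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense["good"], dense["great"]))     # close to 1.0
print(cosine(dense["good"], dense["chainsaw"]))  # much lower (negative here)
```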
Part 2: The Core Insight of Word2Vec (The 'Distributional Hypothesis')
The Core Idea: 'You shall know a word by the company it keeps.'
This linguistic idea is the heart of Word2Vec, developed by Tomas Mikolov and his colleagues at Google in 2013. The algorithm doesn't learn the "definition" of a word. Instead, it learns a vector representation for each word such that words that appear in similar contexts end up with similar vectors.
Consider these sentences:
- "The stock had a **strong** Q3 performance."
- "The stock had a **robust** Q3 performance."
The words "strong" and "robust" appear surrounded by almost identical context words ("a", "Q3", "performance"). Word2Vec is designed to force their vectors to be close together.
Word2Vec puts this idea into practice with a simple, shallow neural network trained on a "fake" supervised learning task. There are two main architectures:
1. Continuous Bag-of-Words (CBOW)
The Task: Predict a target word based on its surrounding context words.
Given `{"the", "cat", "on", "the", "mat" }`, predict `sat`.
2. Skip-gram
The Task: Predict the surrounding context words based on a single target word.
Given `sat`, predict `{"the", "cat", "on", "the", "mat"}`.
Skip-gram is generally preferred as it performs better for infrequent words. We will focus on its intuition.
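To see what this "fake" task's training data actually looks like, here is a minimal sketch of skip-gram (target, context) pair generation, assuming a context window of two words on each side:

```python
# A minimal sketch of skip-gram training-pair generation
# (window size and tokenization are simplified assumptions).
sentence = "the cat sat on the mat".split()
window = 2  # how many words on each side count as "context"

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# For the target "sat" (index 2), the pairs are:
# ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')
print([p for p in pairs if p[0] == "sat"])
```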
Part 3: A Walkthrough of the Skip-gram Architecture
Let's imagine our vocabulary has 10,000 words and we want to create embeddings of size 300 (i.e., a 300-dimensional vector for each word).
Imagine a diagram: Input Word -> Embedding Layer -> Output Layer -> Predicted Context Words.
- The Input: A single word, represented as a one-hot encoded vector of size 10,000.
- The Hidden Layer (The Embedding Matrix): This is the magic. The hidden layer is simply a giant lookup table—a matrix of weights. Let's call it `E`, with 10,000 rows and 300 columns. When our one-hot input vector (which is all zeros except for a single 1) is multiplied by this matrix, it effectively just **selects the corresponding row**. The output of this layer is the 300-dimensional "word vector" for our input word (see the sketch after this list).
- The Output Layer: This layer has 10,000 neurons, one for each word in the vocabulary, with a softmax activation function. It produces a probability distribution over the entire vocabulary.
- The Training Process: The network is trained using Gradient Descent. Its goal is to adjust the weights in the embedding matrix so that the output probability distribution gets closer and closer to predicting the actual context words from the training text.
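A small numpy sketch of this forward pass, with random weights standing in for trained ones (`E` follows the naming above; `W_out` is a name chosen here for the output-layer weights):

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300

# E is the hidden-layer weight matrix (the embedding matrix); W_out maps the
# embedding back to vocabulary-sized scores. Random values stand in for
# weights that would normally be learned by gradient descent.
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))
W_out = rng.normal(size=(embedding_dim, vocab_size))

# One-hot input for (say) word index 42.
x = np.zeros(vocab_size)
x[42] = 1.0

# Multiplying the one-hot vector by E just selects row 42.
hidden = x @ E
assert np.allclose(hidden, E[42])

# Output layer: a score for every word, squashed into probabilities by softmax.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)  # (10000,) -- a probability distribution over the vocabulary
```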
The 'Aha!' Moment: The Byproduct is the Treasure
After training on billions of words, the final neural network is actually **thrown away**. The "fake" task of predicting context words was just a means to an end.
The real treasure is the trained weight matrix of the hidden layer—the embedding matrix, `E`. This matrix is our word-to-vector dictionary. Each row is a dense, 300-dimensional vector that captures the "meaning" of a word based on all the contexts it appeared in.
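In practice, libraries such as gensim handle the training loop and expose this learned matrix directly. A minimal sketch, assuming gensim is installed and using a deliberately tiny toy corpus:

```python
# A minimal sketch using gensim (assumed installed). The two-sentence corpus
# is purely illustrative -- meaningful embeddings need a very large corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "stock", "had", "a", "strong", "q3", "performance"],
    ["the", "stock", "had", "a", "robust", "q3", "performance"],
]

model = Word2Vec(
    corpus,
    vector_size=300,  # dimensionality of each word vector
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep even words that appear only once
)

# model.wv exposes the rows of the trained embedding matrix; the prediction
# (output) layer is simply discarded once training is done.
print(model.wv["strong"].shape)  # (300,)
```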
Part 4: The Power of Dense Vectors - Vector Arithmetic
These learned embeddings are not just random numbers; they capture complex semantic relationships that allow us to perform "vector arithmetic" with words.
Famous Example: King - Man + Woman = Queen
The vector from "Man" to "King" captures the concept of "royalty." Adding this "royalty" vector to "Woman" lands us very close to the vector for "Queen" in the embedding space.
Financial Example: the same kind of arithmetic tends to hold for relationships that matter in financial text; for instance, the vector offset from "Japan" to "yen" is typically close to the offset from "USA" to "dollar" (country-to-currency analogies of this kind appear in the standard word-analogy benchmarks).
This allows models to understand analogies and relationships without being explicitly programmed to do so.
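A sketch of this arithmetic using a pre-trained GloVe model available through gensim's downloader (the model is fetched on first use; the printed neighbour describes typical behaviour, not guaranteed output):

```python
# Word-vector arithmetic on a pre-trained GloVe model, loaded via gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloaded on first use

# king - man + woman ~= ?
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] -- the word closest to the combined vector
```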
Key Takeaways
- Word2Vec learns **dense**, low-dimensional representations of words.
- It works by training a neural network on a "fake" task of predicting context.
- The trained hidden layer weights become the **word embeddings**.
- These embeddings capture semantic relationships, allowing for vector arithmetic with words.
- Using pre-trained embeddings (like GloVe or fastText) allows us to transfer knowledge from massive text corpora to our specific financial task, even with limited data (see the sketch below).
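A minimal sketch of that transfer pattern: average the pre-trained vectors of a document's words to get a fixed-length feature vector for a downstream classifier (the averaging approach and the sample headline are illustrative choices, not a prescribed pipeline):

```python
# Transfer learning with pre-trained embeddings: turn a document into the
# average of its word vectors, then feed that to any ordinary classifier.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pre-trained 100-dimensional GloVe

def document_vector(tokens, wv):
    """Average the embeddings of the in-vocabulary tokens (zeros if none match)."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

headline = "central bank raises rates amid strong earnings".split()
features = document_vector(headline, wv)
print(features.shape)  # (100,) -- ready to use as classifier input
```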
What's Next? The Problem of Polysemy
Word2Vec and its cousins (GloVe, fastText) were a massive leap forward. But they still have one fundamental limitation: each word has only **one** vector representation, regardless of its context.
Consider the word "bank":
- "I need to go to the **bank** to deposit a check." (A financial institution)
- "The canoe drifted towards the river **bank**." (The side of a river)
Word2Vec will produce the *same* vector for "bank" in both sentences. It cannot disambiguate meaning based on context. This is a problem.
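A brief sketch of why: a static model's lookup takes only the word itself, so there is no way for the surrounding sentence to influence the vector that comes back:

```python
# A static embedding model stores exactly one vector per word type.
# The lookup takes only the word, so both usages of "bank" below
# necessarily map to the very same row of the embedding matrix.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

vec_in_finance_sentence = wv["bank"]  # "...go to the bank to deposit a check"
vec_in_river_sentence = wv["bank"]    # "...drifted towards the river bank"

print((vec_in_finance_sentence == vec_in_river_sentence).all())  # True, by construction
```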
In the next lesson, we will explore the state-of-the-art solution to this problem: **contextual embeddings** from large language models like **BERT**, which are built on the Transformer architecture we met in Module 8.