Lesson 9.3: The Rise of Embeddings: The Intuition of Word2Vec
The Bag-of-Words model is useful but 'dumb'—it has no concept of meaning. This lesson introduces the revolutionary idea of word embeddings. We will explore the intuition behind the Word2Vec algorithm, which learns dense, meaningful vector representations for words by training a neural network to predict a word from its context.
Part 1: The Problem with 'One-Hot' Vectors
The Bag-of-Words and TF-IDF models we have already covered create vectors that are very **high-dimensional** (the size of the vocabulary) and **sparse** (mostly zeros). But their biggest flaw is that every word vector is perfectly **orthogonal** to every other.
In a vocabulary of 10,000 words, the vector for "good" might be `[0,0,...,1,...,0]` and the vector for "great" might be `[0,...,1,...,0,0]`. The dot product of these two vectors is zero. Mathematically, they are as unrelated as "good" and "chainsaw."
We need a new representation—a **dense embedding**—where similar words are close to each other in vector space.
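To make this concrete, here is a minimal sketch (a toy five-word vocabulary, with dense values that are simply made up for illustration) showing that one-hot vectors of different words always have a dot product of zero, while dense vectors allow graded similarity:

```python
import numpy as np

# Toy vocabulary of 5 words; in practice it would be tens of thousands.
vocab = ["good", "great", "chainsaw", "stock", "market"]

# One-hot vectors: a 1 in the word's slot, zeros everywhere else.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

print(np.dot(one_hot["good"], one_hot["great"]))     # 0.0 -- "unrelated"
print(np.dot(one_hot["good"], one_hot["chainsaw"]))  # 0.0 -- also "unrelated"

# Hypothetical dense embeddings (made-up numbers purely for illustration):
# similar words get nearby vectors, so their cosine similarity is high.
dense = {
    "good":     np.array([0.9, 0.1, 0.3]),
    "great":    np.array([0.8, 0.2, 0.4]),
    "chainsaw": np.array([-0.5, 0.9, -0.7]),
}

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(dense["good"], dense["great"]))     # close to 1.0
print(cosine(dense["good"], dense["chainsaw"]))  # much lower (negative here)
```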
Part 2: The Core Insight of Word2Vec (The 'Distributional Hypothesis')
The Core Idea: 'You shall know a word by the company it keeps.'
This linguistic idea is the heart of Word2Vec, developed by Tomas Mikolov and his colleagues at Google in 2013. The algorithm doesn't learn the "definition" of a word. Instead, it learns a vector representation for each word such that words that appear in similar contexts end up with similar vectors.
Consider these sentences:
- "The stock had a **strong** Q3 performance."
- "The stock had a **robust** Q3 performance."
The words "strong" and "robust" appear surrounded by almost identical context words ("a", "Q3", "performance"). Word2Vec is designed to force their vectors to be close together.
Word2Vec puts this idea into practice with a simple, shallow neural network trained on a "fake" supervised learning task. There are two main architectures:
1. Continuous Bag-of-Words (CBOW)
The Task: Predict a target word based on its surrounding context words.
Given `{"the", "cat", "on", "the", "mat" }`, predict `sat`.
2. Skip-gram
The Task: Predict the surrounding context words based on a single target word.
Given `sat`, predict `{"the", "cat", "on", "the", "mat"}`.
Skip-gram is generally preferred as it performs better for infrequent words. We will focus on its intuition.
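To see what this "fake" task's training data actually looks like, here is a minimal sketch of skip-gram (target, context) pair generation, assuming a context window of two words on each side:

```python
# A minimal sketch of skip-gram training-pair generation
# (window size and tokenization are simplified assumptions).
sentence = "the cat sat on the mat".split()
window = 2  # how many words on each side count as "context"

pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# For the target "sat" (index 2), the pairs are:
# ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')
print([p for p in pairs if p[0] == "sat"])
```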
Part 3: A Walkthrough of the Skip-gram Architecture
Let's imagine our vocabulary has 10,000 words and we want to create embeddings of size 300 (i.e., a 300-dimensional vector for each word).
Imagine a diagram: Input Word -> Embedding Layer -> Output Layer -> Predicted Context Words.
- The Input: A single word, represented as a one-hot encoded vector of size 10,000.
- The Hidden Layer (The Embedding Matrix): This is the magic. The hidden layer is simply a giant lookup table—a matrix of weights. Let's call it `E`, with 10,000 rows and 300 columns. When our one-hot input vector (which is all zeros except for a single 1) is multiplied by this matrix, it effectively just **selects the corresponding row**. The output of this layer is the 300-dimensional "word vector" for our input word (see the sketch after this list).
- The Output Layer: This layer has 10,000 neurons, one for each word in the vocabulary, with a softmax activation function. It produces a probability distribution over the entire vocabulary.
- The Training Process: The network is trained using Gradient Descent. Its goal is to adjust the weights in the embedding matrix so that the output probability distribution gets closer and closer to predicting the actual context words from the training text.
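A small numpy sketch of this forward pass, with random weights standing in for trained ones (`E` follows the naming above; `W_out` is a name chosen here for the output-layer weights):

```python
import numpy as np

vocab_size, embedding_dim = 10_000, 300

# E is the hidden-layer weight matrix (the embedding matrix); W_out maps the
# embedding back to vocabulary-sized scores. Random values stand in for
# weights that would normally be learned by gradient descent.
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim))
W_out = rng.normal(size=(embedding_dim, vocab_size))

# One-hot input for (say) word index 42.
x = np.zeros(vocab_size)
x[42] = 1.0

# Multiplying the one-hot vector by E just selects row 42.
hidden = x @ E
assert np.allclose(hidden, E[42])

# Output layer: a score for every word, squashed into probabilities by softmax.
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(probs.shape)  # (10000,) -- a probability distribution over the vocabulary
```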
The 'Aha!' Moment: The Byproduct is the Treasure
After training on billions of words, the final neural network is actually **thrown away**. The "fake" task of predicting context words was just a means to an end.
The real treasure is the trained weight matrix of the hidden layer—the embedding matrix, `E`. This matrix is our word-to-vector dictionary. Each row is a dense, 300-dimensional vector that captures the "meaning" of a word based on all the contexts it appeared in.
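In practice, libraries such as gensim handle the training loop and expose this learned matrix directly. A minimal sketch, assuming gensim is installed and using a deliberately tiny toy corpus:

```python
# A minimal sketch using gensim (assumed installed). The two-sentence corpus
# is purely illustrative -- meaningful embeddings need a very large corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "stock", "had", "a", "strong", "q3", "performance"],
    ["the", "stock", "had", "a", "robust", "q3", "performance"],
]

model = Word2Vec(
    corpus,
    vector_size=300,  # dimensionality of each word vector
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = CBOW
    min_count=1,      # keep even words that appear only once
)

# model.wv exposes the rows of the trained embedding matrix; the prediction
# (output) layer is simply discarded once training is done.
print(model.wv["strong"].shape)  # (300,)
```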
Part 4: The Power of Dense Vectors - Vector Arithmetic
These learned embeddings are not just random numbers; they capture complex semantic relationships that allow us to perform "vector arithmetic" with words.
Famous Example: King - Man + Woman = Queen
The vector from "Man" to "King" captures the concept of "royalty." Adding this "royalty" vector to "Woman" lands us very close to the vector for "Queen" in the embedding space.
Financial Example: the same kind of arithmetic tends to hold for relationships that matter in financial text; for instance, the vector offset from "Japan" to "yen" is typically close to the offset from "USA" to "dollar" (country-to-currency analogies of this kind appear in the standard word-analogy benchmarks).
This allows models to understand analogies and relationships without being explicitly programmed to do so.
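A sketch of this arithmetic using a pre-trained GloVe model available through gensim's downloader (the model is fetched on first use; the printed neighbour describes typical behaviour, not guaranteed output):

```python
# Word-vector arithmetic on a pre-trained GloVe model, loaded via gensim.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloaded on first use

# king - man + woman ~= ?
result = wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)] -- the word closest to the combined vector
```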
Key Takeaways
- Word2Vec learns **dense**, low-dimensional representations of words.
- It works by training a neural network on a "fake" task of predicting context.
- The trained hidden layer weights become the **word embeddings**.
- These embeddings capture semantic relationships, allowing for vector arithmetic with words.
- Using pre-trained embeddings (like GloVe or fastText) allows us to transfer knowledge from massive text corpora to our specific financial task, even with limited data (see the sketch below).
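A minimal sketch of that transfer pattern: average the pre-trained vectors of a document's words to get a fixed-length feature vector for a downstream classifier (the averaging approach and the sample headline are illustrative choices, not a prescribed pipeline):

```python
# Transfer learning with pre-trained embeddings: turn a document into the
# average of its word vectors, then feed that to any ordinary classifier.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # pre-trained 100-dimensional GloVe

def document_vector(tokens, wv):
    """Average the embeddings of the in-vocabulary tokens (zeros if none match)."""
    vectors = [wv[t] for t in tokens if t in wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(wv.vector_size)

headline = "central bank raises rates amid strong earnings".split()
features = document_vector(headline, wv)
print(features.shape)  # (100,) -- ready to use as classifier input
```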
What's Next? The Problem of Polysemy
Word2Vec and its cousins (GloVe, fastText) were a massive leap forward. But they still have one fundamental limitation: each word has only **one** vector representation, regardless of its context.
Consider the word "bank":
- "I need to go to the **bank** to deposit a check." (A financial institution)
- "The canoe drifted towards the river **bank**." (The side of a river)
Word2Vec will produce the *same* vector for "bank" in both sentences. It cannot disambiguate meaning based on context. This is a problem.
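A brief sketch of why: a static model's lookup takes only the word itself, so there is no way for the surrounding sentence to influence the vector that comes back:

```python
# A static embedding model stores exactly one vector per word type.
# The lookup takes only the word, so both usages of "bank" below
# necessarily map to the very same row of the embedding matrix.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

vec_in_finance_sentence = wv["bank"]  # "...go to the bank to deposit a check"
vec_in_river_sentence = wv["bank"]    # "...drifted towards the river bank"

print((vec_in_finance_sentence == vec_in_river_sentence).all())  # True, by construction
```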
In the next lesson, we will explore the state-of-the-art solution to this problem: **contextual embeddings** from large language models like **BERT**, which are built on the Transformer architecture we met in Module 8.