Lesson 9.1: From Text to Numbers: Classic Techniques

Welcome to Module 9. We now enter the world of Natural Language Processing (NLP). Our first and most fundamental challenge is to convert unstructured, messy text into a structured, numerical format that a machine learning model can understand. This lesson introduces the classic but powerful techniques of Bag-of-Words and TF-IDF.

Part 1: The Core Problem - Models Only Understand Numbers

A machine learning model, whether it's a linear regression or a deep neural network, can't read. It can only perform mathematical operations on vectors and matrices of numbers. The entire field of NLP is built on finding clever ways to represent the meaning and structure of language numerically.

Our first task is to take a collection of documents (our "corpus") and convert each one into a numerical vector.

Part 2: Bag-of-Words (BoW) - The 'Shopping Bag' Approach

The **Bag-of-Words** model is the simplest and most intuitive way to do this. It completely ignores grammar, syntax, and word order, and represents a document simply as the collection of words it contains and how often each one appears, like a shopping bag full of groceries.

The BoW Algorithm

  1. Step 1: Create a Vocabulary. Collect every single unique word from your entire corpus of documents and create a master list. This is your "vocabulary." (e.g., `['a', 'and', 'buy', 'cell', 'is', 'it', 'sell', 'stock', 'the', 'this']`)
  2. Step 2: Create a Vector for Each Document. For each document, create a vector that is the same size as your vocabulary. Then, for each word in the vocabulary, simply count how many times it appears in that document.

A Simple Example

Corpus:

  • Doc 1: "buy the stock"
  • Doc 2: "sell the stock"

Vocabulary: `['buy', 'sell', 'stock', 'the']`

BoW Vectors:

  • Doc 1 Vector: `[1, 0, 1, 1]`
  • Doc 2 Vector: `[0, 1, 1, 1]`

We have successfully converted text into numbers that a model can use!
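
To make these two steps concrete, here is a minimal pure-Python sketch that reproduces the vectors above by hand (the variable names are illustrative, not taken from any library):

# Minimal Bag-of-Words by hand, on the two-document corpus above.
corpus = ["buy the stock", "sell the stock"]

# Step 1: build the vocabulary (every unique word, sorted for a stable order).
vocabulary = sorted({word for doc in corpus for word in doc.split()})
print(vocabulary)   # ['buy', 'sell', 'stock', 'the']

# Step 2: for each document, count how often each vocabulary word appears.
bow_vectors = [[doc.split().count(word) for word in vocabulary] for doc in corpus]
print(bow_vectors)  # [[1, 0, 1, 1], [0, 1, 1, 1]]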

Pros: Simple, fast, and surprisingly effective for many tasks like topic classification.

Cons: Loses all word order and context. "Man bites dog" and "Dog bites man" have the exact same BoW representation. It also gives equal importance to common words like "the" and important keywords like "default."
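
To see the word-order problem directly, here is a tiny sketch in the same hand-rolled style as above; both sentences collapse to the identical count vector:

# Order-blindness of Bag-of-Words: different sentences, identical vectors.
docs = ["man bites dog", "dog bites man"]
vocab = sorted({word for doc in docs for word in doc.split()})   # ['bites', 'dog', 'man']
vectors = [[doc.split().count(word) for word in vocab] for doc in docs]
print(vectors)  # [[1, 1, 1], [1, 1, 1]] -> word order is completely lost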

Part 3: TF-IDF - The 'Smart Count' Approach

How can we improve upon Bag-of-Words? We can be smarter about how we weight the word counts. A word like "the" appears in almost every document; it contains very little unique information. A word like "arbitrage" is rare; if it appears, it's a very strong signal about the document's topic.

Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting scheme that captures this idea. It gives a high score to words that are frequent in one document but rare across all other documents.

The TF-IDF Calculation

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$

  • Term Frequency (tf): "How often does this term appear in this document?"

    $$\text{tf}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}$$

  • Inverse Document Frequency (idf): "How rare is this term across all documents?"

    $$\text{idf}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right)$$

    This `log` term is key. If a word appears in every document, the ratio is 1, and $\log(1) = 0$. If it's a rare word, the ratio is large, and the `idf` score is high. A short numerical walk-through of these formulas follows below.
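
As a quick sanity check of these formulas, here is a small hand-worked sketch in plain Python on a made-up three-document corpus (note that library implementations such as scikit-learn's TfidfVectorizer use a smoothed variant of idf plus normalization, so their numbers will differ slightly):

import math

# Toy corpus: "the" appears in every document, "arbitrage" in only one.
documents = [
    "the stock is up".split(),
    "the bond is down".split(),
    "arbitrage on the spread".split(),
]

def tf(term, doc):
    # Term frequency: the share of the document's words that are this term.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency, using the textbook (unsmoothed) formula above.
    containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / containing)

doc = documents[2]  # "arbitrage on the spread"
print(tf("the", doc) * idf("the", documents))              # 0.25 * log(3/3) = 0.0
print(tf("arbitrage", doc) * idf("arbitrage", documents))  # 0.25 * log(3/1) ~= 0.27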

The Result: A 'Smart' Vector

An ML model trained on TF-IDF vectors will pay more attention to the rare, informative keywords and less attention to common "stop words," which often leads to noticeably better performance.

Part 4: Python Implementation

BoW and TF-IDF in Scikit-learn

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document about finance.',
    'This document is the second document about data.',
    'And this is the third one about finance and data.'
]

# --- Bag-of-Words ---
print("--- Bag-of-Words (CountVectorizer) ---")
count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(corpus)  # returns a sparse count matrix
print("Vocabulary:", count_vectorizer.get_feature_names_out())
print("BoW Matrix (dense form):\n", X_bow.toarray())  # .toarray() converts it for display


# --- TF-IDF ---
print("\n--- TF-IDF Vectorizer ---")
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)  # also returns a sparse matrix
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix (dense form):\n", X_tfidf.toarray())

# Note how in the TF-IDF matrix, words like 'is', 'the', and 'this', which appear
# in every document, get lower scores, while words unique to one document, like
# 'first', 'second', and 'third', get higher scores within their respective documents.
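
One way to verify that note is to inspect the idf weights the vectorizer learned. Here is a minimal sketch continuing from the fitted `tfidf_vectorizer` above (scikit-learn's idf is smoothed, so ubiquitous words get low but non-zero weights):

# Inspect the learned idf weights (continues from the code above).
# Words that appear in every document ('about', 'is', 'the', 'this') receive
# the lowest weights; words unique to one document receive the highest.
for word, weight in zip(tfidf_vectorizer.get_feature_names_out(), tfidf_vectorizer.idf_):
    print(f"{word:10s} idf = {weight:.3f}")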

What's Next? Finding Meaning in Text

We've successfully turned text into numbers. This is a huge step. But our representation is still "dumb." Our model has no idea that the word "good" is similar to "great" but the opposite of "terrible." Their vectors are just as far apart as those of any other two words.

How can we create a vector space where words with similar meanings are located close to each other? This is the revolutionary idea of **Word Embeddings**.

In our next lesson, we will explore our first practical NLP task: **Financial Sentiment Analysis**, using pre-built dictionaries to classify text as positive or negative.