Lesson 9.1: From Text to Numbers: Classic Techniques
Welcome to Module 9. We now enter the world of Natural Language Processing (NLP). Our first and most fundamental challenge is to convert unstructured, messy text into a structured, numerical format that a machine learning model can understand. This lesson introduces the classic but powerful techniques of Bag-of-Words and TF-IDF.
Part 1: The Core Problem - Models Only Understand Numbers
A machine learning model, whether it's a linear regression or a deep neural network, can't read. It can only perform mathematical operations on vectors and matrices of numbers. The entire field of NLP is built on finding clever ways to represent the meaning and structure of language numerically.
Our first task is to take a collection of documents (our "corpus") and convert each one into a numerical vector.
Part 2: Bag-of-Words (BoW) - The 'Shopping Bag' Approach
The **Bag-of-Words** model is the simplest and most intuitive way to do this. It completely ignores grammar, syntax, and word order, and represents a document simply as the collection of words it contains and how many times each appears, like a shopping bag full of groceries.
The BoW Algorithm
- Step 1: Create a Vocabulary. Collect every single unique word from your entire corpus of documents and create a master list. This is your "vocabulary." (e.g., `['a', 'and', 'buy', 'cell', 'is', 'it', 'sell', 'stock', 'the', 'this']`)
- Step 2: Create a Vector for Each Document. For each document, create a vector that is the same size as your vocabulary. Then, for each word in the vocabulary, simply count how many times it appears in that document.
Corpus:
- Doc 1: "buy the stock"
- Doc 2: "sell the stock"
Vocabulary: `['buy', 'sell', 'stock', 'the']`
BoW Vectors:
- Doc 1 Vector: `[1, 0, 1, 1]`
- Doc 2 Vector: `[0, 1, 1, 1]`
We have successfully converted text into numbers that a model can use!
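To make the two steps concrete, here is a minimal sketch in plain Python (no libraries assumed) that builds the vocabulary and the count vectors for the two documents above.

```python
# Minimal Bag-of-Words sketch for the two-document corpus above.
corpus = ["buy the stock", "sell the stock"]

# Step 1: build the vocabulary (every unique word, sorted for a stable order).
vocabulary = sorted({word for doc in corpus for word in doc.split()})
print(vocabulary)  # ['buy', 'sell', 'stock', 'the']

# Step 2: for each document, count how often each vocabulary word appears.
bow_vectors = [[doc.split().count(word) for word in vocabulary] for doc in corpus]
print(bow_vectors)  # [[1, 0, 1, 1], [0, 1, 1, 1]]
```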
Pros: Simple, fast, and surprisingly effective for many tasks like topic classification.
Cons: Loses all word order and context. "Man bites dog" and "Dog bites man" have the exact same BoW representation. It also gives equal importance to common words like "the" and important keywords like "default."
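To see the word-order limitation directly, here is a tiny sketch (using scikit-learn's `CountVectorizer`, which is introduced properly in Part 4) showing that the two sentences produce identical vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

# "Man bites dog" and "Dog bites man" contain exactly the same words,
# so their Bag-of-Words vectors are identical.
docs = ["man bites dog", "dog bites man"]
X = CountVectorizer().fit_transform(docs).toarray()
print(X)  # both rows are [1, 1, 1] over the vocabulary ['bites', 'dog', 'man']
```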
Part 3: TF-IDF - The 'Smart Count' Approach
How can we improve upon Bag-of-Words? We can be smarter about how we weight the word counts. A word like "the" appears in almost every document; it contains very little unique information. A word like "arbitrage" is rare; if it appears, it's a very strong signal about the document's topic.
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting scheme that captures this idea. It gives a high score to words that are frequent in one document but rare across all other documents.
The TF-IDF Calculation
- Term Frequency (tf): "How often does this term appear in this document?" A common form is `tf(t, d) = (count of t in d) / (total terms in d)`.
- Inverse Document Frequency (idf): "How rare is this term across all documents?" The standard form is `idf(t) = log(N / df(t))`, where `N` is the total number of documents and `df(t)` is the number of documents containing the term. This `log` term is key. If a word appears in every document, the ratio is 1, and `log(1) = 0`. If it's a rare word, the ratio is large, and the `idf` score is high.
- The final score is the product of the two: `tf-idf(t, d) = tf(t, d) * idf(t)`.
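To make the weighting concrete, here is a minimal sketch of the raw calculation on a tiny invented corpus, using the plain `log(N / df)` form above. (Note that scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row, so its exact numbers will differ.)

```python
import math

corpus = [doc.split() for doc in ["buy the stock", "sell the stock", "the market is up"]]
N = len(corpus)

def tf(term, doc):
    # Term frequency: how often the term appears in this document,
    # divided by the document's length.
    return doc.count(term) / len(doc)

def idf(term):
    # Inverse document frequency: log of (total docs / docs containing the term).
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("the", corpus[0]))  # 0.0    -> 'the' appears in every document
print(tf_idf("buy", corpus[0]))  # ~0.366 -> 'buy' is rare, so it scores high
```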
The Result: A 'Smart' Vector
An ML model trained on TF-IDF vectors will pay more attention to the rare, informative keywords and less attention to common "stop words," leading to much better performance.
Part 4: Python Implementation
BoW and TF-IDF in Scikit-learn
```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
corpus = [
'This is the first document about finance.',
'This document is the second document about data.',
'And this is the third one about finance and data.'
]
# --- Bag-of-Words ---
print("--- Bag-of-Words (CountVectorizer) ---")
count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(corpus)
print("Vocabulary:", count_vectorizer.get_feature_names_out())
print("BoW Matrix (sparse):\n", X_bow.toarray())
# --- TF-IDF ---
print("\n--- TF-IDF Vectorizer ---")
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)
print("Vocabulary:", tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Matrix (sparse):\n", X_tfidf.toarray())
# Note how in the TF-IDF matrix, words like 'this', 'is', 'the', which appear
# in every document, get lower scores, while words like 'first', 'second', 'third',
# which are unique to one document, get higher scores within that document.
```
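As noted in Part 3, these feature matrices are meant to feed a downstream model. Here is a minimal sketch of that step using a scikit-learn Pipeline; the texts and labels below are invented purely for illustration (1 = finance-related, 0 = not).

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy training data: 1 = about finance, 0 = not about finance.
texts = [
    "buy the stock before earnings",
    "sell the stock after the report",
    "the recipe needs two cups of flour",
    "bake the bread for thirty minutes",
]
labels = [1, 1, 0, 0]

# TF-IDF features plus a simple linear classifier, chained in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["the stock looks cheap, buy it"]))  # likely output: [1]
```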
What's Next? Finding Meaning in Text
We've successfully turned text into numbers. This is a huge step. But our representation is still "dumb." Our model has no idea that the word "good" is similar to "great" but the opposite of "terrible." The vectors for these words are just as far apart as any other two words.
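A tiny sketch illustrates this: in a count-based vocabulary space, each word occupies its own dimension, so any two distinct words are equally (and maximally) dissimilar.

```python
import numpy as np

# Each row of the identity matrix is a "one-hot" word vector.
vocab = ["good", "great", "terrible"]
one_hot = np.eye(len(vocab))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(one_hot[0], one_hot[1]))  # good vs great    -> 0.0
print(cosine(one_hot[0], one_hot[2]))  # good vs terrible -> 0.0
```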
How can we create a vector space where words with similar meanings are located close to each other? This is the revolutionary idea of **Word Embeddings**.
In our next lesson, we will explore our first practical NLP task: **Financial Sentiment Analysis**, using pre-built dictionaries to classify text as positive or negative.