Lesson 9.2: Financial Sentiment Analysis with Dictionaries
This is our first practical NLP project. We will learn to quantify the sentiment of financial text (e.g., a news article or an earnings call transcript) by using a pre-built financial dictionary. This 'lexicon-based' approach is a simple but powerful technique for turning unstructured text into a tradable signal.
Part 1: The Goal - From Words to a Score
Financial markets are driven by narratives, emotions, and expectations. A company's earnings report contains not just numbers, but also management's tone—are they optimistic, cautious, or evasive? A news article can be bullish or bearish. Our goal is to create a model that can read a piece of text and assign it a single **sentiment score**.
The Sentiment Score
We want to compute a score where:
- Score > 0: The text is predominantly **Positive**.
- Score < 0: The text is predominantly **Negative**.
- Score ≈ 0: The text is **Neutral**.
Part 2: The 'Dictionary' (Lexicon) Approach
The simplest way to calculate a sentiment score is to use a pre-defined dictionary, or **lexicon**, where financial words have already been classified as positive or negative. One of the most famous and widely used is the **Loughran-McDonald Financial Sentiment Dictionary**.
This dictionary was created by analyzing tens of thousands of corporate 10-K filings. It contains lists of words that are contextually positive or negative in a financial setting.
Positive words include: `achieve`, `beneficial`, `excellence`, `growth`, `profitable`, `success`...
Negative words include: `adverse`, `claims`, `default`, `impairment`, `misstatement`, `volatile`...
(Note that words like "volatile" might be neutral in a general context but are almost always negative in a financial filing).
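The full Loughran-McDonald dictionary is distributed as a master CSV with one row per word, where a nonzero value in the `Positive` or `Negative` column flags that word's sentiment. A minimal loading sketch, using a small in-memory sample in that layout rather than the real file:

```python
import csv
import io

# A few sample rows in the Loughran-McDonald master-dictionary layout
# (in the real file, a nonzero Positive/Negative value marks membership).
sample_csv = """Word,Positive,Negative
ACHIEVE,2009,0
ADVERSE,0,2009
GROWTH,2009,0
IMPAIRMENT,0,2009
"""

positive_words, negative_words = set(), set()
for row in csv.DictReader(io.StringIO(sample_csv)):
    word = row["Word"].lower()
    if int(row["Positive"]) != 0:
        positive_words.add(word)
    if int(row["Negative"]) != 0:
        negative_words.add(word)

print(sorted(positive_words))  # ['achieve', 'growth']
print(sorted(negative_words))  # ['adverse', 'impairment']
```

To use the real dictionary, point `csv.DictReader` at the downloaded master file instead of the sample string.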
The Sentiment Scoring Algorithm
The algorithm is incredibly simple:
- Step 1: Preprocess the Text. Clean the raw text by converting it to lowercase, removing punctuation, and possibly removing common stop words.
- Step 2: Tokenize. Split the cleaned text into a list of individual words (tokens).
- Step 3: Count. Count the number of positive words and negative words in the document using your financial lexicon.
- Step 4: Calculate the Score. A common formula for the sentiment score is:

Sentiment Score = (Positive Count − Negative Count) / (Positive Count + Negative Count)

This normalizes the score to be between -1 (completely negative) and +1 (completely positive).
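Plugging sample counts into the formula shows how the normalization works:

```python
# Worked example of the normalized score, assuming a document with
# 6 positive hits and 2 negative hits from the lexicon
positive_count, negative_count = 6, 2
score = (positive_count - negative_count) / (positive_count + negative_count)
print(score)  # 0.5 -- mildly positive
```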
Part 3: Python Implementation
Building a Simple Sentiment Scorer
```python
import re

# --- A simplified Loughran-McDonald dictionary ---
lm_positive = {'achieve', 'beneficial', 'excellence', 'growth', 'profitable', 'success'}
lm_negative = {'adverse', 'claims', 'default', 'impairment', 'misstatement', 'volatile'}

def preprocess_text(text):
    """
    Cleans text by lowercasing and removing punctuation and numbers.
    """
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # Remove anything that's not a letter or space
    return text

def calculate_sentiment(text):
    """
    Calculates the sentiment score of a text using the lexicon-based approach.
    """
    # 1. Preprocess
    cleaned_text = preprocess_text(text)
    # 2. Tokenize
    words = cleaned_text.split()
    # 3. Count
    positive_count = sum(1 for word in words if word in lm_positive)
    negative_count = sum(1 for word in words if word in lm_negative)
    # 4. Calculate Score
    total_sentiment_words = positive_count + negative_count
    if total_sentiment_words == 0:
        return 0.0  # Neutral if no sentiment words are found
    sentiment_score = (positive_count - negative_count) / total_sentiment_words
    return sentiment_score

# --- Example Usage ---
earnings_call_transcript = """
Despite volatile market conditions and some adverse claims, we achieved profitable growth.
Our success is a result of operational excellence.
"""

score = calculate_sentiment(earnings_call_transcript)
print(f"Document: '{earnings_call_transcript.strip()}'")
print("-" * 20)
print("Positive words found: 4 ('profitable', 'growth', 'success', 'excellence')")
print("Negative words found: 3 ('volatile', 'adverse', 'claims')")
# Note: 'achieved' is NOT counted -- the lexicon contains only the exact token
# 'achieve', a mismatch that stemming or lemmatization would address.
print(f"Calculated Sentiment Score: {score:.4f}")
# Expected score: (4 - 3) / (4 + 3) = 1/7 ≈ 0.1429 (slightly positive)
```
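In practice, the scorer would be applied to a stream of documents to produce a signal series. A minimal sketch, with a compact restatement of the scorer and a few invented headlines standing in for a real news feed:

```python
import re

lm_positive = {'achieve', 'beneficial', 'excellence', 'growth', 'profitable', 'success'}
lm_negative = {'adverse', 'claims', 'default', 'impairment', 'misstatement', 'volatile'}

def calculate_sentiment(text):
    """Lexicon-based sentiment score in [-1, +1]."""
    words = re.sub(r'[^a-z\s]', '', text.lower()).split()
    pos = sum(w in lm_positive for w in words)
    neg = sum(w in lm_negative for w in words)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# Hypothetical headlines; in practice these would come from a news API or filings.
headlines = [
    "Company reports profitable growth and continued success",
    "Regulator cites adverse claims and a possible default",
    "Quarterly report released on schedule",
]
scores = {h: calculate_sentiment(h) for h in headlines}
for headline, s in scores.items():
    print(f"{s:+.3f}  {headline}")
```

The first headline scores +1.0 (three positive hits, no negative), the second -1.0, and the third 0.0 (no lexicon words at all) -- a reminder that many documents will be scored neutral simply because they contain no dictionary terms.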
Part 4: Pros and Cons of the Lexicon Approach
Pros:
- Simple and Fast: The algorithm is extremely easy to implement and computationally very cheap.
- Interpretable: You know exactly why a document received a certain score—you can point to the specific words that drove the calculation.
- Domain-Specific: Using a specialized dictionary like Loughran-McDonald provides much better results for financial text than a general-purpose sentiment dictionary.
Cons:
- Inability to Handle Negation: It cannot understand that "not profitable" has the opposite meaning of "profitable." It would count "profitable" as a positive word.
- No Context: It has no understanding of context. A sentence like "we avoided claims of misstatement" would be scored as highly negative because it contains two negative keywords.
- Limited Vocabulary: It can only score words that are in its dictionary. It doesn't know that "superb" is similar to "excellent."
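A partial remedy for the negation problem is a heuristic: flip the polarity of a sentiment word when it is immediately preceded by a negator. A minimal sketch, where the negator list and the one-word lookback window are simplifying assumptions (real negation scopes are longer and messier):

```python
import re

lm_positive = {'achieve', 'beneficial', 'excellence', 'growth', 'profitable', 'success'}
lm_negative = {'adverse', 'claims', 'default', 'impairment', 'misstatement', 'volatile'}
negators = {'not', 'no', 'never', 'without'}  # assumed negator list

def calculate_sentiment_with_negation(text):
    """Lexicon score that flips polarity when a negator directly precedes a hit."""
    words = re.sub(r'[^a-z\s]', '', text.lower()).split()
    pos = neg = 0
    for i, word in enumerate(words):
        negated = i > 0 and words[i - 1] in negators
        if word in lm_positive:
            pos, neg = (pos, neg + 1) if negated else (pos + 1, neg)
        elif word in lm_negative:
            pos, neg = (pos + 1, neg) if negated else (pos, neg + 1)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

print(calculate_sentiment_with_negation("the venture was not profitable"))  # -1.0
print(calculate_sentiment_with_negation("we delivered profitable growth"))  # 1.0
```

Even this simple tweak correctly scores "not profitable" as negative, though it still fails on longer-range constructions like "we do not expect the venture to be profitable."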
What's Next? Learning the Meaning of Words
The lexicon-based approach is a powerful baseline, but its fundamental limitation is that it treats every word as a separate, isolated entity. It has no concept of "meaning" or "similarity."
How can we create a system where the vector for the word "success" is mathematically "close" to the vector for "profitable"? How can we teach a model that the relationship between "France" and "Paris" is similar to the relationship between "Japan" and "Tokyo"?
The solution is to learn vector representations for words based on the context in which they appear. This is the revolutionary idea behind **Word Embeddings**, and in our next lesson, we will explore the algorithm that started it all: **Word2Vec**.