Lesson 9.5: Key NLP Tasks for Finance

We've learned how to turn text into meaningful vectors. Now we apply this to solve two critical problems for a quantitative analyst: automatically extracting specific pieces of information (like company names) and discovering the hidden thematic structure in a vast sea of documents.

Part 1: Named Entity Recognition (NER) - The 'Who' and 'What'

A financial news article or report is full of structured entities hidden within unstructured text. **Named Entity Recognition (NER)** is the task of locating and classifying these named entities into pre-defined categories.

The Core Analogy: The 'Highlighter Pen' for Text

An NER model acts like a set of magical highlighter pens. You give it a sentence, and it automatically highlights the important nouns and classifies them.

"Shares of Apple Inc. (AAPL) rose by $5.20 after Tim Cook announced record sales in China."

  • Blue Pen (ORG): Highlights organizations.
  • Purple Pen (MONEY): Highlights monetary values.
  • Green Pen (PERSON): Highlights people's names.
  • Yellow Pen (GPE): Highlights geopolitical entities (countries, cities).
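
As a minimal sketch of this in code, here is how the open-source spaCy library tags that exact sentence (assuming the small English model en_core_web_sm has been installed):

```python
import spacy

# Load a pretrained English pipeline
# (install first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

text = ("Shares of Apple Inc. (AAPL) rose by $5.20 after "
        "Tim Cook announced record sales in China.")

# Each detected entity carries its text span and a predicted label
for ent in nlp(text).ents:
    print(f"{ent.text:<12} -> {ent.label_}")

# Typical output (exact spans can vary by model version):
# Apple Inc.   -> ORG
# AAPL         -> ORG
# $5.20        -> MONEY
# Tim Cook     -> PERSON
# China        -> GPE
```
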
How is this useful for a Quant?

NER is the engine that turns a firehose of news into a structured database of events. A quant can build a system that:

  • Reads every news article in real time.
  • Uses NER to tag every company mentioned.
  • Uses sentiment analysis (from Lesson 9.2) to score the article's tone.
  • Creates a real-time signal ("Positive sentiment news just released for ticker AAPL") that can feed a high-frequency trading model.
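
Here is a toy sketch of such a signal record, reusing the spaCy pipeline above; score_sentiment is a hypothetical stand-in for whatever sentiment model you built in Lesson 9.2:

```python
from datetime import datetime, timezone
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_signal(article_text: str, score_sentiment) -> dict:
    """Turn one raw news article into a structured event record."""
    doc = nlp(article_text)
    orgs = sorted({ent.text for ent in doc.ents if ent.label_ == "ORG"})
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "organizations": orgs,  # map to tickers downstream
        "sentiment": score_sentiment(article_text),
    }

# Trivial stand-in scorer, just to show the shape of the output
record = extract_signal(
    "Shares of Apple Inc. (AAPL) rose after strong iPhone demand.",
    score_sentiment=lambda txt: 0.8,
)
print(record)
```

In practice, mapping raw ORG strings to tradable tickers requires an entity-linking step against a security master; the dictionary above only shows the shape of the signal.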

Part 2: Topic Modeling - The 'What are they talking about?'

Imagine you have 10,000 research reports. What are the main themes being discussed? Is the market worried about inflation, technology, or geopolitics? Manually reading them is impossible. **Topic Modeling** is an unsupervised learning technique that automatically discovers the abstract "topics" that occur in a collection of documents.

The most famous algorithm for this is **Latent Dirichlet Allocation (LDA)**.

The Core Analogy: The 'Document Smoothie' Maker

LDA assumes that each document is a **mixture of topics**, and each topic is a **mixture of words**.

It works like a smoothie maker in reverse:

  1. You show the algorithm thousands of document "smoothies."
  2. The algorithm's job is to figure out what the original "fruit" ingredients (the topics) must have been, and what proportion of each fruit went into each smoothie.
  3. The output is not a single "label" per document. Instead, it's a recipe: the proportion of each topic in each document.
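
For reference, the analogy maps onto the standard LDA generative story, stated here with $K$ topics and Dirichlet priors $\alpha$ and $\eta$:

```latex
\beta_k \sim \operatorname{Dirichlet}(\eta)
    \quad \text{(each topic $k$ is a distribution over words)}
\theta_d \sim \operatorname{Dirichlet}(\alpha)
    \quad \text{(each document $d$ is a distribution over topics: its recipe)}
z_{d,n} \sim \operatorname{Categorical}(\theta_d), \qquad
w_{d,n} \sim \operatorname{Categorical}(\beta_{z_{d,n}})
    \quad \text{(for each word position $n$: pick a topic, then a word)}
```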

Example Output for an earnings call:

  • Topic 1 (Inflation): 40%
  • Topic 2 (Supply Chain): 30%
  • Topic 3 (AI & Growth): 20%
  • Topic 4 (Competition): 10%
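
A minimal sketch of producing such a recipe with scikit-learn (the four-document corpus below is made up purely for illustration; a real run needs thousands of documents):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny illustrative corpus
docs = [
    "inflation rose as consumer prices and wages increased",
    "supply chain delays hurt shipping and inventory costs",
    "ai growth drove record cloud revenue and new models",
    "competition pressured margins as rivals cut prices",
]

# Bag-of-words counts (LDA works on raw counts, not TF-IDF)
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a 4-topic model
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topic = lda.fit_transform(X)  # shape: (n_docs, n_topics)

# Each row is a document's "recipe": topic proportions summing to ~1
for i, mix in enumerate(doc_topic):
    print(f"doc {i}: " + ", ".join(f"topic {k}: {p:.0%}"
                                   for k, p in enumerate(mix)))
```

Note that LDA returns unlabeled topics; an analyst assigns names like "Inflation" or "Supply Chain" by inspecting each topic's highest-weight words (available in lda.components_).
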
How is this useful for a Quant?

Topic modeling allows quants to analyze market narratives at a massive scale.

  • Tracking Macro Themes: A quant can run LDA on all news articles every day to create time series like "Prevalence of Inflation Topic over time." A sudden spike in this series can be a powerful macro indicator.
  • Feature Engineering: The topic proportions for a company's 10-K filing can be used as features in a model. For example, a company whose reports are shifting from "growth" topics to "litigation" topics might be a good short candidate.
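
As a sketch of the first idea, and assuming the lda model and vectorizer fitted in the previous block, a daily "inflation topic prevalence" series could look like this (articles_by_day and the topic index are hypothetical):

```python
import pandas as pd

# Hypothetical input: {date: [article texts published that day]}
articles_by_day = {
    "2024-01-02": ["consumer prices rose again", "wages climbed"],
    "2024-01-03": ["shipping delays eased", "inventory costs fell"],
}

INFLATION_TOPIC = 0  # index chosen by inspecting the topic's top words

prevalence = {}
for day, articles in articles_by_day.items():
    # Average weight of the inflation topic across the day's articles
    mix = lda.transform(vectorizer.transform(articles))
    prevalence[day] = mix[:, INFLATION_TOPIC].mean()

series = pd.Series(prevalence, name="inflation_topic_prevalence")
series.index = pd.to_datetime(series.index)
print(series)  # a spike in this series is the macro signal
```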

What's Next? Building the Pipeline

We've now learned about specific, high-value tasks we can perform on text data. We can extract entities, analyze sentiment, and discover hidden topics.

But how do we do this systematically on a stream of thousands of documents per day? This isn't just a modeling problem; it's a data engineering problem.

In the next lesson, we will outline the practical software architecture of an **Information Extraction Pipeline**, showing how to build a system that can ingest, parse, analyze, and store insights from a continuous stream of financial documents.