Lesson 3.1: The Intuition of a Decision Tree
Welcome to Module 3. We are now leaving the world of straight lines and planes behind. This lesson introduces a completely different way of thinking about prediction: the Decision Tree. We'll discover how this intuitive, flowchart-like model learns by asking a series of simple 'yes/no' questions to chop up the feature space into predictive regions.
Part 1: The Limits of Linearity
The linear and logistic regression models from Module 2 are powerful, interpretable, and the workhorses of classical econometrics. But they share a fundamental limitation: they can only capture **linear relationships**. They try to separate data with a single straight line or plane.
What if the relationship in our data is more complex? What if the "story" is not a straight line?
The Core Analogy: The '20 Questions' Game
Imagine you're playing the game "20 Questions." Your friend is thinking of an object, and you must guess what it is by asking yes/no questions.
- Your first question might be, "Is it alive?"
- If 'yes,' your next question is, "Is it an animal?"
- If 'yes,' "Does it live in water?"
Each question you ask slices the "universe" of all possible objects in half. You are building a **decision tree** in your head to narrow down the possibilities. This is *exactly* how a Decision Tree model works. It doesn't try to find a single, complex formula. It learns a sequence of simple, binary questions to partition the data into pure regions.
Part 2: Visualizing a Decision Tree
Let's use a simple financial example. We want to predict if a stock's return tomorrow will be "Up" or "Down" (a classification problem) based on two features:
- X₁: Yesterday's Return (%)
- X₂: Trading Volume (relative to its average)
A decision tree doesn't find a line. It finds the best *splits* in the data.
Top Box (Root): "Yesterday's Return < 0.5%?"
- Yes → Box 2: "Volume > 1.2?"
  - Yes → Box 4: "Predict UP" (Leaf)
  - No → Box 5: "Predict DOWN" (Leaf)
- No → Box 3: "Predict DOWN" (Leaf)
How to read this tree:
- The Root Node (The First Question): The tree starts by asking, "Was yesterday's return less than 0.5%?". This splits all the data points into two branches.
- Internal Nodes (More Questions): If the answer was "yes," it moves to the next node and asks another question: "Was the volume greater than 1.2 times its average?"
- Leaf Nodes (The Final Prediction): At the end of a path, we reach a "leaf." This leaf contains the final prediction for all data points that end up in that region. For example, if a stock had a return of -1% (Yes to Q1) and volume of 1.5x (Yes to Q2), our prediction is "Up".
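To make this concrete, here is a minimal sketch (not part of the lesson's materials) that fits a small tree like the one above, assuming scikit-learn is available. The feature names, the synthetic data, and the labeling rule are all illustrative assumptions, not real market data.

```python
# A minimal sketch: fit a shallow decision tree on a toy version of the
# stock example and print the learned flowchart of yes/no questions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)

# Two illustrative features: yesterday's return (%) and relative volume.
X = np.column_stack([
    rng.normal(0.0, 1.0, 500),    # yesterday's return
    rng.lognormal(0.0, 0.3, 500)  # volume relative to its average
])

# A made-up rule generating the labels, just so the tree has a pattern to find:
# "Up" (1) when yesterday's return was below 0.5% AND volume was above 1.2x.
y = ((X[:, 0] < 0.5) & (X[:, 1] > 1.2)).astype(int)

# Limit the depth so the flowchart stays readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# export_text prints the sequence of yes/no questions the tree learned.
print(export_text(tree, feature_names=["yesterday_return", "rel_volume"]))
```

The printed output has the same shape as the diagram above: a root question, internal questions, and leaves that carry the final class prediction.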
Part 3: How Does a Tree Learn? Finding the 'Best' Split
This is the core algorithm. How does the tree know which question to ask at each step? It tries every possible split on every feature and chooses the one that makes the resulting groups "purer."
"Purity" is a measure of how well the split separates the classes. The goal is to create child nodes that are more dominated by a single class (e.g., all "Up" or all "Down") than the parent node was. We need a number to measure this "impurity."
The Impurity Metrics: Gini vs. Entropy
Gini Impurity
Measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the node.
A Gini of 0 means perfect purity (all one class). A Gini of 0.5 means maximum impurity (50/50 split).
Entropy
A concept from information theory that measures the level of "disorder," "surprise," or "uncertainty" in a node.
An Entropy of 0 means perfect purity. An Entropy of 1 (for binary classification) means maximum impurity.
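As a concrete reference, here is a small sketch (in Python, an assumption about the course's tooling) that computes both metrics directly from a node's labels and reproduces the purity extremes quoted above.

```python
# Gini impurity and entropy for a node, computed from class proportions.
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # (+0.0 turns a possible -0.0 into 0.0 for pure nodes)
    return -np.sum(p * np.log2(p)) + 0.0

pure_node = np.array([1, 1, 1, 1])    # all "Up"
mixed_node = np.array([1, 1, 0, 0])   # 50/50 split

print(gini(pure_node), entropy(pure_node))    # 0.0 0.0 -> perfectly pure
print(gini(mixed_node), entropy(mixed_node))  # 0.5 1.0 -> maximally impure
```

In practice the two metrics almost always pick the same splits; Gini is the common default because it avoids the logarithm.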
The Learning Algorithm (Recursive Partitioning)
The tree is built using a greedy, top-down process called **recursive partitioning**:
1. Start at the root node with all your training data.
2. For every feature, find the best split point (e.g., "Return < 0.5%?") that maximizes the **Information Gain** (the reduction in impurity from parent to children).
3. Choose the single feature and split point that gives the highest Information Gain.
4. Split the data into two new child nodes based on that rule.
5. For each child node, repeat the process from Step 2.
6. Stop when a node is pure, has too few samples, or you reach a pre-defined maximum depth.
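Below is a minimal sketch of one round of this greedy search for a single feature. The function name, the toy data, and the use of Gini as the impurity measure are illustrative assumptions, not a reference implementation; a full tree would run this search over every feature and then recurse into each child.

```python
# One round of the greedy split search: try each candidate threshold on a
# feature and keep the one with the highest information gain.
import numpy as np

def gini(labels):
    # Same helper as above: 1 - sum of squared class proportions.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def best_split(feature, labels):
    """Return (threshold, information_gain) of the best split on one feature."""
    parent_impurity = gini(labels)
    best = (None, 0.0)
    for threshold in np.unique(feature):
        left, right = labels[feature < threshold], labels[feature >= threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        # Impurity of the children, weighted by how many samples land in each.
        child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        gain = parent_impurity - child_impurity
        if gain > best[1]:
            best = (float(threshold), float(gain))
    return best

# Toy data: returns below ~0.5 are followed by "Up" (1), the rest by "Down" (0).
yesterday_return = np.array([-1.2, -0.8, -0.3, 0.1, 0.4, 0.7, 1.1, 1.6])
label = np.array([1, 1, 1, 1, 1, 0, 0, 0])

print(best_split(yesterday_return, label))
# -> (0.7, 0.46875): the rule "return < 0.7" separates the toy classes perfectly.
```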
Part 4: Regression Trees
Trees can also be used for regression (predicting a number). The process is almost identical, but the way we measure "impurity" and make predictions changes.
- Impurity Metric: Instead of Gini or Entropy, we use **Variance** (or Mean Squared Error). The best split is the one that results in the largest reduction in variance.
- Prediction: The prediction at a leaf node is not the majority class, but the **average** of all the target values of the training samples that fall into that leaf.
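A short sketch of a regression tree follows, again assuming scikit-learn and a synthetic, step-shaped target. The `criterion="squared_error"` setting corresponds to the variance/MSE-reduction splitting rule described above, and each leaf predicts the mean target of its training samples.

```python
# A minimal regression-tree sketch: splits chosen by variance reduction,
# leaf predictions equal to the average target in each leaf.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 1))                               # one toy feature
y = np.where(X[:, 0] < 0.5, 1.0, -1.0) + rng.normal(0, 0.1, 300)    # noisy step function

reg = DecisionTreeRegressor(max_depth=2, criterion="squared_error", random_state=0)
reg.fit(X, y)

print(export_text(reg, feature_names=["x"]))  # leaves show the predicted averages
print(reg.predict([[10.0]]))                  # even far outside the training range,
                                              # the prediction is one of the leaf means
```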
Strengths
- Highly Interpretable: The flowchart structure is easy for humans to understand and explain to business stakeholders. This is a huge advantage over "black box" models.
- Handles Non-Linearity: Can capture complex, non-linear relationships and interactions between features without any special feature engineering.
- No Feature Scaling Needed: Since it only cares about split points, the scale of the features does not matter.
Weaknesses
- Prone to Overfitting: A single, deep tree is a high-variance model. It will memorize the training data, including its noise, and will not generalize well to new data. This is its single biggest flaw.
- Instability: Small changes in the input data can lead to a completely different tree structure.
- Can't Extrapolate: Because every leaf predicts the average of its own training samples, a regression tree can never predict a value outside the range of target values seen in the training data.
What's Next? Solving the Overfitting Problem
We have a powerful, intuitive model that can learn complex patterns. But its greatest strength—its flexibility—is also its greatest weakness, as it overfits very easily.
How do we fix this? The answer is one of the most powerful ideas in modern machine learning: **Ensemble Methods**. We will learn that by combining hundreds or thousands of these simple, unstable decision trees in a clever way, we can create an incredibly powerful, stable, and accurate model.
In our next lessons, we will explore the theory behind this "wisdom of the crowd" and build the two champion ensemble models: **Random Forest** and **Gradient Boosting Machines (XGBoost)**.