Lesson 3.3: The Achilles' Heel of Trees: Why Single Trees Overfit
Decision trees are wonderfully intuitive and flexible. But this flexibility is also their greatest weakness. This lesson explains why a single, unconstrained decision tree is a 'high-variance' model that is almost guaranteed to overfit the training data, and why this makes it a poor predictor on its own.
Part 1: The Greedy Student
In the last lesson, we saw that a decision tree learns by finding the split that provides the maximum **Information Gain** at every step. This is a "greedy" algorithm. It makes the locally optimal choice at each node, without any consideration for the global structure of the tree.
If we don't give it a "stop" signal, it will continue this process until it can't split anymore. This happens when every leaf node is **perfectly pure**—containing only data points from a single class.
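To see this "grow until pure" behavior concretely, here is a minimal sketch, assuming scikit-learn and a small synthetic dataset from `make_classification` (neither of which is prescribed by this course; exact numbers will vary with the random seed):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A small, deliberately noisy binary classification problem (illustrative only).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

# With no depth or leaf-size limits, the tree keeps splitting until every
# leaf is pure (or no further split is possible).
tree = DecisionTreeClassifier(random_state=0)  # no "stop" signal
tree.fit(X, y)

print("Training accuracy:", tree.score(X, y))   # typically 1.0 -- it memorizes
print("Tree depth:       ", tree.get_depth())
print("Number of leaves: ", tree.get_n_leaves())
```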
The Core Analogy: The Over-Eager Intern
Imagine you give an intern a dataset of past customers and ask them to build a system to predict if a new customer will buy a product. The intern is very eager to get 100% accuracy on the historical data.
- They start with a broad rule: "Did the customer visit our pricing page?"
- This isn't perfect, so they add another rule: "...AND are they from North America?"
- Still not perfect. "...AND is it a Tuesday?"
- "...AND their company name starts with 'A'?"
They continue adding more and more specific, convoluted rules until they have a rule for every single customer in the historical dataset. They have built a model that achieves 100% accuracy on the data they were given.
The problem: This model is completely useless for predicting new customers. It hasn't learned the general *signal* of customer intent; it has memorized the *noise* and random coincidences of the specific dataset it was shown.
This is exactly what an unconstrained decision tree does. It overfits.
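Numerically, this memorization shows up as a large gap between accuracy on the training data and accuracy on held-out data. A quick sketch, again assuming scikit-learn and a synthetic, deliberately noisy dataset (your exact scores will differ):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: ~15% of labels are flipped, so a perfect fit to the
# training set necessarily memorizes noise.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Perfect on the data it memorized, noticeably worse on data it has never seen.
print("Train accuracy:", full_tree.score(X_train, y_train))
print("Test accuracy: ", full_tree.score(X_test, y_test))
```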
Part 2: The Bias-Variance Perspective
Let's analyze this behavior using the framework from Lesson 1.1.
- **Low Bias:** A fully grown tree is extremely flexible. It can capture any complex, non-linear pattern in the data. Because it can fit the training data perfectly, its "bias" (the error that comes from overly simplistic assumptions) is very low. It doesn't assume the data is linear, or of any particular shape at all.
- **High Variance:** This is the killer problem. The structure of the tree is highly sensitive to the specific training data. If you change just a few data points, the algorithm might pick a completely different "best first split," leading to a cascade of different decisions and a totally different final tree structure. The model's predictions are not stable; they have high variance (a quick experiment below makes this concrete).
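Here is a rough sketch of that experiment, assuming scikit-learn: fit the same unconstrained tree on two random 90% subsamples of the same data and compare what each one learns. The root split may or may not change for a particular seed, but the downstream structure and the resulting predictions usually do.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=1)

rng = np.random.default_rng(1)
trees = []
for _ in range(2):
    # Two slightly different views of the "same" data: random 90% subsamples.
    idx = rng.choice(len(X), size=int(0.9 * len(X)), replace=False)
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Compare what each tree learned: the root split and the predictions it makes.
for i, t in enumerate(trees):
    print(f"Tree {i}: root splits on feature {t.tree_.feature[0]} "
          f"at threshold {t.tree_.threshold[0]:.3f}")

disagree = np.mean(trees[0].predict(X) != trees[1].predict(X))
print(f"Fraction of points where the two trees disagree: {disagree:.1%}")
```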
Part 3: Visualizing the Overfit
The overfitting of a decision tree is easy to visualize by looking at the decision boundary it creates.
- **The simple model's boundary:** Imagine a scatter plot of blue and red dots that are mostly separable. A single straight line or a simple L-shape divides them well. A simple model (like Logistic Regression or a shallow tree) captures this main pattern.
- **The deep tree's boundary:** Now imagine the same scatter plot, but with a jagged, multi-stepped boundary that snakes around every single data point to classify it perfectly. A deep decision tree carves out exactly these axis-aligned, rectangular "islands" to isolate every point, and this complex boundary will fail badly on new data. (The code sketch after this list reproduces both pictures.)
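If you want to draw these two boundaries yourself, here is one way to do it, assuming scikit-learn, matplotlib, and the noisy two-moons toy dataset. The exact shapes depend on the noise level and random seed, but the contrast between the coarse regions of the shallow tree and the jagged islands of the unconstrained one should be clear.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data in two dimensions, so the boundary can be drawn directly.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)

shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
deep = DecisionTreeClassifier(random_state=0).fit(X, y)   # unconstrained

# Evaluate each model on a dense grid to draw its decision regions.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
                     np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model, title in zip(axes, (shallow, deep),
                            ("Shallow tree (max_depth=3)", "Unconstrained tree")):
    Z = model.predict(grid).reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.3)          # axis-aligned rectangular regions
    ax.scatter(X[:, 0], X[:, 1], c=y, s=15)    # the training points
    ax.set_title(title)
plt.tight_layout()
plt.show()
```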
A single, unconstrained decision tree is a brilliant theoretical tool for understanding how to partition data, but it is generally a **poor predictive model** on its own because of its extreme tendency to overfit.
This leads to a critical conclusion:
In practice, almost nobody uses a single decision tree as a final predictive model.
Instead, we use single trees as the **building blocks** for much more powerful **ensemble methods** (like Random Forest and Gradient Boosting), which are the subject of Module 4.
What's Next? How Do We Fix It?
We've identified the disease: high variance and overfitting. How do we treat it?
There are two primary approaches, sketched in code after this list:
- Constrain the tree *during* its growth to stop it from getting too complex.
- Let the tree grow fully and then *prune it back* to a simpler form.
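As a preview, here is roughly what each approach looks like in scikit-learn. The specific parameter values (`max_depth=4`, `min_samples_leaf=10`, `ccp_alpha=0.01`) are arbitrary placeholders, not recommendations; choosing them well is exactly what the next lesson covers.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)

# Approach 1: constrain the tree *during* growth with hyperparameters.
constrained = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10).fit(X, y)

# Approach 2: grow fully, then prune back with cost-complexity pruning
# (larger ccp_alpha values prune more aggressively).
pruned = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

unconstrained = DecisionTreeClassifier().fit(X, y)

for name, model in [("unconstrained", unconstrained),
                    ("constrained", constrained),
                    ("pruned", pruned)]:
    print(f"{name:>13}: {model.get_n_leaves()} leaves")
```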
In the next lesson, we will explore these techniques of **Pruning and Hyperparameter Tuning**, which are the practical tools for controlling a tree's complexity and finding the sweet spot in the bias-variance tradeoff.