Lesson 3.2: The Mathematics of a Split: Entropy & Gini Impurity

In our last lesson, we saw that a decision tree learns by asking a series of 'yes/no' questions. But how does it decide which question is the 'best' one to ask at each step? This lesson dives into the elegant mathematics behind that decision, introducing the two primary 'impurity' metrics—Gini Impurity and Entropy—that power the tree's learning process.

Part 1: The Goal - Creating 'Pure' Groups

Imagine you have a basket of 10 fruits: 5 Apples and 5 Oranges. This is an "impure" basket with a 50/50 mix. Your goal is to ask a single question that will split this basket into two new baskets that are as "pure" as possible (i.e., closer to being all Apples or all Oranges).

Which of these questions is better?

  1. "Is the fruit red?" → This might split the basket into `{4 Apples, 1 Orange}` and `{1 Apple, 4 Oranges}`.
  2. "Is the fruit round?" → This is a useless question, as both are round. The split is still `{5 Apples, 5 Oranges}`.

Clearly, Question 1 is better. It reduced the "impurity" of the groups. The algorithm needs a number to quantify this concept of "impurity" so it can compare different questions.

Part 2: The Impurity Metrics

We will explore the two most common metrics used by decision tree algorithms like CART (Classification and Regression Trees).

Metric 1: Gini Impurity
The most common metric for the CART algorithm.

Intuition: Gini Impurity measures the probability of incorrectly classifying a randomly chosen element from the group if it were randomly labeled according to the distribution of labels in that group.

Formula: Gini Impurity (G)

For a node with $C$ classes, where $p_i$ is the fraction of items belonging to class $i$, the Gini Impurity is:

$$G = \sum_{i=1}^{C} p_i (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2$$

  • Perfect Purity (G=0): If a node is 100% one class (e.g., $p_{Apple} = 1$, $p_{Orange} = 0$), then $G = 1 - (1^2 + 0^2) = 0$.
  • Maximum Impurity (G=0.5 for 2 classes): If a node is a 50/50 mix ($p_{Apple} = 0.5$, $p_{Orange} = 0.5$), then $G = 1 - (0.5^2 + 0.5^2) = 1 - 0.5 = 0.5$.
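
To make the formula concrete, here is a minimal Python sketch (the helper name and example baskets are ours, standard library only) that computes Gini Impurity from a list of class labels and applies it to the baskets from Part 1:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# The baskets from Part 1
parent    = ["Apple"] * 5 + ["Orange"] * 5   # 50/50 mix
red_yes   = ["Apple"] * 4 + ["Orange"] * 1   # "Is the fruit red?" -> yes
pure_node = ["Apple"] * 5                    # all one class

print(gini_impurity(parent))     # 0.5   -> maximum impurity for 2 classes
print(gini_impurity(red_yes))    # ~0.32 -> purer than the parent
print(gini_impurity(pure_node))  # 0.0   -> perfectly pure
```
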
Metric 2: Entropy
A concept borrowed from information theory, used by algorithms like ID3 and C4.5.

Intuition: Entropy measures the level of "disorder," "surprise," or "uncertainty" in a node. A node with high entropy is very mixed up and unpredictable.

Formula: Entropy (H)

For a node with $C$ classes, the Entropy is:

$$H = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

(Note: $0 \log_2 0$ is defined as 0.)

  • Perfect Purity (H=0): If a node is 100% one class ($p_1 = 1$), then $H = -(1 \log_2 1) = 0$. There is zero uncertainty.
  • Maximum Impurity (H=1 for 2 classes): If a node is a 50/50 mix, then $H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$.
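
The same kind of sketch works for Entropy; again, the function name and example baskets are ours, not part of any library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels: -sum of p_i * log2(p_i)."""
    n = len(labels)
    # Classes with zero count never appear in the Counter,
    # which matches the convention that 0 * log2(0) = 0.
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["Apple"] * 5 + ["Orange"] * 5))  # 1.0   -> maximum uncertainty for 2 classes
print(entropy(["Apple"] * 4 + ["Orange"] * 1))  # ~0.72 -> less uncertain than the parent
print(entropy(["Apple"] * 5))                   # 0.0   -> no uncertainty at all
```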

In practice, Gini Impurity and Entropy give very similar results. Gini is often the default because it's slightly faster to compute (no logarithms).
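
For example, in scikit-learn (used here purely as an illustration; this lesson is not tied to a specific library), switching between the two metrics is a single `criterion` argument, and the resulting trees are usually nearly identical:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "gini" is the default criterion; "entropy" switches to information-theoretic splits.
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The two criteria typically produce very similar (often identical) trees.
print(tree_gini.get_depth(), tree_entropy.get_depth())
```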

Part 3: The 'Goodness of a Split' - Information Gain

The tree doesn't just want a low impurity; it wants to find the split that causes the largest **reduction** in impurity. This reduction is called **Information Gain**.

Formula: Information Gain

$$\text{Information Gain} = \text{Impurity}(\text{Parent}) - \text{Weighted Average Impurity}(\text{Children})$$

The weighted average impurity of the children is calculated as:

$$\sum_{i=1}^{k} \frac{N_i}{N} \, \text{Impurity}(\text{Child}_i)$$

Where $N$ is the total number of samples at the parent node, $N_i$ is the number of samples in child node $i$, and $k$ is the number of children (for the binary splits used by CART, $k = 2$).
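
Putting the two pieces together, here is a small sketch (again with our own helper names) that scores the two questions from Part 1. The `gini_impurity` helper is the one from the Gini sketch above, repeated so the snippet runs on its own:

```python
from collections import Counter

def gini_impurity(labels):  # same helper as in the Gini sketch above
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children, impurity=gini_impurity):
    """Impurity of the parent minus the size-weighted average impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children if child)
    return impurity(parent) - weighted

parent = ["Apple"] * 5 + ["Orange"] * 5

# Question 1: "Is the fruit red?" -> {4 Apples, 1 Orange} and {1 Apple, 4 Oranges}
red_split = [["Apple"] * 4 + ["Orange"] * 1, ["Apple"] * 1 + ["Orange"] * 4]

# Question 2: "Is the fruit round?" -> everything stays together, nothing is separated
round_split = [parent, []]

print(information_gain(parent, red_split))    # ~0.18 -> impurity was reduced
print(information_gain(parent, round_split))  # 0.0   -> useless question, zero gain
```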

The Learning Algorithm (Revisited)

The **recursive partitioning** algorithm now has a clear objective at every node:

  1. For every feature, loop through every possible split point.
  2. For each potential split, calculate the Information Gain it would produce.
  3. Select the single split (across all features and all values) that has the **highest Information Gain**.
  4. Make that split and create the two new child nodes.
  5. Repeat this process on the children.
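
To make steps 1-3 concrete, here is a sketch of the greedy search at a single node, assuming numeric features and the `gini_impurity` helper defined earlier. Using the midpoint between consecutive sorted values as the candidate threshold is one common convention, not the only possibility:

```python
def best_split(X, y, impurity=gini_impurity):
    """Return (feature_index, threshold, gain) of the highest-gain split at this node.

    X is a list of rows of numeric feature values; y holds the matching class labels.
    """
    parent_impurity = impurity(y)
    n = len(y)
    best = (None, None, 0.0)

    for feature in range(len(X[0])):                                    # step 1: every feature...
        values = sorted({row[feature] for row in X})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]  # ...and every split point
        for t in thresholds:
            left = [label for row, label in zip(X, y) if row[feature] <= t]
            right = [label for row, label in zip(X, y) if row[feature] > t]
            # step 2: the Information Gain this split would produce
            gain = parent_impurity - (len(left) / n * impurity(left)
                                      + len(right) / n * impurity(right))
            if gain > best[2]:                                          # step 3: keep the best split
                best = (feature, t, gain)
    return best

# Tiny made-up example: each row is [weight_in_grams, is_red]
X = [[150, 1], [170, 0], [140, 1], [165, 1], [145, 0]]
y = ["Apple", "Orange", "Apple", "Apple", "Orange"]
print(best_split(X, y))  # (1, 0.5, ~0.48): splitting on "is_red <= 0.5" separates the classes perfectly
```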

What's Next? The Downside of Perfection

This greedy, information-maximizing algorithm is incredibly powerful. If you let it run unchecked, it will keep splitting and splitting until every single leaf node is perfectly pure (contains only one class), achieving 100% accuracy on the training data.

But is this a good thing? No. The tree has just memorized the training data, including all of its random noise. It has failed to learn the general underlying pattern.

In the next lesson, we will explore this fundamental weakness of decision trees—their tendency to **overfit**—and understand why it happens from a bias-variance perspective.