Lesson 2.6: Evaluating Classifiers: Precision-Recall vs. ROC/AUC

A single accuracy score can be dangerously misleading. This lesson provides the professional toolkit for evaluating classification models across all decision thresholds. We'll master the Precision-Recall curve for imbalanced data and the industry-standard ROC curve, along with its AUC summary score, for a complete picture of model performance.

Part 1: The Problem with a Single Threshold

In our last lesson, we built a Logistic Regression model that outputs a probability, \hat{p}. We made our final classification using a simple rule: if \hat{p} > 0.5, predict 1; otherwise, predict 0. But this 0.5 threshold is completely arbitrary.

Imagine our fraud detection model again. The business might decide that they are extremely risk-averse and want to catch as much fraud as possible, even if it means more false alarms. They might tell us, "Flag any transaction where the probability of fraud is greater than 10% (\hat{p} > 0.1)."

Conversely, a marketing team sending promotional emails might only want to target users with a very high probability of converting, say \hat{p} > 0.9, to avoid annoying other customers.

A good model shouldn't just be accurate at a single, arbitrary threshold. It should perform well across a whole range of possible thresholds. We need tools to visualize and quantify this performance.
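
To see what changing the threshold does, here is a minimal sketch (the probabilities below are made up for illustration) that applies three different cutoffs to the same set of predicted probabilities:

import numpy as np

# Hypothetical predicted probabilities from a fitted classifier
p_hat = np.array([0.05, 0.12, 0.48, 0.55, 0.91])

for threshold in [0.1, 0.5, 0.9]:
    predictions = (p_hat > threshold).astype(int)
    print(f"threshold={threshold}: {predictions}")

# threshold=0.1: [0 1 1 1 1]  <- risk-averse fraud team: more flags, more false alarms
# threshold=0.5: [0 0 0 1 1]  <- the default rule
# threshold=0.9: [0 0 0 0 1]  <- cautious marketing team: only near-certain positives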

Part 2: The Precision-Recall Curve

The Precision-Recall curve is the best tool for evaluating models on **imbalanced datasets**, like fraud detection or medical screening, where the "positive" class is rare.

It visualizes the inherent tradeoff: to increase Recall (catch more true positives), you almost always have to sacrifice some Precision (you'll have more false positives).

How to Read a Precision-Recall Curve

Imagine a PR curve here: Y-axis is Precision, X-axis is Recall. A good model's curve is pushed up and to the right.

  • The X-Axis is Recall (Sensitivity): How many of the actual positive cases did we catch?
  • The Y-Axis is Precision: When we predicted positive, how often were we correct?
  • The Curve: Each point on the curve represents the Precision/Recall pair for a specific probability threshold. As you move along the curve (by lowering the threshold), Recall increases, but Precision typically decreases; the short sketch after this list works through this tradeoff on a tiny example.
  • The Goal: A "perfect" model would be in the top-right corner (Precision=1, Recall=1). A good model has a curve that is pushed as far towards this corner as possible.
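
As a minimal illustration (the labels and scores below are made up), we can compute Precision and Recall at two thresholds and watch the tradeoff directly:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities
y_true   = np.array([0,   0,   0,   0,   1,    1,   0,   1])
y_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8])

for threshold in [0.5, 0.4]:
    y_pred = (y_scores >= threshold).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# Lowering the threshold from 0.5 to 0.4 raises recall (0.67 -> 1.00)
# but lowers precision (0.67 -> 0.60): more positives caught, more false alarms let in.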

Part 3: The ROC Curve and AUC Score

The **Receiver Operating Characteristic (ROC)** curve is the most widely used evaluation tool for binary classification in both academic and industry settings. It plots the tradeoff between two rates, defined below.

True Positive Rate (TPR)

This is just another name for **Recall** or **Sensitivity**.

\text{TPR} = \frac{TP}{TP + FN}

False Positive Rate (FPR)

What fraction of all actual *negatives* were incorrectly flagged as positive?

\text{FPR} = \frac{FP}{FP + TN}
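
As a quick sanity check of these two formulas, here is a minimal sketch (the labels and predictions are made up) that computes TPR and FPR directly from a confusion matrix:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and hard predictions at some threshold
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

tpr = tp / (tp + fn)   # Recall / Sensitivity: 3 / (3 + 1) = 0.75
fpr = fp / (fp + tn)   # false alarms among actual negatives: 2 / (2 + 4) = 0.33
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
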
How to Read a ROC Curve

Imagine a ROC curve here: Y-axis is TPR, X-axis is FPR. The curve starts at (0,0) and ends at (1,1). A good model's curve is bowed up towards the top-left.

  • The Axes: TPR (good) on the Y-axis vs. FPR (bad) on the X-axis.
  • The Curve: Each point represents the TPR/FPR pair for a specific threshold (the sketch after this list shows how to turn one of these points into an operating threshold).
  • The Diagonal Line: The line y = x represents a "random guess" model. Any useful model must have a curve above this line.
  • The Goal: A "perfect" model would be a point in the top-left corner (TPR=1, FPR=0), meaning it catches all positives with zero false alarms. A good model has a curve that is "bowed" as close to this corner as possible.
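
To sketch how these per-threshold points get used in practice, note that roc_curve also returns the thresholds themselves, so you can pick the operating point that maximizes TPR subject to an FPR constraint. The 10% FPR budget below is an arbitrary example on synthetic data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)

# Hypothetical business constraint: at most a 10% false positive rate
budget = 0.10
valid = np.where(fpr <= budget)[0]
best = valid[np.argmax(tpr[valid])]
print(f"threshold={thresholds[best]:.2f} gives TPR={tpr[best]:.2f} at FPR={fpr[best]:.2f}")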

The Area Under the Curve (AUC)

While the ROC curve is a great visual, we often need a single number to summarize its performance. That number is the **Area Under the (ROC) Curve**, or **AUC**.

  • AUC = 1.0: A perfect classifier.
  • AUC = 0.5: A useless classifier, no better than random guessing.
  • AUC < 0.5: A classifier that is systematically wrong (its predictions are the opposite of the truth).

Interpretation:

The AUC has a beautiful, intuitive meaning. It is the probability that a randomly chosen positive sample will have a higher predicted probability than a randomly chosen negative sample. An AUC of 0.85 means that 85% of the time, the model correctly ranks a positive instance higher than a negative one.
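
This ranking interpretation is easy to verify numerically. The sketch below (on synthetic data, purely for illustration) compares roc_auc_score against the fraction of (positive, negative) pairs that the model ranks correctly:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=1)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

pos, neg = scores[y == 1], scores[y == 0]
# Fraction of positive/negative pairs where the positive is ranked higher
# (ties, rare with continuous scores, count as half)
pairs = pos[:, None] - neg[None, :]
pairwise = (pairs > 0).mean() + 0.5 * (pairs == 0).mean()

print(f"roc_auc_score: {roc_auc_score(y, scores):.4f}")
print(f"pairwise rank: {pairwise:.4f}")   # the two numbers match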

Part 4: Python Implementation

Generating and Comparing Curves in Python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve, roc_curve, auc, roc_auc_score
from sklearn.datasets import make_classification

# --- 1. Generate sample data ---
np.random.seed(42)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=10, n_classes=2, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# --- 2. Train two different models ---
# A simple model
log_reg = LogisticRegression().fit(X_train, y_train)
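# predict_proba returns probabilities for both classes; column 1 is P(y = 1)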
y_scores_lr = log_reg.predict_proba(X_test)[:, 1]

# A more complex model
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_scores_rf = rf.predict_proba(X_test)[:, 1]

# --- 3. Calculate and Plot the Precision-Recall Curve ---
precision_lr, recall_lr, _ = precision_recall_curve(y_test, y_scores_lr)
precision_rf, recall_rf, _ = precision_recall_curve(y_test, y_scores_rf)

plt.figure(figsize=(8, 6))
plt.plot(recall_lr, precision_lr, label=f'Logistic Regression (PR AUC={auc(recall_lr, precision_lr):.2f})')
plt.plot(recall_rf, precision_rf, label=f'Random Forest (PR AUC={auc(recall_rf, precision_rf):.2f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.show()

# --- 4. Calculate and Plot the ROC Curve ---
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_scores_lr)
auc_lr = roc_auc_score(y_test, y_scores_lr)

fpr_rf, tpr_rf, _ = roc_curve(y_test, y_scores_rf)
auc_rf = roc_auc_score(y_test, y_scores_rf)

plt.figure(figsize=(8, 6))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {auc_lr:.2f})')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {auc_rf:.2f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

What's Next? Understanding the 'Why' of Linear Models

We've now built and learned how to professionally evaluate our two workhorse linear models: Linear Regression (for numbers) and Logistic Regression (for categories).

But all of this has been from a very practical, machine-learning perspective. To truly master these models and understand when they might fail, we must go deeper. We need to understand the formal statistical theory that underlies them.

In the next lesson, we will begin this journey by exploring the **Assumptions of Linear Models**. This will be our bridge from the world of ML back to the rigorous world of econometrics.