Lesson 2.8: Capstone Project - Building a Credit Default Predictor

This is the final exam for Module 2. We will apply everything we have learned—from data splitting and scaling to Logistic Regression and model evaluation—to solve a classic quantitative finance problem. We will build, train, and interpret a model to predict the probability of loan default.

Part 1: The Problem and The Goal

The Business Problem: A bank wants to improve its lending decisions. It needs a model that can assess the risk of a new loan applicant defaulting on their loan. A good model will help the bank approve more creditworthy applicants while rejecting those who are too risky, thereby reducing financial losses.

The Machine Learning Goal: We will build a **binary classification model**. Given a set of features for a loan applicant (e.g., income, loan amount, credit history), the model will output a **probability of default**. This probability can then be used to make a "Default" (1) or "No Default" (0) decision.
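
To make the decision step concrete, here is a minimal, hypothetical sketch (the probabilities and the 0.20 cutoff are invented for illustration; in practice the cutoff is a business choice driven by the cost of bad loans versus lost customers):

import numpy as np

# Hypothetical default probabilities produced by some already-trained model
predicted_prob_default = np.array([0.03, 0.45, 0.18, 0.72])
cutoff = 0.20                                               # illustrative threshold, not a recommendation
decision = (predicted_prob_default >= cutoff).astype(int)   # 1 = flag as likely default
print(decision)                                             # -> [0 1 0 1]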

The Model of Choice: Logistic Regression

We will use Logistic Regression for this task. It's the perfect tool because:

  • It's highly **interpretable**. We can examine the coefficients to understand exactly which factors are driving default risk, which is crucial for regulatory compliance. (A short sketch of how coefficients map to probabilities and odds follows this list.)
  • It's a powerful and robust **baseline model**. Any more complex model (like XGBoost or a Neural Network) would need to prove it can significantly outperform this simple, reliable workhorse.
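
Both points come from the model's simple functional form. Here is a short, self-contained sketch of how coefficients map to probabilities and odds ratios (every number below is made up purely for illustration; these are not the coefficients we will estimate later):

import numpy as np

# Made-up example: an intercept plus one weight per feature (credit score, income, loan amount)
intercept = -1.0
weights = np.array([-0.8, -0.1, 0.4])
x = np.array([0.5, -1.2, 1.0])                      # one applicant's (standardized) feature values

log_odds = intercept + weights @ x                  # the linear score is the log-odds of default
prob_default = 1 / (1 + np.exp(-log_odds))          # the sigmoid turns log-odds into a probability
odds_ratios = np.exp(weights)                       # e.g., exp(-0.8) ~ 0.45: odds fall ~55% per unit
print(prob_default, odds_ratios)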

Part 2: The End-to-End Workflow

The Professional's Checklist

We will follow the full machine learning workflow we've established in this module (a compact Pipeline version of steps 2 and 3 is sketched after the checklist):

  1. Data Preparation: Load the data, define features (X) and the target (y), and perform a train-test split.
  2. Feature Scaling: Apply `StandardScaler` to our training data to ensure all features are on a common scale.
  3. Model Training: Fit a Logistic Regression model to the scaled training data.
  4. Prediction & Evaluation: Make predictions on the unseen test set and evaluate the model's performance using a confusion matrix, classification report, and ROC/AUC curve.
  5. Interpretation: Analyze the model's coefficients to understand the key drivers of default risk.
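
As a side note, steps 2 and 3 are often bundled into a single object. Here is a minimal sketch using scikit-learn's Pipeline, with the same StandardScaler and LogisticRegression settings used in the full code below (shown only as a convenience; the walkthrough keeps the steps separate so each one stays visible):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Chaining the scaler and the model ensures the scaler is only ever fitted on training data
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(class_weight='balanced', random_state=42)),
])
# pipe.fit(X_train, y_train)              # fit on the training split
# pipe.predict_proba(X_test)[:, 1]        # default probabilities for the test split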

Part 3: The Complete Python Implementation

To keep the lesson self-contained, we will simulate a lending-style dataset and walk through the code on it; in a real project you would load a real dataset (e.g., from a CSV). The full, executable code is provided below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, roc_auc_score

# --- 1. Data Preparation ---
# For this example, we will simulate a dataset that mimics real lending data.
# In a real project, you would load this from a CSV.
np.random.seed(42)
n_samples = 2000
# Features
credit_score = np.random.randint(500, 850, n_samples)
income = np.random.randint(30000, 250000, n_samples)
loan_amount = np.random.randint(5000, 50000, n_samples)
# The probability of default is our "true" underlying model:
# higher credit scores (and, slightly, higher incomes) lower the risk; larger loans raise it
logit = 3.0 - 0.01*credit_score - 0.000002*income + 0.00005*loan_amount
prob_default = 1 / (1 + np.exp(-logit))
# Generate the binary outcome (0 = Repaid, 1 = Default)
default = (np.random.rand(n_samples) < prob_default).astype(int)

df = pd.DataFrame({'credit_score': credit_score, 'income': income, 'loan_amount': loan_amount, 'default': default})

# Define Features (X) and Target (y)
X = df[['credit_score', 'income', 'loan_amount']]
y = df['default']

# Perform a stratified train-test split to preserve class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
print(f"Training set size: {X_train.shape[0]}, Test set size: {X_test.shape[0]}")

# --- 2. Feature Scaling ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test) # IMPORTANT: Use the scaler fitted on the training data

# --- 3. Model Training ---
# We use class_weight='balanced' to handle the imbalanced nature of default data
model = LogisticRegression(class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)

# --- 4. Prediction & Evaluation ---
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1] # Probabilities for the 'Default' class

print("\n--- Model Evaluation ---")
# Classification Report (Precision, Recall, F1-Score)
print(classification_report(y_test, y_pred, target_names=['Repaid', 'Default']))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Repaid', 'Default'], yticklabels=['Repaid', 'Default'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

# ROC Curve and AUC Score
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
auc_score = roc_auc_score(y_test, y_pred_proba)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc_score:.2f})')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend()
plt.show()

# --- 5. Interpretation ---
# Note: these coefficients are on the standardized scale (one-standard-deviation changes)
coefs = pd.DataFrame(model.coef_[0], index=X.columns, columns=['Coefficient'])
coefs['Odds Ratio'] = np.exp(coefs['Coefficient'])
print("\n--- Coefficient Interpretation (Odds Ratios) ---")
print(coefs)

Interpreting the Final Results

After running the code, a quant would summarize their findings for the bank's risk committee:

  • Overall Performance (AUC): "Our model achieves an AUC of [e.g., 0.85], indicating it has strong discriminatory power. It is substantially better than random guessing at distinguishing between defaulters and non-defaulters."
  • Precision/Recall Tradeoff: "The model's recall for the 'Default' class is [e.g., 0.75], meaning we successfully identify 75% of all actual defaulters. The precision is [e.g., 0.60], meaning that when our model flags a customer as a likely defaulter, it is correct 60% of the time."
  • Coefficient Insights (from Odds Ratios): Because the model was trained on standardized features, each odds ratio describes a one-standard-deviation change in that feature (a sketch for converting back to per-raw-unit odds ratios follows this list).
    • `credit_score`: "The odds ratio for credit score is [e.g., 0.95]. A one-standard-deviation increase in credit score decreases the odds of default by 5%, holding other factors constant. This is a significant risk-reducing factor."
    • `income`: "The odds ratio for income is [e.g., 1.02]. This is close to 1 and likely not statistically significant, suggesting income is not a major predictor in our model after accounting for other factors."
    • `loan_amount`: "The odds ratio for loan amount is [e.g., 1.08]. A one-standard-deviation increase in loan amount increases the odds of default by 8%. Larger loans are riskier."
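
Because the model was fitted on standardized features, the odds ratios printed by the code describe one-standard-deviation changes. Here is a minimal sketch of converting them back to per-raw-unit odds ratios, reusing the `model`, `scaler`, and `X` objects from the code above (`StandardScaler` stores each feature's training-set standard deviation in its `scale_` attribute):

# Coefficient per raw unit = standardized coefficient / that feature's standard deviation
raw_coefs = model.coef_[0] / scaler.scale_
per_unit_odds = pd.Series(np.exp(raw_coefs), index=X.columns, name='Odds Ratio per raw unit')
print(per_unit_odds)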

Congratulations! You Have Completed Module 2

You have now completed a comprehensive, deep dive into the world of linear models. You have mastered not only the "how" (the code) but the "why" (the mathematics) and the "so what" (the diagnostics and interpretation).

What's Next in Your Journey?

Linear models are powerful, but they have their limits. They are fundamentally designed to find **linear** relationships in data. What happens when the patterns are more complex and non-linear?

In **Module 3: Tree-Based Models & Non-Linearity**, we will explore a completely different class of models. We will learn how Decision Trees work by asking a series of simple questions to partition the data, and then how to combine hundreds of these simple trees into powerful "ensemble" models like Random Forest and the champion of them all, XGBoost.