Lesson 4.8: Capstone Project - Forecasting Volatility with Ensembles

This is the final exam for Module 4. We will act as quantitative researchers to solve a realistic and important financial problem: forecasting near-term stock volatility. We will apply our two champion models—Random Forest and XGBoost—to this task, walking through the entire process from feature engineering to model comparison and interpretation.

Part 1: The Problem and The Goal

The Business Problem: An options trading desk needs a reliable forecast of near-term volatility to price its options and manage its risk. GARCH models (Module 5) are the classical tool, but machine learning models can capture complex, non-linear relationships between volatility and other market indicators, which in some settings yields more accurate forecasts.

The Machine Learning Goal: We will frame this as a **regression problem**. Our goal is to predict the **realized volatility over the next 21 trading days (1 month)** using historical data available up to today.

Defining the Target Variable: Realized Volatility

We need a way to measure the "true" volatility that actually occurred over a period. The standard measure is **realized volatility**, calculated as the annualized standard deviation of daily log returns over a specific window.

$$\text{RealizedVol}_{t, t+H} = \sqrt{\frac{252}{H} \sum_{i=1}^{H} r_{t+i}^2}$$

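Here $r_{t+i}$ is the daily log return on day $t+i$ and $H$ is the window length in trading days. The formula assumes the mean daily return is close to zero, which is a standard simplification for daily data.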
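As a minimal sketch of the formula above, the following NumPy function computes annualized realized volatility from a window of daily log returns under the zero-mean assumption. The function name `realized_vol` and the toy return values are illustrative assumptions, not part of the lesson's project code.

```python
import numpy as np

def realized_vol(returns, periods_per_year=252):
    """Annualized realized volatility of a window of daily log returns (zero-mean form)."""
    r = np.asarray(returns, dtype=float)
    H = r.size  # window length in trading days
    return np.sqrt(periods_per_year / H * np.sum(r ** 2))

# Toy window: 21 days that each move by exactly 1% (up or down)
toy_returns = 0.01 * np.array([1, -1] * 10 + [1])
vol = realized_vol(toy_returns)
print(f"{vol:.4f}")  # 0.01 * sqrt(252), roughly 0.1587
```

Because every squared return in the toy window equals 0.0001, the result reduces to 0.01 times the square root of 252, which makes the annualization factor easy to see.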

Our target variable, $y_t$, will be the realized volatility over the *next* 21 days, i.e. $y_t = \text{RealizedVol}_{t, t+21}$.
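The alignment of the target is easy to get wrong, so here is a small sketch of one way to build it in pandas and check it by hand. It uses a short horizon of 3 days and randomly generated returns purely for illustration; the variable names are assumptions.

```python
import numpy as np
import pandas as pd

H = 3  # short horizon so the alignment is easy to check by hand (the lesson uses 21)
rng = np.random.default_rng(0)
r = pd.Series(rng.normal(0.0, 0.01, size=12), name="log_return")

# Trailing zero-mean realized volatility: row t covers returns t-H+1 .. t
trailing_vol = np.sqrt(252 * (r ** 2).rolling(window=H).mean())

# Forward-looking target: row t covers returns t+1 .. t+H
target = trailing_vol.shift(-H)

# Check one row by hand against the formula
t = 4
manual = np.sqrt(252 / H * np.sum(r.iloc[t + 1 : t + H + 1].to_numpy() ** 2))
print(np.isclose(target.iloc[t], manual))   # True
print(target.tail(H).isna().all())          # True: the last H rows have no future window
```

The last `H` rows have no complete future window, which is why the project code has to drop trailing NaN rows before training.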

Part 2: The End-to-End Workflow

The Professional's Checklist

We will follow a complete data science workflow:

Data Preparation: Load daily price data for a stock (e.g., SPY).
Feature Engineering: Create our target variable (future volatility) and a set of predictive features from the historical data (e.g., past volatility, momentum indicators, etc.).
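Train-Test Split: For time series data we must split **chronologically** (train on the past, test on the future) rather than shuffle at random, to avoid look-ahead bias. Because each target spans the following 21 days, rows within 21 days of a split boundary carry information from the other side and should be excluded.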
Model Training & Tuning: Train and tune both a Random Forest and an XGBoost model using `GridSearchCV`.
Evaluation: Compare the models' performance on the unseen test set using regression metrics like RMSE and R-squared.
Interpretation: Analyze the feature importances from both models to understand what drives volatility.
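To make the split and cross-validation steps concrete, here is a small sketch of how scikit-learn's `TimeSeriesSplit` can leave a gap between each training fold and its validation fold. The `gap` argument requires scikit-learn 0.24 or later; the row count and placeholder features are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

horizon = 21                      # forecast horizon in trading days
n_rows = 1000                     # hypothetical number of training rows
X_demo = np.zeros((n_rows, 1))    # placeholder features; only the row order matters here

tscv = TimeSeriesSplit(n_splits=5, gap=horizon)
folds = list(tscv.split(X_demo))

for k, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {k}: train 0..{train_idx[-1]}, validate {val_idx[0]}..{val_idx[-1]}")
```

Each training fold ends `horizon` rows before its validation fold begins, so no training target is computed from returns inside the validation window.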

Part 3: The Complete Python Implementation

We will now walk through the complete, executable code for this project.

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

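# --- 1. Data Preparation ---
# auto_adjust=True returns split- and dividend-adjusted prices in 'Close'
# (recent yfinance releases no longer provide an 'Adj Close' column by default).
spy = yf.download('SPY', start='2005-01-01', end='2023-12-31', auto_adjust=True)
# Some yfinance versions return (field, ticker) MultiIndex columns; keep the field level only.
if isinstance(spy.columns, pd.MultiIndex):
    spy.columns = spy.columns.get_level_values(0)
spy['log_return'] = np.log(spy['Close'] / spy['Close'].shift(1))
spy.dropna(inplace=True)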

# --- 2. Feature Engineering ---
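# Trailing 21-day realized volatility (annualized). Note that .std() subtracts the
# sample mean, so it differs slightly from the zero-mean formula in Part 1.
spy['realized_vol_21d'] = spy['log_return'].rolling(window=21).std() * np.sqrt(252)

# Target: the *future* volatility. Shifting by -21 places the volatility of
# days t+1..t+21 on row t.
spy['target_vol'] = spy['realized_vol_21d'].shift(-21)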

# Predictive Features (using past data only)
spy['past_vol_5d'] = spy['log_return'].rolling(window=5).std() * np.sqrt(252)
spy['past_vol_21d'] = spy['realized_vol_21d']
spy['past_vol_63d'] = spy['log_return'].rolling(window=63).std() * np.sqrt(252)
spy['past_return_5d'] = spy['log_return'].rolling(window=5).sum()
spy['past_return_21d'] = spy['log_return'].rolling(window=21).sum()

# Drop rows with NaNs created by rolling windows and the final target shift
spy.dropna(inplace=True)

# Define Features (X) and Target (y)
features = ['past_vol_5d', 'past_vol_21d', 'past_vol_63d', 'past_return_5d', 'past_return_21d']
X = spy[features]
y = spy['target_vol']

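# --- 3. Time Series Train-Test Split ---
# First 80% of rows for training/validation, last 20% for the final hold-out test.
# Each target spans the next 21 days, so the last 21 training rows would have targets
# computed from returns in the test period; we drop them to avoid leakage.
horizon = 21
train_size = int(len(X) * 0.8)
X_train, X_test = X.iloc[:train_size - horizon], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size - horizon], y.iloc[train_size:]

# --- 4. Model Training & Tuning ---
# TimeSeriesSplit respects the time order; gap=horizon (scikit-learn >= 0.24) keeps
# training targets from overlapping each validation fold
tscv = TimeSeriesSplit(n_splits=5, gap=horizon)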

# --- Random Forest ---
print("Tuning Random Forest...")
rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf_params = {'n_estimators': [100, 300], 'max_depth': [5, 10], 'min_samples_leaf': [10, 20]}
rf_grid = GridSearchCV(rf, rf_params, cv=tscv, scoring='neg_root_mean_squared_error', n_jobs=-1)
rf_grid.fit(X_train, y_train)
best_rf = rf_grid.best_estimator_

# --- XGBoost ---
print("Tuning XGBoost...")
xgbr = xgb.XGBRegressor(objective='reg:squarederror', random_state=42, n_jobs=-1)
xgb_params = {'n_estimators': [100, 300], 'max_depth': [3, 5], 'learning_rate': [0.05, 0.1]}
xgb_grid = GridSearchCV(xgbr, xgb_params, cv=tscv, scoring='neg_root_mean_squared_error', n_jobs=-1)
xgb_grid.fit(X_train, y_train)
best_xgb = xgb_grid.best_estimator_

# --- 5. Evaluation on the Hold-Out Test Set ---
rf_pred = best_rf.predict(X_test)
xgb_pred = best_xgb.predict(X_test)

rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
xgb_rmse = np.sqrt(mean_squared_error(y_test, xgb_pred))
rf_r2 = r2_score(y_test, rf_pred)
xgb_r2 = r2_score(y_test, xgb_pred)

print("\n--- Model Performance ---")
print(f"Random Forest Test RMSE: {rf_rmse:.4f}, R^2: {rf_r2:.4f}")
print(f"XGBoost Test RMSE: {xgb_rmse:.4f}, R^2: {xgb_r2:.4f}")

# Visualize predictions
plt.figure(figsize=(14, 7))
plt.plot(y_test.index, y_test, label='Actual Volatility', color='black', alpha=0.7)
plt.plot(y_test.index, rf_pred, label=f'Random Forest Forecast (RMSE={rf_rmse:.4f})', color='blue', linestyle='--')
plt.plot(y_test.index, xgb_pred, label=f'XGBoost Forecast (RMSE={xgb_rmse:.4f})', color='red', linestyle=':')
plt.title('21-Day Volatility Forecast vs. Actual')
plt.legend()
plt.show()

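# --- 6. Interpretation: Feature Importance ---
# Random Forest importances are impurity-based; XGBoost's default for tree boosters is gain-based.
# Both are normalized to sum to 1, so compare rankings rather than raw magnitudes.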
rf_importances = pd.Series(best_rf.feature_importances_, index=features).sort_values(ascending=False)
xgb_importances = pd.Series(best_xgb.feature_importances_, index=features).sort_values(ascending=False)

fig, ax = plt.subplots(1, 2, figsize=(16, 6))
rf_importances.plot(kind='barh', ax=ax[0], title='Random Forest Feature Importance')
xgb_importances.plot(kind='barh', ax=ax[1], title='XGBoost Feature Importance')
plt.tight_layout()
plt.show()
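An RMSE is easier to judge next to a naive reference. One common sanity check, which is not part of the lesson's code, is a persistence baseline that forecasts next month's volatility as the most recent month's. The sketch below uses small made-up arrays (not real SPY results) and a hypothetical helper named `rmse`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-ins (assumptions for illustration, not real SPY results)
actual_vol = np.array([0.12, 0.15, 0.22, 0.18, 0.14])    # future realized volatility
past_vol   = np.array([0.11, 0.12, 0.15, 0.22, 0.18])    # most recent realized volatility
model_pred = np.array([0.12, 0.14, 0.19, 0.19, 0.15])    # a hypothetical model forecast

baseline_rmse = rmse(actual_vol, past_vol)
model_rmse = rmse(actual_vol, model_pred)
print(f"persistence RMSE: {baseline_rmse:.4f}, model RMSE: {model_rmse:.4f}")
print("model beats baseline:", model_rmse < baseline_rmse)
```

With the project's variables, the equivalent baseline would be `rmse(y_test, X_test['past_vol_21d'])`. A model that cannot beat it adds little beyond volatility persistence.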

Interpreting the Final Results

After running the project, a quant might summarize the findings for the trading desk along the following lines. The quotes are illustrative; the actual numbers depend on the data and on the hyperparameters the grid search selects.

Model Performance: "Both models demonstrate predictive power on the out-of-sample test set. The XGBoost model achieved a slightly lower RMSE, indicating it is marginally more accurate for this task."
Feature Importance Insights: "The feature importance plots from both models are in strong agreement. The most important predictor by far is `past_vol_21d` (the most recent month's volatility). This is consistent with the well-known financial phenomenon of volatility clustering: calm periods tend to follow calm periods, and turbulent periods tend to follow turbulent ones. Past returns have some, but much less, predictive power."
Recommendation: "Given its slightly better performance and speed, we recommend deploying the tuned XGBoost model. The forecast from this model can be used as a primary input for our daily options pricing and risk management systems."
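Built-in tree importances are computed on the training data and can favor some features for structural reasons, so a common cross-check is permutation importance on held-out data. The sketch below is a hedged illustration on synthetic data, not the lesson's SPY features: it uses scikit-learn's `permutation_importance`, and the column labels `strong`, `weak`, and `noise` are made up for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data (an assumption for illustration): y depends strongly on column 0,
# weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(600, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.3 * X_demo[:, 1] + rng.normal(scale=0.1, size=600)

# Chronological-style split: fit on the first 400 rows, score on the last 200
X_fit, X_eval = X_demo[:400], X_demo[400:]
y_fit, y_eval = y_demo[:400], y_demo[400:]

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=42)
model.fit(X_fit, y_fit)

# Shuffle one column at a time on held-out data and measure the drop in R^2
result = permutation_importance(model, X_eval, y_eval, n_repeats=10, random_state=42)
for name, score in zip(["strong", "weak", "noise"], result.importances_mean):
    print(f"{name}: {score:.3f}")
```

In the project, the same call could be applied to `best_rf` or `best_xgb` with `X_test` and `y_test` to see whether the ranking from the built-in importances holds up out of sample.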

Congratulations! You Have Completed Module 4

You have now completed a comprehensive study of ensemble methods, among the most powerful and widely used classes of models for tabular data.

You have not only mastered the theory of Bagging and Boosting but have applied them to a challenging, real-world financial forecasting problem. You understand how to engineer features for time series, how to validate your models correctly, and how to interpret their results to provide actionable insights.

What's Next in Your Journey?

Ensemble methods are the kings of prediction on structured, tabular data. But what about unstructured data? How do we find patterns when our input isn't a neat table of numbers, but a massive block of text from an earnings report, or a raw image from a satellite?

For these problems, we need a new toolkit. In **Module 5: Finding Structure: Unsupervised Learning**, we will explore the world of data without labels, learning how to automatically discover groups with Clustering and simplify data with Principal Component Analysis (PCA).
