Lesson 6.9: Capstone: Building a Pairs Trading Strategy with Cointegration

This capstone lesson is where theory meets practice. We will synthesize everything we have learned about stationarity, cointegration, and error correction to build a complete, end-to-end quantitative pairs trading strategy. We will walk through the entire workflow, from finding a cointegrated pair to defining trading rules and visualizing the backtest.

Part 1: The Philosophy - Betting on Mean Reversion

Pairs trading is one of the oldest and most famous quantitative, market-neutral strategies. Its goal is to be profitable regardless of the overall market's direction. It is not a bet on whether the market will go up or down, but a bet on the stability of a relationship between two assets.

The entire strategy is a direct application of the concept of **cointegration**. We find two assets that are "leashed" together by a long-run economic relationship. We then monitor the "spread" between them. When the spread widens to an extreme, we place a bet that it will revert to its historical mean.

The Strategy at a Glance

Hypothesis: Two stocks, $A$ and $B$, are cointegrated, meaning their price spread, $\text{Spread} = P_A - \beta P_B$, is stationary and mean-reverting.
The Bet: If the spread becomes unusually large (e.g., $A$ has outperformed $B$), we bet on convergence. We **short the spread** by shorting one unit of stock A and buying $\beta$ units of stock B.
The Payoff: If the spread reverts to its mean, our combined position will be profitable, regardless of whether the whole market went up or down.
The Risk: The primary risk is a **structural break**—the historical relationship breaks down, and the spread widens indefinitely instead of reverting.
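
To make the payoff concrete, here is a minimal sketch with made-up numbers. The prices and the hedge ratio of 1.5 are illustrative assumptions, not data from this lesson. It shorts one unit of A, buys $\beta$ units of B, and shows that the profit depends only on how far the spread narrows, not on the market's direction.

```python
def spread(price_a: float, price_b: float, beta: float) -> float:
    """Spread = P_A - beta * P_B, as in the hypothesis above."""
    return price_a - beta * price_b


def short_spread_pnl(entry_a, entry_b, exit_a, exit_b, beta):
    """P&L from shorting 1 unit of A and buying beta units of B."""
    pnl_a = -(exit_a - entry_a)        # short leg gains when A falls
    pnl_b = beta * (exit_b - entry_b)  # long leg gains when B rises
    return pnl_a + pnl_b


beta = 1.5  # hypothetical hedge ratio

# Both scenarios start with a spread of 10 and end with a spread of 4.
# Scenario 1: the whole market rallies.
up = short_spread_pnl(entry_a=100.0, entry_b=60.0, exit_a=109.0, exit_b=70.0, beta=beta)
# Scenario 2: the whole market sells off.
down = short_spread_pnl(entry_a=100.0, entry_b=60.0, exit_a=79.0, exit_b=50.0, beta=beta)

print(up, down)  # -> 6.0 6.0
```

In both scenarios the profit equals the fall in the spread (10 minus 4), which is what "market-neutral" means in this context.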

Part 2: The Quant Workflow - A Step-by-Step Guide

Building a robust pairs trading strategy is a systematic process. We will break it down into three distinct phases: Formation, Trading, and Evaluation.

Phase 1: The Formation Period (Finding a Pair)

Identify Candidate Pairs: Start with economically sensible pairs. These are often competitors in the same industry (e.g., Coca-Cola vs. Pepsi; Ford vs. GM) or assets that track a similar underlying factor (e.g., Gold vs. a Gold Miners ETF; two different Emerging Market ETFs).
Define a Formation Period: Select a historical time window to test for cointegration and establish the relationship. For example, we might use all data from 2015-2020 as our formation period. This data is used *only* for setting up the strategy and is kept separate from our backtesting period.
Test for Cointegration: Use the Engle-Granger test (or the more advanced Johansen test) on the prices of the two assets over the formation period. The null hypothesis is that there is no cointegration, so a low p-value (< 0.05) lets us reject it and treat the assets as a cointegrated pair.
Estimate the Hedge Ratio: If cointegration is found, run an OLS regression of one price on the other, $P_A = \alpha + \beta P_B + \epsilon$. The estimated slope $\hat{\beta}$ is the hedge ratio, and $(1, -\hat{\beta})$ is the cointegrating vector.
Calculate the Spread: Compute the historical spread series over the formation period: $\text{Spread}_t = P_{A,t} - \hat{\beta} P_{B,t}$. The intercept $\hat{\alpha}$ stays inside the spread and becomes its mean, which is subtracted when we compute the Z-score.

Phase 2: The Trading Period (Generating Signals)

Define the Trading Period: Select a subsequent, out-of-sample time window to backtest the strategy. For example, 2021-2023.
Calculate the Live Spread: Using the $\hat{\beta}$ from the formation period, calculate the spread for each day in the trading period.
Normalize the Spread (Z-Score): To create comparable trading signals, we normalize the spread by calculating its Z-score. The Z-score tells us how many standard deviations the current spread is from its historical mean.
$Z_t = \frac{\text{Spread}_t - \mu_{\text{spread}}}{\sigma_{\text{spread}}}$
where $\mu_{\text{spread}}$ and $\sigma_{\text{spread}}$ are the mean and standard deviation of the spread from the **formation period**.
Define Entry/Exit Rules: Set thresholds for the Z-score to trigger trades. A common choice is:
- Entry Signal (Short): If $Z_t > 2.0$, short the spread (sell A, buy $\beta$ units of B).
- Entry Signal (Long): If $Z_t < -2.0$, go long the spread (buy A, sell $\beta$ units of B).
- Exit Signal: Close the position when the Z-score crosses back over zero.
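
The entry and exit rules form a small state machine: a position, once opened, is held until the Z-score returns to zero. A minimal sketch of that logic follows. The function name `generate_positions` and the sample Z-scores are hypothetical, chosen only to illustrate the rules.

```python
import numpy as np


def generate_positions(z_scores, entry=2.0, exit_level=0.0):
    """Return +1 (long spread), -1 (short spread) or 0 (flat) for each day."""
    positions = np.zeros(len(z_scores), dtype=int)
    current = 0
    for i, z in enumerate(z_scores):
        if current == 0:
            if z > entry:
                current = -1   # spread too wide: short it
            elif z < -entry:
                current = 1    # spread too narrow: go long
        elif current == -1 and z <= exit_level:
            current = 0        # short spread has reverted to the mean
        elif current == 1 and z >= exit_level:
            current = 0        # long spread has reverted to the mean
        positions[i] = current
    return positions


z = [0.5, 2.3, 1.4, 0.6, -0.2, -1.0, -2.4, -1.1, 0.3, 0.1]
print(generate_positions(z).tolist())
# -> [0, -1, -1, -1, 0, 0, 1, 1, 0, 0]
```

Note that the short position opened at 2.3 is still held at 1.4 and 0.6, even though the Z-score is back inside the entry band. It closes only when the Z-score reaches zero.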

Part 3: The Complete Python Implementation

We will now implement this entire workflow in Python, using two highly correlated Canadian bank stocks: Bank of Montreal (BMO) and Bank of Nova Scotia (BNS).

End-to-End Pairs Trading Backtest

import pandas as pd
import numpy as np
import yfinance as yf
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import coint
import statsmodels.api as sm

# --- Phase 1: Formation Period ---

# 1. Define assets and time periods
asset1_ticker = 'BMO' # Bank of Montreal
asset2_ticker = 'BNS' # Bank of Nova Scotia
formation_start = '2015-01-01'
formation_end = '2020-12-31'
trading_start = '2021-01-01'
trading_end = '2023-12-31'

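# 2. Download data
# auto_adjust=False keeps the 'Adj Close' column; newer yfinance releases adjust
# prices by default and no longer return it. Note that `end` is exclusive.
df = yf.download([asset1_ticker, asset2_ticker], start=formation_start, end=trading_end, auto_adjust=False)['Adj Close']
# Select columns by ticker rather than relying on the order yfinance returns them in
df = df[[asset1_ticker, asset2_ticker]].dropna()
df.columns = ['asset1', 'asset2']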

# 3. Split data into formation and trading periods
formation_df = df.loc[formation_start:formation_end]
trading_df = df.loc[trading_start:trading_end]

# 4. Test for cointegration in the formation period
coint_result = coint(formation_df['asset1'], formation_df['asset2'])
p_value = coint_result[1]
print(f"Formation Period Cointegration Test p-value: {p_value:.4f}")
if p_value > 0.05:
    print("Warning: Series may not be cointegrated. Strategy is not advised.")

# 5. Estimate the hedge ratio (beta)
X_formation = sm.add_constant(formation_df['asset2'])
model = sm.OLS(formation_df['asset1'], X_formation).fit()
beta = model.params['asset2']
print(f"Estimated Hedge Ratio (beta): {beta:.4f}")

# 6. Calculate the spread in the formation period and its statistics
spread_formation = formation_df['asset1'] - beta * formation_df['asset2']
spread_mean = spread_formation.mean()
spread_std = spread_formation.std()

# --- Phase 2: Trading Period ---

# 7. Calculate the live spread in the trading period
spread_trading = trading_df['asset1'] - beta * trading_df['asset2']

# 8. Calculate the Z-score of the live spread
z_score_trading = (spread_trading - spread_mean) / spread_std

# 9. Visualize the Z-score and trading thresholds
plt.figure(figsize=(14, 7))
z_score_trading.plot(label='Z-Score')
plt.axhline(2.0, color='red', linestyle='--', label='Short Entry Threshold (+2σ)')
plt.axhline(-2.0, color='green', linestyle='--', label='Long Entry Threshold (-2σ)')
plt.axhline(0.0, color='black', linestyle='-', label='Exit Threshold (Mean)')
plt.title('Z-Score of the Spread (Trading Period)')
plt.legend()
plt.show()

# --- Phase 3: Simple Backtest & Evaluation (Conceptual) ---
# A full backtest requires a proper event-driven backtesting engine.
# Here we will just calculate the conceptual P&L.

# 10. Generate signals and positions
positions = pd.DataFrame(index=z_score_trading.index)
positions['z_score'] = z_score_trading
# 1 for long spread, -1 for short spread, 0 for flat.
# Start from NaN so that ffill() can carry an open position forward.
positions['signal'] = np.nan
# Exit: flatten on the day the z-score crosses zero
prev_z = positions['z_score'].shift(1)
crossed_zero = ((prev_z > 0) & (positions['z_score'] <= 0)) | ((prev_z < 0) & (positions['z_score'] >= 0))
positions.loc[crossed_zero, 'signal'] = 0
# Entry: applied after the exits so a jump straight through zero to an extreme still opens a trade
positions.loc[positions['z_score'] > 2.0, 'signal'] = -1
positions.loc[positions['z_score'] < -2.0, 'signal'] = 1
# Hold the most recent entry or exit until the next one; stay flat before the first entry
positions['signal'] = positions['signal'].ffill().fillna(0)

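# 11. Calculate daily returns of the strategy
# One unit of spread is 1 share of asset1 against beta shares of asset2, so its daily
# P&L is the change in (P_A - beta * P_B). Dividing by the previous day's gross exposure
# converts that to a return. Yesterday's signal is used so we only trade on known data.
spread_pnl = spread_trading.diff()
gross_exposure = (trading_df['asset1'] + abs(beta) * trading_df['asset2']).shift(1)
strategy_returns = (positions['signal'].shift(1) * spread_pnl / gross_exposure).fillna(0)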

# 12. Plot cumulative returns
cumulative_returns = (1 + strategy_returns).cumprod()
plt.figure(figsize=(14, 7))
cumulative_returns.plot(label='Pairs Trading Strategy')
plt.title('Cumulative Strategy Returns')
plt.ylabel('Cumulative Return')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()
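
The cumulative return plot is only a first look at the Evaluation phase. A minimal sketch of a few summary statistics follows, assuming daily returns, a zero risk-free rate, and 252 trading days per year. The function name `evaluate` and the example return series are hypothetical; in the backtest above you would pass `strategy_returns` instead.

```python
import numpy as np
import pandas as pd


def evaluate(returns: pd.Series, periods_per_year: int = 252) -> dict:
    """Summarise a series of periodic strategy returns."""
    returns = returns.dropna()
    equity = (1.0 + returns).cumprod()
    # Measure drawdowns against the running peak, never below starting capital of 1.0
    running_peak = equity.cummax().clip(lower=1.0)
    drawdown = equity / running_peak - 1.0
    vol = returns.std()
    # Sharpe ratio with the risk-free rate assumed to be zero
    sharpe = np.sqrt(periods_per_year) * returns.mean() / vol if vol > 0 else float("nan")
    return {
        "total_return": equity.iloc[-1] - 1.0,
        "annualised_sharpe": sharpe,
        "max_drawdown": drawdown.min(),
    }


example = pd.Series([0.01, -0.02, 0.015, 0.0, 0.005])  # made-up daily returns
stats = evaluate(example)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Maximum drawdown is especially informative for pairs trading, because a structural break shows up as a drawdown that never recovers.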

Part 4: Real-World Considerations and Risks

While this simple backtest is illustrative, a real-world implementation would need to be far more robust.

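Transaction Costs: Every trade incurs commissions and bid-ask spread costs, and each round trip in a pairs trade involves four trades (two legs to open, two to close). These costs can significantly erode the profitability of a strategy that trades frequently.
Stale Parameters and Look-ahead Bias: Our simple backtest fixes $\hat{\beta}$, the mean, and the standard deviation at their formation-period values. Because the formation period ends before trading begins, this does not leak future information, but the estimates can go stale as the relationship drifts. A rolling window lets them adapt. To avoid look-ahead bias, each day's statistics must then be computed only from data available before that day.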
Risk of Structural Break: The cointegrating relationship can break down. The "leash" can snap. For example, one company might be acquired, or a new technology might fundamentally change their business models, causing their prices to diverge permanently. This is the single greatest risk to a pairs trading strategy.
Model Selection: Finding genuinely cointegrated pairs that are also volatile enough to provide trading opportunities is a difficult data mining problem. Many statistically significant relationships are not economically meaningful or stable.
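
Two of these considerations can be sketched in a few lines: a rolling Z-score that uses only past observations, and a proportional cost charged whenever the position changes. The function names, the 60-day window, and the cost of 0.001 per unit of position change are illustrative assumptions, not recommendations.

```python
import numpy as np
import pandas as pd


def rolling_zscore(spread: pd.Series, window: int = 60) -> pd.Series:
    """Z-score of each day's spread against the previous `window` observations."""
    past = spread.shift(1)  # exclude today's value from its own statistics
    mean = past.rolling(window).mean()
    std = past.rolling(window).std()
    return (spread - mean) / std


def net_returns(gross: pd.Series, signal: pd.Series, cost_per_unit: float = 0.001) -> pd.Series:
    """Subtract a proportional cost each time the held position changes."""
    held = signal.shift(1).fillna(0)         # position actually held during each day
    turnover = held.diff().abs().fillna(0)   # 1 to open or close, 2 to flip
    return gross - cost_per_unit * turnover


spread = pd.Series(np.arange(10, dtype=float))  # toy spread: 0, 1, ..., 9
z = rolling_zscore(spread, window=3)
print(z.iloc[3])  # past values 0, 1, 2 have mean 1 and std 1, so z = (3 - 1) / 1 -> 2.0

net = net_returns(pd.Series([0.0, 0.01, 0.01, 0.0]), pd.Series([1, 1, 0, 0]))
print(net.round(4).tolist())  # -> [0.0, 0.009, 0.01, -0.001]
```

Because of the `shift(1)`, changing a later value of the spread never alters an earlier Z-score, which is a simple way to check that a signal is free of look-ahead.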

Conclusion of Module 6 and The Path Forward

You have done it. You have progressed from the fundamental axioms of probability all the way to designing, building, and backtesting a complete, market-neutral quantitative trading strategy. You have mastered the tools of both univariate and multivariate time series analysis and understand the deep connection between economic theory and statistical modeling.

This capstone project is not an end, but a beginning. It is the foundation upon which all more advanced quantitative and machine learning strategies are built. The skills you have acquired in these six modules are the essential toolkit for any serious practitioner in the field.

The journey ahead involves exploring non-linear models, incorporating machine learning techniques for signal generation, and diving deeper into the nuances of risk management and execution. You are now fully equipped to begin that journey.
