Lesson 5.1: Feature Engineering for Time Series
Creating lagged, rolling, and date-based features to prepare sequential data for machine learning models.
The Challenge: ML Models Have No Memory
Models like XGBoost or Linear Regression are "static." They have no built-in concept of time or order. Our job is to **encode the time-dependent information into the features themselves**.
The Core Techniques
Technique 1: Lag Features
We create new columns that represent the value of a variable from previous time steps. If our target is `y_t`, our features would be `y_{t-1}`, `y_{t-2}`, `y_{t-3}`. This gives the model an explicit "rear-view mirror."
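As a quick illustration with made-up prices, this is exactly what pandas' `shift` produces (the first row has no predecessor, so it becomes NaN):

```python
import pandas as pd

# Made-up price series to illustrate lagging
s = pd.Series([10.0, 11.0, 12.5, 12.0], name='price')

# shift(1) moves every value down one row: row t now holds the value from t-1
lagged = s.shift(1)
print(lagged.tolist())  # [nan, 10.0, 11.0, 12.5]
```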
Technique 2: Rolling Window Features
This technique creates features that summarize the behavior of a variable over a recent period, capturing local trends and momentum.
- Rolling Mean: The average return over the last 5 days.
- Rolling Standard Deviation: The volatility over the last 21 days.
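A minimal sketch of both rolling features; the toy values and the 3-day window (rather than the 5- and 21-day windows above) are chosen only to keep the arithmetic easy to check by hand:

```python
import pandas as pd

# Toy returns; a 3-observation window keeps the example small
r = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01, 0.02], name='returns')

roll_mean = r.rolling(window=3).mean()  # NaN until 3 observations exist
roll_std = r.rolling(window=3).std()    # sample std over the same window
```

Note that the first `window - 1` rows are NaN: the window is not yet full, so no summary is computed for them.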
Technique 3: Date-Based Features
Extracting information from the timestamp itself can capture seasonality.
- Day of the week.
- Month of the year.
- Quarter.
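With a `DatetimeIndex`, all three are one-liners (the dates below are arbitrary examples):

```python
import pandas as pd

# Arbitrary sample dates; any DatetimeIndex works the same way
idx = pd.date_range('2024-01-01', periods=5, freq='D')  # 2024-01-01 is a Monday
df = pd.DataFrame(index=idx)

df['day_of_week'] = df.index.dayofweek  # Monday=0 ... Sunday=6
df['month'] = df.index.month
df['quarter'] = df.index.quarter
```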
Python Implementation
import pandas as pd

# Create a sample returns series with a business-day index
returns = pd.Series([0.01, -0.02, 0.03, ...], name='returns')  # ... stands in for real data
df = pd.DataFrame(returns)
df.index = pd.date_range('2024-01-02', periods=len(df), freq='B')

# Lag Features: the return from 1, 2, and 3 steps back
for i in range(1, 4):
    df[f'lag_{i}'] = df['returns'].shift(i)

# Rolling Window Features
df['rolling_mean_5d'] = df['returns'].rolling(window=5).mean()
df['rolling_std_21d'] = df['returns'].rolling(window=21).std()

# Date-Based Features (require the DatetimeIndex set above)
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['quarter'] = df.index.quarter
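One practical caveat: every lag and rolling feature leaves NaNs at the start of the frame, where there is not yet enough history. A common approach, sketched here with synthetic seeded returns (the 60-row series and 21-day window are illustrative), is to drop those rows before fitting:

```python
import numpy as np
import pandas as pd

# Synthetic returns, seeded so the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame({'returns': rng.normal(0, 0.01, 60)})

df['lag_1'] = df['returns'].shift(1)
df['rolling_std_21d'] = df['returns'].rolling(window=21).std()

# The first 20 rows lack a full 21-day window, so they contain NaNs
clean = df.dropna()
```

Dropping rather than imputing keeps things honest: filling these gaps with values computed from the full series would itself leak future information.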
What's Next? Validating Our Model
We now know how to transform a time series into a rich feature set for an ML model. But how do we train and test this model? A random train-test split is a catastrophic error for time-series data: shuffling scatters future observations into the training set, so the model is evaluated on a past it has effectively already seen.
In the next lesson, we will learn about **Time Series Cross-Validation** and the "walk-forward" approach to prevent look-ahead bias.