Lesson 5.1: Feature Engineering for Time Series
Creating lagged, rolling, and date-based features to prepare sequential data for machine learning models.
The Challenge: ML Models Have No Memory
Models like XGBoost or Linear Regression are "static." They have no built-in concept of time or order. Our job is to **encode the time-dependent information into the features themselves**.
The Core Techniques
Technique 1: Lag Features
We create new columns that represent the value of a variable from previous time steps. If our target is `y_t`, our features would be `y_{t-1}`, `y_{t-2}`, `y_{t-3}`. This gives the model an explicit "rear-view mirror."
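As a quick illustration with made-up prices, this is exactly what pandas' `shift` produces (the first row has no predecessor, so it becomes NaN):

```python
import pandas as pd

# Made-up price series to illustrate lagging
s = pd.Series([10.0, 11.0, 12.5, 12.0], name='price')

# shift(1) moves every value down one row: row t now holds the value from t-1
lagged = s.shift(1)
print(lagged.tolist())  # [nan, 10.0, 11.0, 12.5]
```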
Technique 2: Rolling Window Features
This technique creates features that summarize the behavior of a variable over a recent period, capturing local trends and momentum.
- Rolling Mean: The average return over the last 5 days.
- Rolling Standard Deviation: The volatility over the last 21 days.
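A minimal sketch of both rolling features; the toy values and the 3-day window (rather than the 5- and 21-day windows above) are chosen only to keep the arithmetic easy to check by hand:

```python
import pandas as pd

# Toy returns; a 3-observation window keeps the example small
r = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01, 0.02], name='returns')

roll_mean = r.rolling(window=3).mean()  # NaN until 3 observations exist
roll_std = r.rolling(window=3).std()    # sample std over the same window
```

Note that the first `window - 1` rows are NaN: the window is not yet full, so no summary is computed for them.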
Technique 3: Date-Based Features
Extracting information from the timestamp itself can capture seasonality.
- Day of the week.
- Month of the year.
- Quarter.
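With a `DatetimeIndex`, all three are one-liners (the dates below are arbitrary examples):

```python
import pandas as pd

# Arbitrary sample dates; any DatetimeIndex works the same way
idx = pd.date_range('2024-01-01', periods=5, freq='D')  # 2024-01-01 is a Monday
df = pd.DataFrame(index=idx)

df['day_of_week'] = df.index.dayofweek  # Monday=0 ... Sunday=6
df['month'] = df.index.month
df['quarter'] = df.index.quarter
```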
Python Implementation
import pandas as pd

# Create a sample returns series with a business-day index
returns = pd.Series([0.01, -0.02, 0.03, ...], name='returns')  # ... stands in for real data
df = pd.DataFrame(returns)
df.index = pd.date_range('2024-01-02', periods=len(df), freq='B')

# Lag Features: the return from 1, 2, and 3 steps back
for i in range(1, 4):
    df[f'lag_{i}'] = df['returns'].shift(i)

# Rolling Window Features
df['rolling_mean_5d'] = df['returns'].rolling(window=5).mean()
df['rolling_std_21d'] = df['returns'].rolling(window=21).std()

# Date-Based Features (require the DatetimeIndex set above)
df['day_of_week'] = df.index.dayofweek
df['month'] = df.index.month
df['quarter'] = df.index.quarter
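One practical caveat: every lag and rolling feature leaves NaNs at the start of the frame, where there is not yet enough history. A common approach, sketched here with synthetic seeded returns (the 60-row series and 21-day window are illustrative), is to drop those rows before fitting:

```python
import numpy as np
import pandas as pd

# Synthetic returns, seeded so the example is reproducible
rng = np.random.default_rng(0)
df = pd.DataFrame({'returns': rng.normal(0, 0.01, 60)})

df['lag_1'] = df['returns'].shift(1)
df['rolling_std_21d'] = df['returns'].rolling(window=21).std()

# The first 20 rows lack a full 21-day window, so they contain NaNs
clean = df.dropna()
```

Dropping rather than imputing keeps things honest: filling these gaps with values computed from the full series would itself leak future information.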
What's Next? Validating Our Model
We now know how to transform a time series into a rich feature set for an ML model. But how do we train and test this model? A random train-test split is a catastrophic error for time-series data: shuffling scatters future observations into the training set, so the model is evaluated on a past it has effectively already seen.
In the next lesson, we will learn about **Time Series Cross-Validation** and the "walk-forward" approach to prevent look-ahead bias.