Lesson 6.5: Feature Engineering for Time Series
We've seen how classical models like ARIMA use the inherent structure of a time series. This lesson shows how to manually extract that structure to create predictive features for machine learning models. We'll cover the two main techniques: creating lagged features and rolling window statistics.
Part 1: The Challenge - ML Models Have No Memory
Models like Linear Regression or XGBoost are "static": they have no built-in concept of time or order. Feed them a raw time series of stock prices and they treat each row as an independent observation, so they cannot learn anything from the sequence. Shuffle the rows of the training data and the fitted model comes out exactly the same.
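To make this concrete, here is a minimal sketch (the toy data and variable names are illustrative, not part of the lesson): an ordinary least-squares fit recovers identical coefficients whether or not the training rows are shuffled, as long as features and targets are shuffled together.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))                  # toy feature matrix
y = X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.standard_normal(100)

perm = rng.permutation(len(X))                     # shuffle rows jointly
coef_ordered = LinearRegression().fit(X, y).coef_
coef_shuffled = LinearRegression().fit(X[perm], y[perm]).coef_

print(np.allclose(coef_ordered, coef_shuffled))    # True: row order is irrelevant
```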
Our job in feature engineering is to **encode the time-dependent information into the features themselves**, transforming a sequence problem into a standard tabular regression problem.
Part 2: The Two Core Techniques
Technique 1: Lag Features
This is the most fundamental technique. We create new columns in our dataset that represent the value of a variable from a previous time step.
If our target is to predict $y_t$, our features would be $y_{t-1}, y_{t-2}, \dots, y_{t-k}$. This explicitly gives the model the "rear-view mirror" information that an AR model would learn automatically.
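A tiny example of how `shift` produces this alignment (a sketch with a toy series, not the lesson's data):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13], name='y')
demo = pd.DataFrame({'y_t': s, 'y_t_minus_1': s.shift(1)})
print(demo)
# Each row now pairs the target with the previous value; the first
# row's lag is NaN because there is no earlier observation.
```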
Technique 2: Rolling Window Features
This technique creates features that summarize the behavior of a variable over a recent period. It's a way of capturing local trends and momentum.
For example, we can create:
- Rolling Mean: The average return over the last 5 days (roughly one trading week), a simple momentum measure.
- Rolling Standard Deviation: The volatility over the last 21 days (roughly one trading month).
- Rolling Min/Max: The highest and lowest price over the last 63 days (roughly one trading quarter); a sketch of this variant follows below.
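The implementation in Part 3 covers rolling means and standard deviations; here is a sketch of the min/max variant (the `price` series is hypothetical, invented purely for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical random-walk price series, for illustration only
price = pd.Series(100 + np.random.randn(200).cumsum(), name='price')

rolling_max_63 = price.rolling(window=63).max()   # highest price over ~one quarter
rolling_min_63 = price.rolling(window=63).min()   # lowest price over ~one quarter

# A normalized version often makes a better feature:
pct_off_high = price / rolling_max_63 - 1          # how far below the 63-day high
```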
Part 3: Python Implementation
```python
import pandas as pd
import numpy as np

# Create a sample time series of daily returns
np.random.seed(42)
returns = pd.Series(np.random.randn(200) / 100, name='returns')
df = pd.DataFrame(returns)

# --- Lag Features ---
# Create lags of the returns to use as predictors
for i in range(1, 6):
    df[f'lag_{i}'] = df['returns'].shift(i)

# --- Rolling Window Features ---
# Create rolling statistics over different windows
windows = [5, 21, 63]
for w in windows:
    # Rolling mean (momentum)
    df[f'rolling_mean_{w}'] = df['returns'].rolling(window=w).mean()
    # Rolling standard deviation (volatility)
    df[f'rolling_std_{w}'] = df['returns'].rolling(window=w).std()

# --- Date-Based Features ---
df.index = pd.date_range(start='2023-01-01', periods=len(df), freq='D')
df['day_of_week'] = df.index.dayofweek  # Monday=0, Sunday=6
df['month'] = df.index.month

# Drop the leading NaNs created by the lags and rolling windows
df.dropna(inplace=True)

# Define the target variable (e.g., predict the next day's return)
df['target'] = df['returns'].shift(-1)
df.dropna(inplace=True)  # the last row has no next-day return

# The 'lag' and 'rolling' columns can now serve as features (X) to predict 'target' (y)
print(df.head())
```
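For example, to separate the feature matrix from the target vector (a sketch; the column names follow the code above):

```python
# Everything except the raw series and the target serves as a feature
X = df.drop(columns=['returns', 'target'])
y = df['target']
print(X.shape, y.shape)
```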
What's Next? Framing the Forecasting Problem
We have now learned how to transform a time series into a rich feature set that a static ML model can understand.
The next step is to put this into practice. We will formalize the process of using these features to train a model like XGBoost for a real forecasting task. This involves not just feature engineering, but also a crucial change to how we perform our train-test split to avoid "looking into the future."
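As a quick preview of that split (a minimal sketch continuing from the DataFrame built above, not the full procedure of the next lesson): the key is to cut the data by position in time rather than shuffling it.

```python
# Chronological split: train on the past, evaluate on the future (no shuffling)
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]
```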