Lesson 6.6: Using ML for Time Series

We've engineered our time series features. Now, we apply a standard machine learning workflow—with one crucial, time-aware twist. This lesson explains how to properly frame a time series forecasting problem for an ML model like XGBoost, focusing on the correct way to perform a train-test split to avoid look-ahead bias.

Part 1: From Sequence to Tabular Data

Thanks to the feature engineering in our last lesson, we have successfully transformed our problem. What was once a single sequence of numbers is now a standard tabular dataset where each row is a time step, the columns are our engineered features, and the target is a future value.

Our problem is now $\hat{Y}_{t+1} = f(Y_t, Y_{t-1}, \dots, \text{rolling\_mean}_t, \dots)$. We can now apply any supervised learning model, such as XGBoost, to learn this function $f$.
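
To make that concrete, here is a minimal sketch of the tabular framing, assuming the raw series lives in a pandas Series named `series` (the feature names and the 7-step window are illustrative, not prescribed by the lesson):

import pandas as pd

# Each row corresponds to a time step t: the features summarize the past,
# and the target is the value one step ahead.
features = pd.DataFrame({
    "y_t": series,                               # Y_t
    "y_t_minus_1": series.shift(1),              # Y_{t-1}
    "rolling_mean_t": series.rolling(7).mean(),  # rolling mean up to and including t
})
df = features.assign(target=series.shift(-1)).dropna()  # target column holds Y_{t+1}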

Part 2: The Critical Twist - The Time-Aware Split

In standard ML, we use `train_test_split`, which **randomly shuffles** the data before splitting (it does so by default). For time series, this is a catastrophic error. It's like training a model to predict Monday's stock price using data from Wednesday and then testing it on Tuesday. It completely destroys the temporal order and creates **look-ahead bias**, leading to unrealistically optimistic results.
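
For reference, this behaviour is controlled by scikit-learn's `shuffle` flag; a minimal sketch, assuming `X` holds the feature columns and `y` the target column of the tabular dataset above:

from sklearn.model_selection import train_test_split

# Default behaviour: rows are shuffled before splitting, which is fine for
# i.i.d. data but leaks future rows into the training set of a time series.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# shuffle=False keeps rows in their original (chronological) order instead.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)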

The Walk-Forward Split

For time series, we must perform a simple **chronological split**. The training set must consist of all data *before* a certain point in time, and the test set must consist of all data *after* that point.

  • Train: All data from 2010 to 2020.
  • Test: All data from 2021 onwards.

Implementation

This is simpler than a random split. You just slice the DataFrame.

# Chronological 80/20 split: the oldest 80% of rows for training,
# the most recent 20% for testing.
train_size = int(len(df) * 0.8)
train_df = df.iloc[:train_size]
test_df = df.iloc[train_size:]
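
If `df` has a DatetimeIndex, the same idea works with a calendar cutoff instead of a row count, e.g. `train_df = df.loc[:'2020']` and `test_df = df.loc['2021':]`, matching the example above (train through 2020, test from 2021 onwards).

From there, fitting and evaluating a model follows the usual supervised pattern; a minimal sketch, assuming the `xgboost` package is installed and that `df` contains the engineered feature columns plus a `target` column:

from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

feature_cols = [c for c in df.columns if c != "target"]

# Fit on the past, evaluate on the future (never the other way around).
model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(train_df[feature_cols], train_df["target"])

preds = model.predict(test_df[feature_cols])
print("Test MAE:", mean_absolute_error(test_df["target"], preds))

For repeated evaluation over several cutoffs rather than a single split, scikit-learn's `TimeSeriesSplit` produces train/test index pairs that always keep the test fold after its training fold.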

What's Next? The Perils of Backtesting

We now have a complete workflow for applying ML to time series data. However, our simple train-test split is just the beginning. The process of evaluating a trading strategy on historical data is called **backtesting**, and it is filled with subtle traps and biases.

In our next lesson, we will explore the common pitfalls of backtesting, including look-ahead bias in more complex forms and the dangers of overfitting your strategy to the historical data (data snooping).