Lesson 5.6: Taming Wild Data: Unit Roots and the ARIMA Model

This lesson provides the final, crucial piece of our time series toolkit. We confront the reality that most financial and economic data is non-stationary. We will explore the concept of a 'unit root' process, master the technique of 'differencing' to achieve stationarity, and combine everything we've learned into the celebrated Autoregressive Integrated Moving Average (ARIMA) model.

Part 1: The Problem - The Random Walk

Our entire framework so far—AR, MA, and ARMA models—is built on the bedrock of stationarity. But what happens when we look at a typical stock price chart? It trends. Its mean is clearly not constant. Its variance can change. It is profoundly non-stationary.

The most fundamental model for such behavior is the **Random Walk**. This model posits that the best forecast for tomorrow's price is simply today's price, plus an unpredictable random shock.

The Random Walk Model

Y_t = Y_{t-1} + \epsilon_t

where \epsilon_t is a white noise error term.

This is equivalent to an AR(1) model where the autoregressive coefficient \phi_1 is exactly equal to 1. When this happens, we say the process has a **unit root**.

Why a Unit Root Causes Non-Stationarity

A unit root process has "infinite memory." The effect of a shock never dies out. Let's see this by recursive substitution:

Y_t = Y_{t-1} + \epsilon_t
    = (Y_{t-2} + \epsilon_{t-1}) + \epsilon_t
    = (Y_{t-3} + \epsilon_{t-2}) + \epsilon_{t-1} + \epsilon_t
    \vdots
    = Y_0 + \sum_{i=1}^{t} \epsilon_i

The value today, Y_t, is the starting value plus the sum of all shocks that have ever occurred. The variance of this sum grows with time:

\text{Var}(Y_t) = \text{Var}\left(\sum_{i=1}^{t} \epsilon_i\right) = \sum_{i=1}^{t} \text{Var}(\epsilon_i) = t\sigma^2

Since the variance, t\sigma^2, is a function of time t, the process violates the condition of constant variance and is therefore non-stationary.
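This variance growth is easy to verify by simulation. The sketch below (an illustration assuming standard normal shocks, so \sigma^2 = 1) generates many independent random-walk paths and measures the cross-sectional variance at two time points:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 10_000, 500

# Each row is one random-walk path: Y_t is the cumulative sum of the first t shocks (Y_0 = 0).
paths = rng.standard_normal((n_paths, n_steps)).cumsum(axis=1)

# With sigma^2 = 1, theory predicts Var(Y_t) = t.
var_100 = paths[:, 99].var()   # variance across paths at t = 100
var_400 = paths[:, 399].var()  # variance across paths at t = 400

print(var_100, var_400)  # close to 100 and 400, respectively
```

The empirical variances track t almost exactly, confirming that a unit-root process has no stable variance to revert to.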

Part 2: The Solution - The 'I' for Integrated

We cannot build an ARMA model on a random walk. But what if we could transform the data to make it stationary? The solution is beautifully simple: **differencing**.

If we look at our random walk equation, Y_t = Y_{t-1} + \epsilon_t, and simply rearrange it, we get:

Y_t - Y_{t-1} = \epsilon_t

The *difference* between consecutive values is just a white noise error term, which is, by definition, stationary! By modeling the *change* in the series instead of its *level*, we have moved from quicksand to bedrock.

The Concept of Integration

A time series that can be made stationary by differencing it d times is said to be **integrated of order d**, denoted I(d).

  • A stationary series is I(0) (it needs 0 differences).
  • A random walk is I(1) (it needs 1 difference). Most financial prices are I(1).
  • A series whose first difference is itself a random walk is I(2). This is rare in finance.

The "I" in ARIMA stands for Integrated. It is the part of the model that handles non-stationarity by applying differencing.

Part 3: The Full ARIMA(p,d,q) Model

We can now put everything together. An Autoregressive Integrated Moving Average model, or ARIMA(p,d,q), is simply an ARMA(p,q) model applied to a differenced time series.

The ARIMA(p,d,q) Model

The model is defined by three orders:

  • p: The order of the Autoregressive component (lags of the differenced series).
  • d: The degree of differencing required to make the series stationary.
  • q: The order of the Moving Average component (lags of the error terms).

If we let Y'_t be the differenced series (e.g., Y'_t = Y_t - Y_{t-1} for d=1), then the model is just an ARMA(p,q) on Y'_t:

Y'_t = c + \phi_1 Y'_{t-1} + \dots + \phi_p Y'_{t-p} + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} + \epsilon_t
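For concreteness, substituting Y'_t = Y_t - Y_{t-1} into an ARIMA(1,1,1) and solving for Y_t shows what the model looks like back on the original (level) scale:

```latex
Y_t - Y_{t-1} = c + \phi_1 (Y_{t-1} - Y_{t-2}) + \theta_1 \epsilon_{t-1} + \epsilon_t
\;\Longrightarrow\;
Y_t = c + (1 + \phi_1) Y_{t-1} - \phi_1 Y_{t-2} + \theta_1 \epsilon_{t-1} + \epsilon_t
```

In levels this resembles an AR(2) with an MA(1) term, but the autoregressive coefficients are constrained so that one root of the characteristic equation is exactly 1 — the unit root that differencing removes.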

Part 4: The Complete Modeling Workflow (The Box-Jenkins Methodology)

Building a robust ARIMA model is a systematic, iterative process. This famous methodology, developed by George Box and Gwilym Jenkins, is the gold standard for practitioners.

The Box-Jenkins Workflow

Step 1: Identification

  • Plot the time series. Check for trends, seasonality, and non-constant variance.
  • Use the Augmented Dickey-Fuller (ADF) test to formally test for a unit root.
  • If the series is non-stationary (p-value > 0.05), take the first difference. Repeat the ADF test on the differenced series. If it's still non-stationary, difference again (this is rare). The number of times you differenced is your d.
  • Plot the ACF and PACF of the final, stationary (differenced) series to get initial guesses for p and q based on their cutoff/decay patterns.

Step 2: Estimation

  • Fit your chosen ARIMA(p,d,q) model to the original, non-stationary data. The software will handle the differencing internally.
  • Use AIC/BIC to compare a few plausible candidate models (e.g., ARIMA(1,1,1), ARIMA(2,1,0), ARIMA(0,1,1)) and select the one with the lowest value.

Step 3: Diagnostic Checking

  • Examine the residuals of your best-fit model. The residuals must be white noise.
  • Plot the ACF of the residuals. There should be no significant spikes.
  • Use a formal test like the Ljung-Box test, where the null hypothesis is that the residuals are independently distributed. You want a high p-value for this test.
  • If the residuals are not white noise, your model is misspecified. Go back to Step 1 and try a different model structure.

Step 4: Forecasting

  • Once you have a validated model, use it to forecast future values. The software forecasts the differenced series and then "integrates" (un-differences) the results to convert them back to the original scale of your data.
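The "integration" in Step 4 is just a cumulative sum anchored at the last observed level. A small NumPy sketch of the arithmetic, using made-up forecasted differences for illustration:

```python
import numpy as np

last_level = 100.0                          # last observed value of the original series
forecast_diffs = np.array([0.5, 0.5, 0.5])  # hypothetical forecasts of the differenced series

# Undo the differencing: each level is the previous level plus the forecasted change.
forecast_levels = last_level + forecast_diffs.cumsum()
print(forecast_levels)  # [100.5 101.  101.5]
```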

Part 5: Python Implementation - A Full ARIMA Project

Modeling a Non-Stationary Series in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA

# --- Generate a sample I(1) series (a random walk with drift) ---
np.random.seed(42)
n_samples = 1000
drift = 0.1
random_walk = np.cumsum(drift + np.random.randn(n_samples))
data = pd.Series(random_walk)

# --- Step 1: Identification ---
# 1a. Plot the data - we see a clear upward trend.
data.plot(title="Non-Stationary Data (Random Walk with Drift)")
plt.show()

# 1b. ADF test on original data
adf_result = adfuller(data)
print(f'ADF p-value (original): {adf_result[1]}')
# We expect a high p-value, confirming non-stationarity.

# 1c. Difference the data and re-test (d=1)
data_diff = data.diff().dropna()
adf_result_diff = adfuller(data_diff)
print(f'ADF p-value (differenced): {adf_result_diff[1]}')
# We expect a very low p-value, confirming stationarity. Our d=1.

# 1d. Plot ACF/PACF of differenced data to find p and q
fig, ax = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(data_diff, ax=ax[0], title='ACF of Differenced Data')
plot_pacf(data_diff, ax=ax[1], title='PACF of Differenced Data')
plt.tight_layout()
plt.show()
# For a differenced random walk, the result is white noise.
# We expect both ACF and PACF to have no significant spikes. This suggests p=0, q=0.

# --- Step 2 & 3: Estimation and Diagnostics ---
# Based on our findings, we propose an ARIMA(0,1,0) model.
# This is a Random Walk with Drift model. With d=1 a plain constant would be
# differenced away, so the drift enters via a linear trend term (trend='t').
model = ARIMA(data, order=(0, 1, 0), trend='t')
model_fit = model.fit()
print(model_fit.summary())

# Check residuals: they should be white noise.
model_fit.plot_diagnostics(figsize=(15,12))
plt.show()

# --- Step 4: Forecasting ---
forecast_steps = 50
forecast = model_fit.get_forecast(steps=forecast_steps)
forecast_mean = forecast.predicted_mean
forecast_ci = forecast.conf_int()

# Plot
plt.figure(figsize=(12, 6))
plt.plot(data.index[-200:], data.values[-200:], label='Observed')
plt.plot(forecast_mean.index, forecast_mean.values, color='red', label='Forecast')
plt.fill_between(forecast_ci.index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.5)
plt.title('ARIMA(0,1,0) Forecast (Random Walk with Drift)')
plt.legend()
plt.show()
# The forecast for a random walk is just a straight line with the slope of the drift.

What's Next? Formalizing the Workflow

We have just walked through the complete, end-to-end process of building an ARIMA model. This is the single most important workflow in classical time series analysis.

The four stages—Identification, Estimation, Diagnostic Checking, and Forecasting—are so fundamental that they have a name: the **Box-Jenkins Methodology**.

In the next lesson, we will formalize this workflow, providing a detailed practitioner's checklist and discussing the nuances of each stage, particularly the art and science of model selection using information criteria.

Up Next: The Practitioner's Guide: The Box-Jenkins Methodology