Useful links for plotting, etc:
A stochastic process is a theoretical mathematical object: its value at each time step is a random variable with a probability distribution. Each of these figures shows one realisation of that stochastic process. Stochastic processes are used as mathematical models of time series.
Autocorrelation violates a key assumption (independent observations) of many statistical models. For this reason, TSA shares properties with the study of spatial data: many of the same statistics, and much of the same language. Once you have a model that describes some data, it is natural to ask whether it fits well enough to make predictions. For time series, this is the art of forecasting.
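To make this concrete, a minimal sketch (illustrative only, not the lecture data) drawing several realisations of one hypothetical AR(1) stochastic process:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
phi = 0.8  # AR(1) coefficient, chosen arbitrarily for illustration
n_steps, n_realisations = 200, 5

for _ in range(n_realisations):
    eps = rng.standard_normal(n_steps)
    x = np.zeros(n_steps)
    for t in range(1, n_steps):
        x[t] = phi * x[t - 1] + eps[t]
    plt.plot(x, alpha=0.7)  # each trace is one realisation of the same process
plt.title("Five realisations of one AR(1) process")
plt.xlabel("t")
plt.show()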
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_acf
import warnings
warnings.filterwarnings('ignore')
plt.rcParams.update({'font.size': 20})
# Increase default figure size
plt.rcParams['figure.figsize'] = (12, 6)
Sunspot data downloaded from SILSO (Monthly mean total sunspot number [1/1749 - now], CSV format)
df = pd.read_csv("tsa_lecture_files/SN_m_tot_V2.0.csv", header=None, delimiter=';', names=[
    "year",
    "month",
    "decimal_date",
    "SN",
    "SN_std",
    "nb_obs",
    "provisional",
])
df["day"] = "1" # For creating a timestamp
df["SN"] = df["SN"].replace(-1, np.nan) # Replace -1 with NaN
df["Timestamp"] = pd.to_datetime(df[["year", "month", "day"]])
ts = df[["Timestamp", "SN"]].set_index("Timestamp")
ts = ts["1973":]
ts.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 628 entries, 1973-01-01 to 2025-04-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   SN      628 non-null    float64
dtypes: float64(1)
memory usage: 9.8 KB
ts.plot(ylabel="Sunspot Number", legend=False, xlabel="", figsize=(20, 4));
#plt.plot(ts) # No need to specify x and y
Let's make an interactive one
import plotly.express as px
fig = px.line(ts, x=ts.index, y="SN", height=400, width=1200)
fig.show()
We are often looking to decompose a time series into three parts: a trend, a seasonal (periodic) component, and a residual (noise).
Do we notice any of those here?
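For a quick automated look, a sketch using statsmodels' classical decomposition (assuming the ts defined above and an approximate 132-month, i.e. 11-year, period):
from statsmodels.tsa.seasonal import seasonal_decompose

# Classical additive decomposition into trend, seasonal, and residual panels
decomposition = seasonal_decompose(ts["SN"].dropna(), model="additive", period=132)
fig = decomposition.plot()
fig.set_size_inches(12, 8)
plt.show()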
ts_smoothed = ts.rolling(window=12).mean()
ts_downsampled = ts.resample('YE').mean()
plt.figure(figsize=(20, 4))
plt.plot(ts, label="Original Time Series", alpha=0.5)
plt.plot(ts_downsampled, label="Annual Mean", marker='o')
plt.plot(ts_smoothed, label="12-Month Rolling Mean")
plt.ylabel("Sunspot Number")
plt.legend()
from statsmodels.tsa import stattools as tsa_funcs
white_noise = np.random.randn(1000)
plt.plot(white_noise)
acf = tsa_funcs.acf(white_noise, nlags=len(ts), adjusted=True)
plt.plot(acf)
plt.axhline(0, color="black", lw=0.5, ls="--")
ACF of sunspots
acf = tsa_funcs.acf(ts, nlags=len(ts), adjusted=True)
plt.plot(acf)
plt.axhline(0, color='black', lw=0.5, ls='--')
plt.plot(acf)
# Add a vertical line at the lag corresponding to the solar cycle
# (nominally 11 years; the periodogram peak below is at ~10.3 years)
plt.axvline(12 * 10.3, color='red', label="~11-year cycle")
plt.legend()
As we know from Fourier (1822), "any signal can be decomposed into a sum of periodic components". In practice this is computed with the efficient Fast Fourier Transform (FFT), which requires evenly spaced data; if the data are unevenly spaced, the Lomb-Scargle periodogram can be used instead.
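A minimal Lomb-Scargle sketch on synthetic, irregularly sampled data (a hypothetical 0.1 Hz sine, not the sunspot series):
rng = np.random.default_rng(0)
t_uneven = np.sort(rng.uniform(0, 100, 300))  # irregular sample times (s)
y = np.sin(2 * np.pi * 0.1 * t_uneven) + 0.5 * rng.standard_normal(300)

ang_freqs = np.linspace(0.01, 2.0, 1000)  # angular frequencies (rad/s)
pgram = signal.lombscargle(t_uneven, y - y.mean(), ang_freqs)

plt.plot(ang_freqs / (2 * np.pi), pgram)  # convert rad/s to Hz
plt.axvline(0.1, color="red", ls="--", label="true frequency (0.1 Hz)")
plt.xlabel("frequency [Hz]")
plt.ylabel("Lomb-Scargle power")
plt.legend()
plt.show()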
Moving from the time domain to the frequency domain runs into issues such as edge effects and spectral leakage, which require remedies such as windowing, pre-whitening, and post-darkening.
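As a sketch of one such remedy, Welch's method windows and averages overlapping segments; compare it against the raw periodogram of the white noise generated above:
# Hann-windowed, segment-averaged PSD vs the raw periodogram
f_w, Pxx_w = signal.welch(white_noise, fs=1, window="hann", nperseg=256)
f_p, Pxx_p = signal.periodogram(white_noise, fs=1)

plt.loglog(f_p[1:], Pxx_p[1:], alpha=0.4, label="Raw periodogram")
plt.loglog(f_w[1:], Pxx_w[1:], label="Welch (Hann window)")
plt.xlabel("frequency [Hz]")
plt.ylabel("PSD")
plt.legend()
plt.show()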
PSD of white noise
fs = 1
f, Pxx_den = signal.periodogram(x=white_noise, fs=fs)
plt.loglog(f[1:], Pxx_den[1:])
PSD of sunspots
s_in_month = (60*60*24*30)
fs = 1/s_in_month
f, Pxx_den = signal.periodogram(x=ts.SN, fs=fs)
plt.loglog(f[1:], Pxx_den[1:])
plt.axvline(1/(s_in_month*12*11), color='red', linestyle='--', label='1/(11 years)')
plt.legend()
plt.xlabel('frequency [Hz]')
plt.ylabel('PSD')
plt.show()
peak_f = f[Pxx_den==Pxx_den.max()]
peak_period_s = 1/peak_f
peak_period_years = peak_period_s / (60*60*24*365)
peak_period_years
array([10.32328767])
# Simple seasonal decomposition (assuming exactly 11-year cycle)
seasonal_period = int(12*11)
# Calculate seasonal component by averaging over all cycles
ts_clean = ts["SN"]
n_full_cycles = len(ts_clean) // seasonal_period
ts_truncated = ts_clean[:n_full_cycles * seasonal_period]
# Reshape to separate cycles and compute mean seasonal pattern
ts_reshaped = ts_truncated.values.reshape(-1, seasonal_period)
seasonal_pattern = np.mean(ts_reshaped, axis=0)
for i in range(ts_reshaped.shape[0]):
    plt.plot(ts_reshaped[i], alpha=0.2, color='blue')
plt.plot(range(seasonal_period), seasonal_pattern, 'r-', linewidth=2, marker='o', markersize=3)
plt.title('11-Year Seasonal Pattern')
plt.xlabel('Months in Cycle')
plt.ylabel('Seasonal Component')
plt.grid(True, alpha=0.3)
# Extend seasonal pattern to match original series length
n_repeats = len(ts_clean) // seasonal_period + 1
seasonal_extended = np.tile(seasonal_pattern, n_repeats)[:len(ts_clean)]
# Remove seasonal component
ts_deseasonalized = ts_clean - seasonal_extended
plt.figure(figsize=(7, 12))
# Original series
plt.subplot(3, 1, 1)
plt.plot(ts_clean.index, ts_clean.values, 'b-', alpha=0.7)
plt.title('Original Time Series')
plt.axhline(0, color='black', lw=0.5, ls='--')
plt.ylim(-10,250)
plt.grid(True, alpha=0.3)
plt.subplot(3, 1, 2)
plt.plot(ts_clean.index, seasonal_extended, 'c-')
plt.title('- Average Seasonality')
plt.grid(True, alpha=0.3)
plt.ylim(-10,250)
# Deseasonalized series
plt.subplot(3, 1, 3)
plt.plot(ts_deseasonalized.index, ts_deseasonalized.values, 'g-', alpha=0.7)
plt.title('= De-seasonalized Time Series')
plt.axhline(0, color='black', lw=0.5, ls='--')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Verify seasonal removal with autocorrelation
fig, ax = plt.subplots(1,2,figsize=(10, 4))
acf_clean = tsa_funcs.acf(ts_clean, nlags=len(ts))
ax[0].plot(acf_clean)
ax[0].axhline(0, color='black', lw=0.5, ls='--')
ax[0].set_title("ACF (Original)")
acf_deseason = tsa_funcs.acf(ts_deseasonalized, nlags=len(ts))
ax[1].plot(acf_deseason)
ax[1].axhline(0, color="black", lw=0.5, ls="--")
ax[1].set_title("ACF (Deseasonalized)")
plt.tight_layout()
plt.show()
ts.diff(1).plot(title="Differenced Sunspot Number");
Diagram from Manu H
We can demonstrate this by simulating a random walk process, also known as Brownian motion or a Wiener process:
for i in range(1):  # increase the range to overlay multiple realisations
    white_noise = np.random.randn(1000)
    rw = np.cumsum(white_noise)
    plt.plot(rw)
plt.axhline(0, c="black", linestyle=":")
The random walk is an example of a non-ergodic process: its expectation is 0 at every time, yet the sample mean and variance of a single realisation can vary wildly, diverging rather than converging as $T\rightarrow\infty$.
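A sketch of this non-ergodicity with 200 independent realisations: the time average of each walk is itself random and does not settle at the ensemble mean of 0.
rng = np.random.default_rng(1)
time_averages = [np.cumsum(rng.standard_normal(1000)).mean() for _ in range(200)]

plt.hist(time_averages, bins=30)
plt.axvline(0, color="black", ls="--", label="ensemble mean")
plt.xlabel("Time average of one realisation")
plt.legend()
plt.show()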
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Prepare data for forecast models
scaler = MinMaxScaler()
ts_scaled = scaler.fit_transform(ts.dropna().values.reshape(-1, 1))
# Create sequences using sliding window approach
# creating many-to-one sequences with y being the last value
def create_sequences(data, seq_length):
    X, y = [], []
    for i in range(seq_length, len(data)):
        X.append(data[i - seq_length : i, 0])
        y.append(data[i, 0])
    return np.array(X), np.array(y)
seq_length = 24  # Each sequence will contain 24 time steps (2 years of monthly data)
X, y = create_sequences(ts_scaled, seq_length)
# Split data
train_size = int(0.8 * len(X))
X_train, X_test = X[:train_size], X[train_size:]
y_train, y_test = y[:train_size], y[train_size:]
# Reshape for LSTM
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
# Unscaled series for ARIMA
ts_values = ts.dropna().values
# Align ARIMA training set with LSTM's by using same cutoff index (seq_length + train_size)
split_point = seq_length + train_size
arima_train, arima_test = ts_values[:split_point], ts_values[split_point:]
# Fit ARIMA on training data
arima_model = ARIMA(arima_train, order=(2, 1, 2))
arima_result = arima_model.fit()
# Forecast same number of steps as test set
arima_pred = arima_result.forecast(steps=len(y_test)).reshape(-1, 1)
arima_pred_original = arima_pred # Already in original scale
# Build LSTM model
model_lstm = Sequential([
    LSTM(50, return_sequences=True, input_shape=(seq_length, 1)),
    LSTM(50),
    Dense(1)
])
model_lstm.compile(optimizer='adam', loss='mse')
# Train LSTM
model_lstm.fit(X_train, y_train, epochs=100, batch_size=32, verbose=0)
# Forecast with LSTM
lstm_pred = model_lstm.predict(X_test)
lstm_pred_original = scaler.inverse_transform(lstm_pred)
y_test_original = scaler.inverse_transform(y_test.reshape(-1, 1))
Gaussian process regression (GPR) uses an explicit model of the autocorrelation via a covariance matrix, or kernel.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
# Flatten sequence input for GPR (LSTM uses 3D, GPR expects 2D)
X_train_gpr = X_train.reshape((X_train.shape[0], X_train.shape[1]))
X_test_gpr = X_test.reshape((X_test.shape[0], X_test.shape[1]))
# Define kernel and train GPR
kernel = C(1.0, (1e-3, 1e3)) * RBF(length_scale=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10, normalize_y=True)
gpr.fit(X_train_gpr, y_train)
gpr_pred = gpr.predict(X_test_gpr).reshape(-1, 1)
gpr_pred_original = scaler.inverse_transform(gpr_pred)
# Get confidence intervals; the std is rescaled by the data range only
# (inverse_transform would also shift it by the data minimum, which is wrong for a std)
gpr_pred_mean, gpr_pred_std = gpr.predict(X_test_gpr, return_std=True)
gpr_pred_mean_original = scaler.inverse_transform(gpr_pred_mean.reshape(-1, 1))
gpr_pred_std_original = gpr_pred_std.reshape(-1, 1) * scaler.data_range_
# Plot LSTM results
plt.figure(figsize=(10, 4))
test_index = ts.index[train_size + seq_length:]  # first test timestamp follows the training windows
plt.plot(ts[-500:], label='Training data', alpha=0.5, c="black")
plt.plot(test_index, y_test_original, label='Test data', color='black')
plt.plot(test_index, lstm_pred_original, label='LSTM Forecast', color='red')
plt.plot(test_index, arima_pred_original, label='ARIMA Forecast', color='green')
plt.plot(test_index, gpr_pred_original, label='GPR Forecast with 95% CI', color='orange')
plt.fill_between(
    test_index,
gpr_pred_mean_original.flatten() - 1.96 * gpr_pred_std_original.flatten(),
gpr_pred_mean_original.flatten() + 1.96 * gpr_pred_std_original.flatten(),
color="orange",
alpha=0.2,
)
plt.legend()
plt.title('Forecast Comparison')
plt.show()
import plotly.graph_objects as go
# Define x-axis values
test_index = ts.index[train_size + seq_length :]
fig = go.Figure()
# Plot Training Data
fig.add_trace(
go.Scatter(
x=ts.index[-500:],
y=ts[-500:].values.flatten(),
mode="lines",
name="Training data",
line=dict(color="black", width=1),
opacity=0.5,
)
)
# Plot Test Data
fig.add_trace(
go.Scatter(
x=test_index,
y=y_test_original.flatten(),
mode="lines",
name="Test data",
line=dict(color="black"),
)
)
# LSTM Forecast
fig.add_trace(
go.Scatter(
x=test_index,
y=lstm_pred_original.flatten(),
mode="lines",
name="LSTM Forecast",
line=dict(color="red"),
)
)
# ARIMA Forecast
fig.add_trace(
go.Scatter(
x=test_index,
y=arima_pred_original.flatten(),
mode="lines",
name="ARIMA Forecast",
line=dict(color="green"),
)
)
# GPR Forecast
fig.add_trace(
go.Scatter(
x=test_index,
y=gpr_pred_original.flatten(),
mode="lines",
name="GPR Forecast with 95% CI",
line=dict(color="orange"),
)
)
# GPR Confidence Interval
fig.add_trace(
go.Scatter(
x=ts.index[-len(y_test) :],
y=gpr_pred_mean_original.flatten() + 1.96 * gpr_pred_std_original.flatten(),
fill=None,
mode="lines",
line=dict(color="orange", width=0),
showlegend=False,
)
)
fig.add_trace(
go.Scatter(
x=ts.index[-len(y_test) :],
y=gpr_pred_mean_original.flatten() - 1.96 * gpr_pred_std_original.flatten(),
fill="tonexty",
mode="lines",
line=dict(color="orange", width=0),
showlegend=False,
opacity=0.2,
)
)
# Show plot
fig.show()
# Calculate error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error
def calculate_metrics(y_true, y_pred):
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
return mse, mae
mse_lstm, mae_lstm = calculate_metrics(y_test_original, lstm_pred_original)
mse_arima, mae_arima = calculate_metrics(y_test_original, arima_pred_original)
mse_gpr, mae_gpr = calculate_metrics(y_test_original, gpr_pred_original)
print(f"LSTM - MSE: {mse_lstm:.2f}, MAE: {mae_lstm:.2f}")
print(f"ARIMA - MSE: {mse_arima:.2f}, MAE: {mae_arima:.2f}")
print(f"GPR - MSE: {mse_gpr:.2f}, MAE: {mae_gpr:.2f}")
LSTM - MSE: 230.18, MAE: 10.41
ARIMA - MSE: 2987.62, MAE: 43.83
GPR - MSE: 4154.32, MAE: 57.84
Solar wind is a stream of charged particles that flows from the Sun. We are interested in its turbulent properties, which requires time series analysis of its magnetic field as measured by spacecraft.
psp = pd.read_pickle("tsa_lecture_files/solar_wind_magnetic_field_psp_2018.pkl")
psp.plot(ylabel="$B_R$ (nT)", figsize=(15, 3), legend=False)
plt.title("PSP Magnetic Field Data")
acf = tsa_funcs.acf(psp, nlags=len(psp), adjusted=True)
plt.plot(acf)
plt.axhline(0, color="black", lw=0.5, ls="--")
f, Pxx_den = signal.periodogram(x=psp, fs=1)
plt.loglog(f[1:], Pxx_den[1:])
# Add -5/3 slope line
x = np.log10(f[1:])
y = -5 / 3 * x
plt.plot(10**x, 10**y/10, color='red', linestyle='--', label='Theoretical turbulent scaling law')
plt.legend()
plt.xlabel("frequency [Hz]")
plt.ylabel("PSD")
plt.show()
We can quantify the level of noise with the Signal-to-Noise Ratio (SNR):
$$\mathrm{SNR}=\frac{\text{Var}(\text{signal})}{\text{Var}(\text{noise})}=\frac{\text{Var}(X)}{\sigma_{\epsilon}^2}$$

What happens to the power spectrum if we add noise?
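Before answering, a quick sketch of the SNR definition above, assuming the unit-variance Gaussian noise ($\sigma_\epsilon=1$) added in the next cell:
sigma_eps = 1.0  # std of the Gaussian noise added to psp below
snr = np.var(psp.values) / sigma_eps**2
print(f"SNR = {snr:.1f}")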
psp_with_noise = psp + np.random.normal(0, 1, size=len(psp))
plt.plot(psp_with_noise, label='Time Series with Noise', color='purple', alpha=0.5)
plt.plot(psp, label='Original Time Series', color='blue')
plt.legend()
Effect of noise on power spectrum
f, Pxx_den = signal.periodogram(x=psp, fs=1)
f, Pxx_den_noise = signal.periodogram(x=psp_with_noise, fs=1)
f, Pxx_den_noise_only = signal.periodogram(x=psp_with_noise - psp, fs=1)
plt.loglog(f[1:], Pxx_den[1:], label="Original", color='black', alpha=0.5)
plt.loglog(f[1:], Pxx_den_noise[1:], label="With Noise", color="black")
plt.loglog(f[1:], Pxx_den_noise_only[1:], label="Noise Only", color='red', alpha=0.3)
plt.legend()
plt.xlabel("frequency [Hz]")
plt.ylabel("PSD")
plt.show()
This gives us the concept of a noise floor: spectral features with power below it cannot be meaningfully interpreted. The Nyquist frequency, $f_s/2$, imposes an additional limit on the highest frequency we can resolve.
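As a sketch, we can overlay both limits on the noisy spectrum from above, assuming the unit-variance white noise used earlier (for which the one-sided PSD level is $2\sigma_\epsilon^2/f_s$):
sigma_eps, fs = 1.0, 1.0
noise_floor = 2 * sigma_eps**2 / fs  # flat one-sided PSD level of the white noise

plt.loglog(f[1:], Pxx_den_noise[1:], color="black", label="With Noise")
plt.axhline(noise_floor, color="red", ls="--", label="Theoretical noise floor")
plt.axvline(fs / 2, color="blue", ls=":", label="Nyquist frequency")
plt.xlabel("frequency [Hz]")
plt.ylabel("PSD")
plt.legend()
plt.show()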
We have assumed the following here:
$$X_{obs}(t)=X_{signal}(t)+\epsilon, \qquad \epsilon\sim N(0,\sigma^2_{\epsilon})$$

An important consideration in many cases is noise that is not stationary or not Gaussian, i.e. not white noise, e.g. in gravitational-wave astronomy.
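For contrast with this white-noise assumption, a sketch of "colored" noise, generated here with a simple AR(1) filter; its PSD is no longer flat:
rng = np.random.default_rng(2)
white = rng.standard_normal(5000)
colored = signal.lfilter([1.0], [1.0, -0.9], white)  # AR(1) filter, phi = 0.9

f_w, P_w = signal.periodogram(white, fs=1)
f_c, P_c = signal.periodogram(colored, fs=1)
plt.loglog(f_w[1:], P_w[1:], alpha=0.5, label="White noise")
plt.loglog(f_c[1:], P_c[1:], label="Colored (AR(1)) noise")
plt.xlabel("frequency [Hz]")
plt.ylabel("PSD")
plt.legend()
plt.show()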