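Time series data also calls for a different kind of cross-validation, because the chronological order of the observations must be respected. By default, the PyCaret regression module evaluates models with random k-fold cross-validation. That default is not suitable for time series data, since random folds let a model train on observations that come after the ones it is validated on.
The following sections of this tutorial demonstrate how easily the default setting in the PyCaret regression module can be changed so that it works for time series data.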
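To make the difference concrete, here is a minimal sketch of an expanding-window split, the scheme usually meant by time series cross-validation: each fold trains on an initial stretch of the data and validates on the block that immediately follows it. The function name `expanding_window_splits` and its parameters are hypothetical and only for illustration. This is not PyCaret's internal implementation; the fold boundaries are modeled on the arithmetic scikit-learn's `TimeSeriesSplit` uses with its default settings.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in chronological order."""
    if n_splits < 1 or n_samples < n_splits + 1:
        raise ValueError("need at least n_splits + 1 samples")
    # every validation block has the same length
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        # train on everything before the validation block
        train_idx = list(range(0, test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


# 12 observations in time order, 3 folds
folds = list(expanding_window_splits(n_samples=12, n_splits=3))
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train={train_idx} test={test_idx}")
```

In every fold, each training index precedes each validation index, which is the property that random k-fold does not guarantee.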
👉 Dataset
**# read csv file**
import pandas as pd
data = pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Date'])
data.head()
**# create 12 month moving average**
data['MA12'] = data['Passengers'].rolling(12).mean()
**# plot the data and MA**
import plotly.express as px
fig = px.line(data, x="Date", y=["Passengers", "MA12"], template = 'plotly_dark')
fig.show()
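A short sketch of what `rolling(12).mean()` produces may help when reading the chart. The series below is synthetic (the integers 1 to 24), not the airline data. It shows that the first 11 values are `NaN` because a full 12-observation window is not yet available, and that the optional `min_periods` argument relaxes this.

```python
import pandas as pd

# synthetic monthly values, for illustration only
values = pd.Series(range(1, 25), dtype="float64")

# default: a value is produced only once 12 observations are available
ma12 = values.rolling(12).mean()

# with min_periods=1 the average uses however many observations exist so far
ma12_partial = values.rolling(12, min_periods=1).mean()

print(ma12.isna().sum())      # count of leading NaN values
print(ma12.iloc[11])          # first complete window: mean of 1..12
print(ma12_partial.iloc[0])   # mean of the single first value
```

This is why the moving-average line in the plot starts later than the raw series.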
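Since algorithms cannot work with dates directly, let's extract a few simple features from the dates, such as month and year, and drop the original date column.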
**# extract month and year from dates**
data['Month'] = [i.month for i in data['Date']]
data['Year'] = [i.year for i in data['Date']]
**# create a sequence of numbers**
import numpy as np
data['Series'] = np.arange(1,len(data)+1)
**# drop unnecessary columns and re-arrange**
data.drop(['Date', 'MA12'], axis=1, inplace=True)
data = data[['Series', 'Year', 'Month', 'Passengers']]
**# check the head of the dataset**
data.head()
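The list comprehensions used above for `Month` and `Year` work, but pandas also exposes the same fields through the `.dt` accessor of a datetime column, which avoids a Python-level loop. The following is a sketch on a small made-up frame (the `demo` name and its three dates are illustrative assumptions), not a change to the tutorial's pipeline.

```python
import pandas as pd

# small illustrative frame; the dates are made up for this example
demo = pd.DataFrame(
    {"Date": pd.to_datetime(["1949-01-01", "1949-02-01", "1950-01-01"])}
)

# vectorized equivalents of the list comprehensions
demo["Month"] = demo["Date"].dt.month
demo["Year"] = demo["Date"].dt.year

print(demo)
```

Both approaches yield the same integer columns; the accessor form mainly matters for readability and for larger datasets.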
**# split data into train-test set**
train = data[data['Year'] < 1960]
test = data[data['Year'] >= 1960]
**# check shape**
train.shape, test.shape
>>> ((132, 4), (12, 4))
Based on cross-validated MAE, the best model is **Least Angle Regression** (MAE: 22.3). Let's check the score on the test set.
prediction_holdout = predict_model(best);
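For readers who want to reproduce such a comparison by hand, MAE is the mean of the absolute differences between actual and predicted values. The sketch below uses made-up numbers, not results from this tutorial, and the helper names `mean_absolute_error` and `relative_gap` are hypothetical. It also shows how a hold-out score can be expressed as a percentage change over a cross-validated score.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_gap(holdout_score, cv_score):
    """Percentage by which the hold-out score exceeds the CV score."""
    return (holdout_score - cv_score) / cv_score * 100


# made-up values, for illustration only
actual = [100, 120, 140, 160]
predicted = [110, 115, 150, 155]

mae = mean_absolute_error(actual, predicted)
print(mae)
print(relative_gap(holdout_score=mae, cv_score=6.0))
```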
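The MAE on the test set is 12% higher than the cross-validated MAE. Not great, but we will work with it. Let's plot the actual and predicted lines to visualize the fit.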
**# generate predictions on the original dataset**
predictions = predict_model(best, data=data)
**# add a date column in the dataset**
predictions['Date'] = pd.date_range(start='1949-01-01', end = '1960-12-01', freq = 'MS')
**# line plot**
fig = px.line(predictions, x='Date', y=["Passengers", "Label"], template = 'plotly_dark')
**# add a vertical rectangle for test-set separation**
fig.add_vrect(x0="1960-01-01", x1="1960-12-01", fillcolor="grey", opacity=0.25, line_width=0)
fig.show()
future_dates = pd.date_range(start = '1961-01-01', end = '1965-01-01', freq = 'MS')
future_df = pd.DataFrame()
future_df['Month'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]
future_df['Series'] = np.arange(145,(145+len(future_dates)))
future_df.head()