Data Preprocessing
Train/Test Split & Cross Validation
The final topic on time series is preparing data for training and testing machine learning models: the train/test split and cross-validation.
When splitting time series data into training and testing sets, it is important to consider the temporal aspect of the data. Unlike with datasets whose observations are independent of one another, random sampling is not appropriate for time series data: it places observations from the future into the training set, so information about the test period leaks into the model and the evaluation of its performance becomes overly optimistic.
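To make the leakage concrete, here is a minimal sketch that compares a shuffled split with a chronological one. The daily series is synthetic and the helper `overlaps_in_time` is a hypothetical name introduced only for this illustration; `train_test_split` is the standard scikit-learn function.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic daily series (made-up data, for illustration only)
rng = np.random.default_rng(0)
ts = pd.DataFrame({
    "observation_time": pd.date_range("2013-01-01", periods=1000, freq="D"),
    "temp": rng.normal(size=1000).cumsum(),
})

# Random split: rows are shuffled before being assigned to train and test
rand_train, rand_test = train_test_split(ts, test_size=0.2, random_state=0)

# Chronological split: shuffle=False keeps the last 20% of rows as the test set
chron_train, chron_test = train_test_split(ts, test_size=0.2, shuffle=False)

def overlaps_in_time(train, test):
    """Return True if some training row is later than the earliest test row."""
    return train["observation_time"].max() > test["observation_time"].min()

print("Random split mixes future into training:", overlaps_in_time(rand_train, rand_test))
print("Chronological split mixes future into training:", overlaps_in_time(chron_train, chron_test))
```

In the shuffled split the model would be fitted on days that come after some of the days it is later tested on, which is exactly the situation a real forecast never has.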
The most common method for splitting time series data is to use a fixed point in time as the split point between the training and testing sets. The training set includes all the observations before the split point, while the testing set includes all the observations after the split point.
For example, it can look like this:
```python
import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
print(df.head(10))

# Split data into training and test sets based on time
train = df.loc[df['observation_time'] < '2013-08-01']
test = df.loc[df['observation_time'] >= '2013-08-01']
```
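The same split can be wrapped in a small helper that also checks the result. This is only a sketch: `split_at` is a hypothetical function, not part of pandas or statsmodels, and the daily data is synthetic so that the example runs without downloading anything.

```python
import pandas as pd

def split_at(df, time_col, split_point):
    """Return (train, test): rows strictly before split_point, and rows at or after it."""
    cutoff = pd.Timestamp(split_point)
    ordered = df.sort_values(time_col)
    train = ordered[ordered[time_col] < cutoff]
    test = ordered[ordered[time_col] >= cutoff]
    if train.empty or test.empty:
        raise ValueError("split_point leaves one of the two sets empty")
    return train, test

# Synthetic daily data covering 2013 (made-up values, for illustration only)
demo = pd.DataFrame({"observation_time": pd.date_range("2013-01-01", "2013-12-31", freq="D")})
demo["value"] = range(len(demo))

train, test = split_at(demo, "observation_time", "2013-08-01")
print(len(train), len(test))
print(train["observation_time"].max(), test["observation_time"].min())
```

Printing the last training timestamp next to the first test timestamp is a quick way to confirm that the two sets do not overlap.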
Cross-validation works on the same idea: at each iteration, the training set is split into two parts (as before), and the validation part always comes after the training part in time. At the first iteration, the candidate model is trained on the weather data from January to March and validated on April's data; at the next iteration, it is trained on data from January to April and validated on May's data, and so on to the end of the training set. With the training set from the split above (January to July), that makes 4 such iterations, validating on April, May, June and July in turn.
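The month-by-month scheme described above can be sketched by hand with pandas. The generator `expanding_month_folds` is a hypothetical helper written for this illustration, and the daily data is a synthetic stand-in for a training set that covers January to July 2013, which yields one fold per validation month from April to July.

```python
import pandas as pd

def expanding_month_folds(df, time_col, first_val_month, last_val_month):
    """Yield (train, validation) pairs: each validation set is one calendar
    month, and the training set is every row from earlier months."""
    row_month = df[time_col].dt.to_period("M")
    for month in pd.period_range(first_val_month, last_val_month, freq="M"):
        yield df[row_month < month], df[row_month == month]

# Synthetic daily data standing in for the January-July training set
train_set = pd.DataFrame(
    {"observation_time": pd.date_range("2013-01-01", "2013-07-31", freq="D")}
)
train_set["temp"] = range(len(train_set))

folds = list(expanding_month_folds(train_set, "observation_time", "2013-04", "2013-07"))
for i, (tr, val) in enumerate(folds):
    train_end = tr["observation_time"].max().date()
    val_month = val["observation_time"].min().strftime("%B")
    print(f"Fold {i}: train up to {train_end} ({len(tr)} rows), "
          f"validate on {val_month} ({len(val)} rows)")
```

Each fold's training window grows by one month while the validation month moves forward, so the model is never validated on data older than what it was trained on.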
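We'll look at the time series cross-validator from the scikit-learn library, TimeSeriesSplit. It splits by row position rather than by calendar month: the data is divided into consecutive blocks, and each fold tests on one block while training on all the rows before it.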
```python
from sklearn.model_selection import TimeSeriesSplit
import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)

# TimeSeriesSplit splits by row position, so the rows must be in chronological order
df = df.sort_values('observation_time').reset_index(drop=True)
print(df.head())

# Create the TimeSeriesSplit cross-validator
tscv = TimeSeriesSplit(n_splits=5)

# Generate the train and test indices for each fold
for i, (train_index, test_index) in enumerate(tscv.split(df)):
    print(f'Fold {i}:')
    print(f'  Train: index={train_index}')
    print(f'  Test:  index={test_index}')
```
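The indices produced by the cross-validator are normally passed on to a model evaluation routine. The sketch below shows one way to do that with `cross_val_score`; the noisy sine series, the two lag features and the choice of `Ridge` are all assumptions made for illustration, not part of the lesson's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic series: a noisy sine wave (made-up data, for illustration only)
rng = np.random.default_rng(42)
n = 500
series = pd.Series(np.sin(np.arange(n) / 20) + rng.normal(scale=0.1, size=n))

# Predict the current value from the two previous values (lag features)
frame = pd.DataFrame({
    "lag1": series.shift(1),
    "lag2": series.shift(2),
    "y": series,
}).dropna()
X, y = frame[["lag1", "lag2"]], frame["y"]

# Each fold trains on an initial stretch of rows and tests on the block that follows it
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(), X, y, cv=tscv, scoring="neg_mean_absolute_error")

# scikit-learn reports the negated error, so flip the sign for readability
print("MAE per fold:", -scores)
print("Mean MAE:", -scores.mean())
```

Because the rows of `X` are already in time order, every fold is scored on data that lies strictly after its training rows.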