Train/Test Split & Cross Validation | Time Series Data Processing

Train/Test Split & Cross Validation

The final topic on time series is preparing data for training and testing machine learning models: the train/test split and cross-validation.

When splitting time series data into training and testing sets, it is important to respect the temporal order of the data. Unlike with other types of datasets, random sampling is not appropriate here: a random split places observations from after the test period into the training set, so the model effectively sees the future. This is a form of data leakage, and it leads to a biased (usually overly optimistic) evaluation of the model's performance.
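To make the leakage concrete, here is a minimal sketch on a hypothetical daily series. The names `df_demo`, `random_test`, and `chrono_test` are illustrative assumptions and not part of the course dataset. It compares a shuffled split with a chronological one by measuring what share of test rows is older than the newest training row.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series standing in for real observations (assumption)
times = pd.date_range("2013-01-01", periods=1000, freq="D")
df_demo = pd.DataFrame({"observation_time": times, "value": np.arange(1000)})

# Random split: shuffle the rows, then take 80% for training
shuffled = df_demo.sample(frac=1.0, random_state=0)
random_train, random_test = shuffled.iloc[:800], shuffled.iloc[800:]

# Chronological split: the first 80% of rows train, the last 20% test
chrono_train, chrono_test = df_demo.iloc[:800], df_demo.iloc[800:]

# Share of test rows that are older than the newest training row
random_overlap = (
    random_test["observation_time"] < random_train["observation_time"].max()
).mean()
chrono_overlap = (
    chrono_test["observation_time"] < chrono_train["observation_time"].max()
).mean()

print(f"Random split:        {random_overlap:.2f} of test rows precede training data")
print(f"Chronological split: {chrono_overlap:.2f} of test rows precede training data")
```

With the shuffled split, most test rows fall inside the period the model was trained on, while the chronological split has none.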

The most common method for splitting time series data is to use a fixed point in time as the boundary between the training and testing sets. The training set includes all the observations before the split point, while the testing set includes all the observations at or after it.

For example, it can look like this:

    import statsmodels.api as sm
    import pandas as pd

    # Load the dataset
    df = sm.datasets.get_rdataset('weather', 'nycflights13').data
    df['observation_time'] = pd.to_datetime(df.time_hour)
    df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
    print(df.head(10))

    # Split data into training and test sets based on time
    train = df.loc[df['observation_time'] < '2013-08-01']
    test = df.loc[df['observation_time'] >= '2013-08-01']

Cross-validation follows the same idea: at each iteration the training set is split into two parts, with the validation part always later in time than the part used for fitting. In the first iteration, the candidate model is trained on the weather data from January to March and validated on April's data; in the next, it is trained on January to April and validated on May's data, and so on to the end of the training set. There are 5 such iterations in total.
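The expanding-window scheme described above can be sketched by hand. This is only an illustration under stated assumptions: `series` is a hypothetical daily series covering January to August 2013 (not the course dataset), and the first validation month is April, which yields five folds.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature series for January to August 2013 (assumption)
times = pd.date_range("2013-01-01", "2013-08-31", freq="D")
rng = np.random.default_rng(0)
series = pd.DataFrame({"observation_time": times, "temp": rng.normal(size=len(times))})

month = series["observation_time"].dt.month

folds = []
# Validate on April, May, ..., August; train on every month before the validation month
for val_month in range(4, 9):
    train_fold = series[month < val_month]
    val_fold = series[month == val_month]
    folds.append((train_fold, val_fold))
    print(
        f"Validate on month {val_month}: "
        f"{len(train_fold)} training rows, {len(val_fold)} validation rows"
    )
```

Each fold trains on a longer history than the previous one, and no validation row is ever earlier than a training row.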

We'll look at the time series cross-validator from the scikit-learn library:

    from sklearn.model_selection import TimeSeriesSplit
    import statsmodels.api as sm
    import pandas as pd

    # Load the dataset
    df = sm.datasets.get_rdataset('weather', 'nycflights13').data
    df['observation_time'] = pd.to_datetime(df.time_hour)
    df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)

    # TimeSeriesSplit splits by row position, so put the rows in chronological order first
    df = df.sort_values('observation_time').reset_index(drop=True)
    print(df.head())

    # Create the TimeSeriesSplit cross-validator
    tscv = TimeSeriesSplit(n_splits=5)

    # Print the train and test indices of each fold
    for i, (train_index, test_index) in enumerate(tscv.split(df)):
        print(f'Fold {i}:')
        print(f'  Train: index={train_index}')
        print(f'  Test:  index={test_index}')
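To see the shape of the folds more easily, here is a small sketch that applies `TimeSeriesSplit` to twelve hypothetical observations instead of the full weather data. The array `X` and the names `tscv_demo` and `splits` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical observations, already in chronological order (assumption)
X = np.arange(12).reshape(-1, 1)

tscv_demo = TimeSeriesSplit(n_splits=5)

# Collect the index lists of every fold
splits = [(tr.tolist(), va.tolist()) for tr, va in tscv_demo.split(X)]

for i, (tr, va) in enumerate(splits):
    print(f"Fold {i}: train={tr} validate={va}")
# Fold 0: train=[0, 1] validate=[2, 3]
# ...
# Fold 4: train=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] validate=[10, 11]
```

The training window grows by one block per fold, and the validation block always comes directly after it.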

Section 4. Chapter 5