Train/Test Split & Cross Validation

The final time series topic is preparing data for training and testing machine learning models: the train/test split and cross-validation.

When splitting time series data into training and testing sets, the temporal order of the observations must be preserved. Unlike with many other kinds of datasets, random sampling is not appropriate here: it mixes past and future observations, so the model can end up being trained on data that comes after the data it is tested on. This is a form of data leakage and leads to a biased (usually overly optimistic) evaluation of the model's performance.
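
To make the leakage concrete, here is a minimal sketch on a made-up daily series (the `series` frame, the 80/20 proportion and the `2013-10-01` cutoff are assumptions chosen only for illustration, not part of the weather example). It counts how many training rows are dated later than the earliest test row under a random split and under a time-based split.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series, used only for illustration
dates = pd.date_range('2013-01-01', periods=365, freq='D')
series = pd.DataFrame({'observation_time': dates, 'value': np.arange(365)})

# Random split: shuffle the row positions and take roughly 80% for training
rng = np.random.default_rng(0)
shuffled = rng.permutation(len(series))
random_train = series.iloc[shuffled[:292]]
random_test = series.iloc[shuffled[292:]]

# Time-based split: everything before the cutoff is training data
cutoff = pd.Timestamp('2013-10-01')
time_train = series[series['observation_time'] < cutoff]
time_test = series[series['observation_time'] >= cutoff]

# Count training rows that are later than the earliest test row
leaked_random = (random_train['observation_time'] > random_test['observation_time'].min()).sum()
leaked_time = (time_train['observation_time'] > time_test['observation_time'].min()).sum()

print(f'Random split: {leaked_random} training rows lie after the start of the test set')
print(f'Time split:   {leaked_time} training rows lie after the start of the test set')
```

With the time-based split the count is zero by construction, while the random split scatters training rows throughout the test period.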

The most common method for splitting time series data is to use a fixed point in time as the split point between the training and testing sets. The training set includes all the observations before the split point, while the testing set includes all the observations after the split point.

For example, it can look like this:


import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data

df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
print(df.head(10))

# Split data into training and test sets based on time
train = df.loc[df['observation_time'] < '2013-08-01']
test = df.loc[df['observation_time'] >= '2013-08-01']

Cross-validation for time series follows the same idea: at each iteration the training set is split into two parts, and the validation part is always later in time than the part used for fitting. In the first iteration, the candidate model is trained on the weather data from January to March and validated on April's data; in the next iteration, it is trained on January to April and validated on May's data, and so on until the end of the training set. The number of iterations is a parameter; the scikit-learn example later in this section uses 5 (`n_splits=5`), and there the folds are defined by row count rather than by calendar month.
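
The month-by-month scheme described above can be sketched by hand with pandas. This is only an illustration under stated assumptions: the `train` frame is a made-up daily series running from January through August (chosen so that starting with April as the first validation month yields five iterations), and the `folds` list is a hypothetical helper structure, not part of any library.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering January through August 2013
dates = pd.date_range('2013-01-01', '2013-08-31', freq='D')
train = pd.DataFrame({'observation_time': dates, 'value': np.arange(len(dates))})

month = train['observation_time'].dt.month
folds = []

# Validation month moves from April (4) to August (8);
# the fitting part always contains every earlier month
for val_month in range(4, 9):
    fit_part = train[month < val_month]
    val_part = train[month == val_month]
    folds.append((fit_part, val_part))
    print(f'fit on months 1-{val_month - 1} ({len(fit_part)} rows), '
          f'validate on month {val_month} ({len(val_part)} rows)')
```

Each fold's fitting part grows by one month, and its validation part never overlaps with or precedes the data used for fitting.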

We'll look at the time series cross-validator from the scikit-learn library:


from sklearn.model_selection import TimeSeriesSplit
import statsmodels.api as sm
import pandas as pd

# Load the dataset
df = sm.datasets.get_rdataset('weather', 'nycflights13').data

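df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)

# TimeSeriesSplit splits by row position, so put the rows in chronological order first
df = df.sort_values('observation_time').reset_index(drop=True)
print(df.head())

# Create the time series cross-validator with 5 splits
tscv = TimeSeriesSplit(n_splits=5)

# Print the row indices of the training and test part of each fold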
for i, (train_index, test_index) in enumerate(tscv.split(df)):
    print(f'Fold {i}:')
    print(f'  Train: index={train_index}')
    print(f'  Test:  index={test_index}')
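
To see how `TimeSeriesSplit` chooses its folds, it can help to run it on a tiny input. The sketch below uses 12 placeholder rows (a hypothetical stand-in for 12 equally sized time periods, not the weather data) and shows that the folds are defined purely by row position: the training part grows at each iteration and the test part is the block of rows immediately after it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder rows, assumed to be already in chronological order
X = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for i, (train_index, test_index) in enumerate(splits):
    print(f'Fold {i}: train={train_index.tolist()}, test={test_index.tolist()}')
```

Because the splitter works on row positions and never looks at the timestamps, the rows must be sorted by time before calling `split`.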
