Train/Test Split & Cross Validation | Time Series Data Processing

# Train/Test Split & Cross Validation

The final topic on time series is preparing data for training and testing machine learning models: the train/test split and cross-validation.

When splitting time series data into training and testing sets, the temporal order of the observations must be preserved. Unlike with other types of datasets, random sampling is not appropriate here: it places observations from the future in the training set, which leaks information about the test period into the model and makes the evaluation of its performance overly optimistic.

The most common method for splitting time series data is to use a fixed point in time as the split point between the training and testing sets. The training set includes all the observations before the split point, while the testing set includes all the observations after the split point.

For example, it can look like this:
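A minimal sketch of a fixed-point split, using only the standard library. The daily series, its values, and the 1 October split date are all made up for illustration. Assigning the observation on the split date itself to the testing set is also an assumption.

```python
from datetime import date, timedelta

# Hypothetical daily series for 2023: (date, value) pairs in time order.
start = date(2023, 1, 1)
series = [(start + timedelta(days=i), float(i)) for i in range(365)]

# Assumed split point: observations before this date are used for training.
split_point = date(2023, 10, 1)

# No shuffling: the split depends only on each observation's timestamp.
train = [(d, v) for d, v in series if d < split_point]
test = [(d, v) for d, v in series if d >= split_point]

print(f"train: {train[0][0]} to {train[-1][0]} ({len(train)} rows)")
print(f"test:  {test[0][0]} to {test[-1][0]} ({len(test)} rows)")
```

Because every training date precedes every testing date, the model is never evaluated on a period it could have seen during training.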

Cross-validation works on the same idea: at each iteration, split the training set into two parts (as before), keeping the validation set always ahead of the training set in time. At the first iteration, the candidate model is trained on the weather data from January to March and validated on April's data; at the next iteration, it is trained on data from January to April and validated on May's data, and so on to the end of the training set. There are 5 such iterations in total.
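The iterations described above can be sketched in plain Python. This sketch assumes the training set runs from January to August, which is what gives 5 iterations when the first one trains on January to March. The helper `expanding_window_splits` is a hypothetical name, not a library function.

```python
# Assumed training period: eight months, so that there are 5 iterations.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug"]


def expanding_window_splits(periods, initial_train_size):
    """Yield (train, validation) pairs; validation is the next single period."""
    for end in range(initial_train_size, len(periods)):
        # The training window grows by one period at each iteration.
        yield periods[:end], periods[end:end + 1]


folds = list(expanding_window_splits(months, initial_train_size=3))
for i, (train_months, val_months) in enumerate(folds, start=1):
    print(f"Iteration {i}: train {train_months[0]}-{train_months[-1]}, "
          f"validate {val_months[0]}")
```

The training window always starts in January and only grows, so each validation month lies strictly after the data the model was trained on.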

We'll look at the time series cross-validator from the `scikit-learn` library:
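A short sketch using `TimeSeriesSplit` from `sklearn.model_selection`, which is presumably the cross-validator meant here. The toy array of 12 time-ordered observations is made up for illustration, and `n_splits=5` is chosen to match the 5 iterations described above.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: 12 observations in time order, one feature each.
X = np.arange(12).reshape(-1, 1)

# Each split trains on an expanding prefix and validates on what follows it.
tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

for fold, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold}: train={train_idx.tolist()}, "
          f"validation={val_idx.tolist()}")
```

The splitter works on row positions rather than timestamps, so the data must already be sorted by time before it is passed in.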

Section 4. Chapter 5