Splitting Data for Training and Testing | Model Evaluation and Machine Learning Workflows
R for Data Scientists

Splitting Data for Training and Testing

When building predictive models, you must ensure that your evaluation of model performance is fair and unbiased. If you train and test your model on the same data, your evaluation is very likely to overestimate how well the model will perform on new, unseen data. To prevent this, you should split your dataset into separate training and test sets. The training set is used to fit the model, while the test set is held back and used only to evaluate the model's predictive ability. This separation mimics the real-world scenario where you apply your model to new data, and it helps you detect overfitting.

library(rsample)

set.seed(123)

# Simulated data with two predictors and a numeric outcome
df <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  y  = rnorm(100)
)

# Create a 70/30 train-test split
split_obj <- initial_split(df, prop = 0.7)

# Extract the training and test sets from the split object
train_data <- training(split_obj)
test_data <- testing(split_obj)

print(dim(train_data))
print(dim(test_data))

In the code, initial_split() from the rsample package creates a split object that contains information about how the data frame is partitioned. The prop argument controls the proportion of data assigned to the training set; for example, prop = 0.7 means 70% of the data is used for training and the remainder for testing. The training() and testing() functions extract the actual data frames for the training and test sets from the split object. This approach ensures a clear separation between the data used to fit the model and the data reserved for evaluation.
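To see why this separation matters in practice, you can fit a model on the training set only and then score it on the held-out test set. The sketch below continues from the split above; the choice of a plain linear model and RMSE as the error metric are illustrative assumptions, not part of the original example.

# Fit a simple linear model on the training data only
model <- lm(y ~ x1 + x2, data = train_data)

# Predict on the held-out test set and compute the test RMSE
preds <- predict(model, newdata = test_data)
rmse <- sqrt(mean((test_data$y - preds)^2))
print(rmse)

Because the test rows played no role in fitting the model, this RMSE is an honest estimate of how the model would perform on new data.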

Note

Be careful to avoid data leakage, which happens when information from the test set influences the training process — this leads to overly optimistic model performance. Always split your data before any preprocessing or feature engineering that could leak information. Additionally, if your outcome variable is highly imbalanced (for example, rare events in classification), consider using stratified sampling with initial_split() to preserve the class proportions in both training and test sets.
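As a minimal sketch of stratified splitting, the snippet below uses a hypothetical data frame df_class with an imbalanced binary outcome column class (both names are assumptions for illustration). Passing the outcome to the strata argument of initial_split() keeps the class proportions roughly equal in the training and test sets.

library(rsample)
set.seed(123)

# Hypothetical imbalanced classification data (about 10% positives)
df_class <- data.frame(
  x1    = rnorm(200),
  class = factor(rbinom(200, 1, 0.1))
)

# Stratify on the outcome so both sets preserve the class proportions
split_strat <- initial_split(df_class, prop = 0.7, strata = class)

# Compare class proportions in the training and test sets
prop.table(table(training(split_strat)$class))
prop.table(table(testing(split_strat)$class))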


Why is it important to split your data into training and test sets when building predictive models?

