Splitting Data for Training and Testing | Model Evaluation and Machine Learning Workflows
R for Data Scientists

Splitting Data for Training and Testing

When building predictive models, you must ensure that your evaluation of model performance is fair and unbiased. If you train and test your model on the same data, your evaluation will very likely overestimate how well the model performs on new, unseen data. To prevent this, you should split your dataset into separate training and test sets. The training set is used to fit the model, while the test set is held back and used only to evaluate the model's predictive ability. This separation mimics the real-world scenario where you apply your model to new data, and it helps you detect overfitting.

library(rsample)

set.seed(123)

# Simulated data: two predictors and a numeric outcome
df <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  y = rnorm(100)
)

# Reserve 70% of the rows for training, 30% for testing
split_obj <- initial_split(df, prop = 0.7)

# Extract the actual training and test data frames
train_data <- training(split_obj)
test_data <- testing(split_obj)

print(dim(train_data))
print(dim(test_data))

In the code, initial_split() from the rsample package creates a split object that contains information about how the data frame is partitioned. The prop argument controls the proportion of data assigned to the training set; for example, prop = 0.7 means 70% of the data is used for training and the remainder for testing. The training() and testing() functions extract the actual data frames for the training and test sets from the split object. This approach ensures a clear separation between the data used to fit the model and the data reserved for evaluation.
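As a quick sanity check, you can confirm that the two sets do not overlap and together account for every row of the original data frame. This is a minimal sketch that continues from the objects created above (df, split_obj, train_data, test_data):

# Continues from the previous chunk: df, train_data, and test_data already exist
# The training and test sets should partition the original rows exactly
nrow(train_data) + nrow(test_data) == nrow(df)   # expected: TRUE

# Printing the split object summarizes how many rows went to the
# training set, the test set, and the full data
print(split_obj)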

Note

Be careful to avoid data leakage, which happens when information from the test set influences the training process — this leads to overly optimistic model performance. Always split your data before any preprocessing or feature engineering that could leak information. Additionally, if your outcome variable is highly imbalanced (for example, rare events in classification), consider using stratified sampling with initial_split() to preserve the class proportions in both training and test sets.
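For illustration, here is a minimal sketch of a stratified split on a hypothetical imbalanced binary outcome. The data frame (df_class), the class proportions, and the object names are made up for this example, but the strata argument of initial_split() is the standard way to preserve class proportions across the two sets:

library(rsample)

set.seed(123)

# Hypothetical imbalanced classification data: about 20% "yes", 80% "no"
df_class <- data.frame(
  x1 = rnorm(200),
  y  = factor(sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.2, 0.8)))
)

# strata = y keeps the yes/no proportions similar in training and test sets
split_strat <- initial_split(df_class, prop = 0.7, strata = y)
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)

# Compare class proportions across the two sets
prop.table(table(train_strat$y))
prop.table(table(test_strat$y))

Without stratification, a random split of a rare outcome can by chance leave the test set with very few (or no) positive cases, which makes evaluation metrics unstable.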

Why is it important to split your data into training and test sets when building predictive models?
