Splitting Data for Training and Testing

When building predictive models, you must ensure that your evaluation of model performance is fair and unbiased. If you train and test your model on the same data, your evaluation is very likely to overestimate how well the model will perform on new, unseen data. To prevent this, you should split your dataset into separate training and test sets. The training set is used to fit the model, while the test set is held back and used only to evaluate the model's predictive ability. This separation mimics the real-world scenario where you apply your model to new data, and it helps you detect overfitting.

library(rsample)

# Reproducible random draws
set.seed(123)

# Toy data frame with two predictors and a numeric outcome
df <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  y = rnorm(100)
)

# Reserve 70% of the rows for training, 30% for testing
split_obj <- initial_split(df, prop = 0.7)

train_data <- training(split_obj)
test_data <- testing(split_obj)

print(dim(train_data))
print(dim(test_data))

In the code, initial_split() from the rsample package creates a split object that contains information about how the data frame is partitioned. The prop argument controls the proportion of data assigned to the training set; for example, prop = 0.7 means 70% of the data is used for training and the remainder for testing. The training() and testing() functions extract the actual data frames for the training and test sets from the split object. This approach ensures a clear separation between the data used to fit the model and the data reserved for evaluation.
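
To see how the two sets are used downstream, here is a minimal sketch that continues the example above; the model formula and the RMSE calculation are illustrative assumptions, not a prescribed workflow.

# Fit a simple linear model on the training data only
model <- lm(y ~ x1 + x2, data = train_data)

# Predict on the held-out test data
test_pred <- predict(model, newdata = test_data)

# Root mean squared error on the test set estimates out-of-sample performance
rmse <- sqrt(mean((test_data$y - test_pred)^2))
print(rmse)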

Note

Be careful to avoid data leakage, which happens when information from the test set influences the training process — this leads to overly optimistic model performance. Always split your data before any preprocessing or feature engineering that could leak information. Additionally, if your outcome variable is highly imbalanced (for example, rare events in classification), consider using stratified sampling with initial_split() to preserve the class proportions in both training and test sets.
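
As a sketch of the stratified option mentioned above, initial_split() accepts a strata argument; the data frame, its class column, and the 10%/90% imbalance here are invented purely for illustration.

library(rsample)

set.seed(123)

# Hypothetical imbalanced classification data: roughly 10% positive class
df_class <- data.frame(
  x = rnorm(200),
  class = factor(sample(c("pos", "neg"), 200, replace = TRUE, prob = c(0.1, 0.9)))
)

# Stratify the split on the outcome so both sets keep similar class proportions
split_strat <- initial_split(df_class, prop = 0.7, strata = class)

# Compare class proportions in the training and test sets
prop.table(table(training(split_strat)$class))
prop.table(table(testing(split_strat)$class))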


Why is it important to split your data into training and test sets when building predictive models?

