Splitting Data for Training and Testing | Model Evaluation and Machine Learning Workflows
R for Data Scientists

Splitting Data for Training and Testing

When building predictive models, you must ensure that your evaluation of model performance is fair and unbiased. If you train and test your model on the same data, your evaluation is very likely to overestimate how well the model will perform on new, unseen data. To prevent this, you should split your dataset into separate training and test sets. The training set is used to fit the model, while the test set is held back and used only to evaluate the model's predictive ability. This separation mimics the real-world scenario where you apply your model to new data, and it helps you detect overfitting.

library(rsample)

set.seed(123)

# Simulated data with two predictors and a numeric outcome
df <- data.frame(
  x1 = rnorm(100),
  x2 = runif(100),
  y  = rnorm(100)
)

# Create a 70/30 train-test split
split_obj <- initial_split(df, prop = 0.7)

# Extract the training and test sets from the split object
train_data <- training(split_obj)
test_data <- testing(split_obj)

print(dim(train_data))
print(dim(test_data))

In the code, initial_split() from the rsample package creates a split object that contains information about how the data frame is partitioned. The prop argument controls the proportion of data assigned to the training set; for example, prop = 0.7 means 70% of the data is used for training and the remainder for testing. The training() and testing() functions extract the actual data frames for the training and test sets from the split object. This approach ensures a clear separation between the data used to fit the model and the data reserved for evaluation.
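To see why this separation matters in practice, you can fit a model on the training set only and then score it on the held-out test set. The sketch below continues from the split above; the choice of a plain linear model and RMSE as the error metric are illustrative assumptions, not part of the original example.

# Fit a simple linear model on the training data only
model <- lm(y ~ x1 + x2, data = train_data)

# Predict on the held-out test set and compute the test RMSE
preds <- predict(model, newdata = test_data)
rmse <- sqrt(mean((test_data$y - preds)^2))
print(rmse)

Because the test rows played no role in fitting the model, this RMSE is an honest estimate of how the model would perform on new data.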

Note

Be careful to avoid data leakage, which happens when information from the test set influences the training process — this leads to overly optimistic model performance. Always split your data before any preprocessing or feature engineering that could leak information. Additionally, if your outcome variable is highly imbalanced (for example, rare events in classification), consider using stratified sampling with initial_split() to preserve the class proportions in both training and test sets.
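As a minimal sketch of stratified splitting, the snippet below uses a hypothetical data frame df_class with an imbalanced binary outcome column class (both names are assumptions for illustration). Passing the outcome to the strata argument of initial_split() keeps the class proportions roughly equal in the training and test sets.

library(rsample)
set.seed(123)

# Hypothetical imbalanced classification data (about 10% positives)
df_class <- data.frame(
  x1    = rnorm(200),
  class = factor(rbinom(200, 1, 0.1))
)

# Stratify on the outcome so both sets preserve the class proportions
split_strat <- initial_split(df_class, prop = 0.7, strata = class)

# Compare class proportions in the training and test sets
prop.table(table(training(split_strat)$class))
prop.table(table(testing(split_strat)$class))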


Why is it important to split your data into training and test sets when building predictive models?

