Learn Data Validation Rules | Data Quality Essentials
Working with Text, Dates, and Data Cleaning in R

Data Validation Rules

Data validation is a crucial step in any data cleaning workflow. It involves checking that your data meets certain expectations or rules before further analysis or processing. By applying data validation, you ensure that your data is accurate, consistent, and reliable, which helps prevent errors and misleading conclusions.

Common validation rules include:

  • Range checks: such as ensuring ages fall between 0 and 120;
  • Type checks: verifying that a column contains only numbers or only dates;
  • Uniqueness checks: making sure that IDs are not repeated.

These rules help catch errors early, maintain data integrity, and support trustworthy results.

    # Sample data frame
    df <- data.frame(
      id = c(101, 102, 103, 102),
      age = c(25, 130, 45, 33)
    )

    # Range check for age: should be between 0 and 120
    # (the comparison already returns TRUE/FALSE, so ifelse() is not needed)
    df$age_valid <- df$age >= 0 & df$age <= 120

    # Uniqueness check for id: flag duplicated IDs
    df$id_unique <- !duplicated(df$id)

    print(df)

The code above applies two common validation checks. The range check for age evaluates a logical condition for each value and marks values outside 0 to 120 as FALSE; in the sample data, the age 130 fails. The uniqueness check for id uses duplicated(), which returns TRUE for every repeat of a value that has already appeared. Negating it marks the first occurrence of each ID as unique and later repeats as not unique, so only the second 102 is flagged. Both checks add a logical column to the data frame, showing which rows violate each rule.
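The list of common rules also mentions type checks, which the example does not cover. The following is a minimal sketch of one way to do them in base R. The records data frame and its column names are hypothetical, made up for illustration. It also shows how to flag every copy of a repeated ID rather than only the later ones.

```r
# Hypothetical sample data; column names and values are assumptions
records <- data.frame(
  id = c(101, 102, 103, 102),
  amount = c("19.99", "5", "abc", "42"),
  signup = c("2023-01-15", "15/01/2023", "2023-03-01", "not a date"),
  stringsAsFactors = FALSE
)

# Type check on a whole column
id_is_numeric <- is.numeric(records$id)          # numeric column
amount_is_numeric <- is.numeric(records$amount)  # character column

# Element-wise numeric check: values that cannot be converted become NA
amount_num <- suppressWarnings(as.numeric(records$amount))
records$amount_is_number <- !is.na(amount_num)

# Element-wise date check: values that do not match the format become NA
signup_date <- as.Date(records$signup, format = "%Y-%m-%d")
records$signup_is_date <- !is.na(signup_date)

# Flag every row whose id appears more than once, including the first copy
records$id_repeated <- duplicated(records$id) |
  duplicated(records$id, fromLast = TRUE)

print(records)
```

Converting and then testing for NA is a common way to check types value by value, but it also treats values that were already missing as invalid, so handle missing values separately if they are allowed.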

    # Using ifelse() to flag invalid ages
    df$age_flag <- ifelse(df$age >= 0 & df$age <= 120, "valid", "invalid")

    # Using stopifnot() to halt execution if any age is out of range
    # (with the sample data this raises an error, because age 130 is invalid)
    stopifnot(all(df$age >= 0 & df$age <= 120))

The ifelse() function is useful for flagging invalid entries: it assigns a label such as "invalid" to each value that breaks a rule and leaves the data in place for review. In contrast, stopifnot() enforces strict validation: if any value fails the rule, R stops execution and signals an error. With the sample data, the stopifnot() call above fails because the age 130 is out of range. This approach is helpful when the data must meet every requirement before processing continues.
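When a hard stop is too abrupt, the error from stopifnot() can be caught and reported. The sketch below is one possible pattern, not the only one. It rebuilds the lesson's sample data, wraps the check in tryCatch(), and then extracts the offending rows for inspection.

```r
# Same sample data as in the lesson; age 130 violates the range rule
df <- data.frame(
  id = c(101, 102, 103, 102),
  age = c(25, 130, 45, 33)
)

# Catch the validation error so the script can report it and continue
result <- tryCatch(
  {
    stopifnot(all(df$age >= 0 & df$age <= 120))
    "all ages valid"
  },
  error = function(e) paste("Validation failed:", conditionMessage(e))
)
print(result)

# Rows that broke the rule, for review before fixing or removing them
bad_rows <- df[!(df$age >= 0 & df$age <= 120), ]
print(bad_rows)
```

Catching the error suits interactive cleaning or reporting. In an automated pipeline, letting stopifnot() halt the run is often the safer default.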

Note
Definition

Data validation is the process of checking that data values meet specific rules or constraints before they are used.

Examples of business rules:

  • Age must be between 0 and 120;
  • Email addresses must contain the @ character;
  • Order dates must not be in the future.
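The three business rules above could be expressed in base R roughly as follows. This is a sketch under stated assumptions: the orders data frame, its column names, and its values are hypothetical. The email rule checks only for the @ character, as the rule states, and does not validate the full address.

```r
# Hypothetical orders table; column names and values are assumptions
orders <- data.frame(
  age = c(34, 150, 28),
  email = c("ana@example.com", "bob.example.com", "cy@example.org"),
  order_date = as.Date(c("2020-05-01", "2021-11-20", "2999-01-01")),
  stringsAsFactors = FALSE
)

# Age must be between 0 and 120
orders$age_ok <- orders$age >= 0 & orders$age <= 120

# Email addresses must contain the @ character (literal match, not a regex)
orders$email_ok <- grepl("@", orders$email, fixed = TRUE)

# Order dates must not be in the future
orders$date_ok <- orders$order_date <= Sys.Date()

# Rows that fail at least one rule
violations <- orders[!(orders$age_ok & orders$email_ok & orders$date_ok), ]
print(violations)
```

Keeping one logical column per rule makes it easy to see which rule each row failed.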

1. What is the purpose of data validation in data cleaning?

2. Which R function can be used to halt execution if a validation rule is violated?

3. Give an example of a data validation rule for a column containing dates.
