Summary  
This chapter covers how to implement validation rules in code—such as range checks and uniqueness checks—by adding logical flags with conditional functions (e.g., ifelse) and enforcing strict validation that halts execution on failure (e.g., stopifnot).

General domain of usage  
Data cleaning workflows

Data validation is a **crucial step** in any data cleaning workflow. It involves checking that your data meets certain expectations or **rules** before further analysis or processing. By applying data validation, you ensure that your data is **accurate**, **consistent**, and **reliable**, which helps prevent errors and misleading conclusions.

Common validation rules include:
- Range checks: such as ensuring ages fall between 0 and 120;
- Type checks: verifying that a column contains only numbers or only dates;
- Uniqueness checks: making sure that IDs are not repeated.

These rules help catch errors early, maintain **data integrity**, and support trustworthy results.
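Of the three rule types listed above, the type check is the one not demonstrated elsewhere in this chapter. A minimal sketch follows, using a made-up `records` data frame (the column names and values are assumptions for illustration only). It checks that one column is numeric and that a text column can be parsed as dates in `YYYY-MM-DD` form.

```r
# Hypothetical sample data: `signup` is stored as text and one entry is not a date
records <- data.frame(
  id = c(1, 2, 3),
  amount = c(10.5, 20, 7.25),
  signup = c("2024-01-15", "not recorded", "2024-03-01"),
  stringsAsFactors = FALSE
)

# Type check on a whole column: is `amount` stored as numbers?
amount_is_numeric <- is.numeric(records$amount)

# Type check for dates: parse each value; entries that cannot be parsed become NA
parsed <- as.Date(records$signup, format = "%Y-%m-%d")
records$signup_valid <- !is.na(parsed)

print(amount_is_numeric)
print(records)
```

Parsing with an explicit `format` is a design choice here: unparseable values turn into `NA` instead of raising an error, so every row still receives a flag.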

```r
# Sample data frame
df <- data.frame(
  id = c(101, 102, 103, 102),
  age = c(25, 130, 45, 33)
)

# Range check for age: should be between 0 and 120
df$age_valid <- ifelse(df$age >= 0 & df$age <= 120, TRUE, FALSE)

# Uniqueness check for id: flag duplicated IDs
df$id_unique <- !duplicated(df$id)

print(df)
```

In the code above, you see two common validation checks in action. The **range check** for `age` uses a logical condition to determine whether each value falls between `0` and `120`, flagging any violation as `FALSE`. The **uniqueness check** for `id` uses the `duplicated()` function to identify repeated IDs: the first occurrence of a value is flagged as unique, and every later occurrence is flagged as not unique. Both checks add a new logical column to the data frame that shows where a rule is violated. In the sample data, row 2 (age `130`) fails the range check and row 4 (the second `102`) fails the uniqueness check.
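Because `duplicated()` leaves the first occurrence unflagged, it can be useful to mark every row that shares an ID and to pull out the rows that break a rule. The sketch below is one possible way to do this; the column name `id_repeated` and the object `out_of_range` are hypothetical, and the code assumes there are no missing values in `age`.

```r
# Same sample data as in the chapter
df <- data.frame(
  id = c(101, 102, 103, 102),
  age = c(25, 130, 45, 33)
)

# Flag every row whose id appears more than once, not only the later copies:
# scan once from the top and once from the bottom, then combine the results
df$id_repeated <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)

# Extract the rows that violate the range rule for closer inspection
out_of_range <- df[!(df$age >= 0 & df$age <= 120), ]

print(df)
print(out_of_range)
```

Here both rows with ID `102` are marked as repeated, which makes it easier to decide which copy to keep.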

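```r
# Using ifelse() to label each age as "valid" or "invalid"
df$age_flag <- ifelse(df$age >= 0 & df$age <= 120, "valid", "invalid")

# Using stopifnot() to halt execution if any age is out of range.
# With the sample data above this line raises an error, because age 130 fails the check.
stopifnot(all(df$age >= 0 & df$age <= 120))
```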

The `ifelse()` function is useful for flagging invalid entries: it assigns a label such as **"invalid"** to each value that breaks a rule and lets the script continue. `stopifnot()`, by contrast, enforces **strict validation**: if any value fails the rule, R stops execution immediately and signals an error. This approach is helpful when your data must meet every requirement before processing continues.
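The default error from `stopifnot()` only repeats the failed expression. One way to make the failure easier to read is to name the argument, which R 4.0.0 and later use as the error message. The sketch below assumes such an R version and wraps the check in `tryCatch()` purely so the example can show the message without stopping the script; the vector `ages` and the message text are illustrative.

```r
# Illustrative input: one age is outside the allowed range
ages <- c(25, 130, 45, 33)

result <- tryCatch(
  {
    # The argument name becomes the error message (R >= 4.0.0)
    stopifnot("age must be between 0 and 120" = all(ages >= 0 & ages <= 120))
    "validation passed"
  },
  # Capture the error text instead of halting, for demonstration only
  error = function(e) conditionMessage(e)
)

print(result)
```

In a real pipeline you would usually call `stopifnot()` without `tryCatch()`, so that invalid data actually halts the run.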

**Data validation** is the process of checking that data values meet specific rules or constraints before they are used.

**Examples of business rules:**
- Age must be between `0` and `120`;
- Email addresses must contain the `@` character;
- Order dates must not be in the future.
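The three business rules above can each be written as a logical condition. The following is a rough sketch on a made-up `orders` data frame (all column names and values are assumptions). The `@` test only confirms that the character is present and is not a full email validator.

```r
# Hypothetical order data with one violation of each rule
orders <- data.frame(
  age = c(34, 150, 28),
  email = c("ana@example.com", "bob.example.com", "cy@example.org"),
  order_date = as.Date(c("2023-05-01", "2024-01-10", "2999-12-31")),
  stringsAsFactors = FALSE
)

# Rule 1: age must be between 0 and 120
orders$age_ok <- orders$age >= 0 & orders$age <= 120

# Rule 2: email must contain the "@" character (fixed = TRUE means a literal match)
orders$email_ok <- grepl("@", orders$email, fixed = TRUE)

# Rule 3: order date must not be later than today
orders$date_ok <- orders$order_date <= Sys.Date()

# A row passes only if it satisfies every rule
orders$all_ok <- orders$age_ok & orders$email_ok & orders$date_ok

print(orders)
```

Keeping one flag column per rule, plus a combined `all_ok` column, makes it clear which rule a given row broke.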


What is the purpose of data validation in data cleaning?

Which R function can be used to halt execution if a validation rule is violated?

Give an example of a data validation rule for a column containing dates.
