Data Transformation
Data transformation is a crucial step in preparing raw data for analysis. It involves modifying, adding, or recoding variables to make the data more meaningful and analysis-ready.
Creating New Columns
A common transformation is to calculate new metrics from existing columns. For example, you might want to calculate the price per kilometer to assess how cost-effective a vehicle is.
Base R
You can create a new column by using the $ operator to name it and assigning values to it.
df$price_per_km <- df$selling_price / df$km_driven
head(df)
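As a minimal sketch of the same calculation on a small, made-up data frame (the `cars` object and its values are hypothetical, only the column names follow the lesson), the example below also shows what happens when `km_driven` is zero, and one possible way to guard against it:

```r
# Hypothetical sample data; the column names follow the lesson's df.
cars <- data.frame(
  name          = c("Car A", "Car B", "Car C"),
  selling_price = c(450000, 120000, 800000),
  km_driven     = c(90000, 0, 40000)
)

# Plain division: a car with 0 km driven yields Inf.
cars$price_per_km <- cars$selling_price / cars$km_driven

# Optional guard (an assumption, not part of the lesson):
# treat zero mileage as missing before dividing.
km_safe <- ifelse(cars$km_driven > 0, cars$km_driven, NA)
cars$price_per_km_safe <- cars$selling_price / km_safe

print(cars)
```

Whether zero mileage should become NA, be filtered out, or be kept as Inf depends on the analysis; the guard here is just one option.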
dplyr
New columns can be added using the mutate() function. Inside mutate(), you specify the name of the new column and define how it should be calculated.
library(dplyr) # provides mutate() and the %>% pipe
df <- df %>%
  mutate(price_per_km = selling_price / km_driven)
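The assignment back to df matters because mutate() returns a modified copy rather than changing the data frame in place. A small sketch with hypothetical data (`cars` and `with_ratio` are illustrative names) makes this visible:

```r
library(dplyr)

# Hypothetical sample data; the column names follow the lesson's df.
cars <- data.frame(
  selling_price = c(450000, 120000),
  km_driven     = c(90000, 60000)
)

# mutate() returns a new data frame; `cars` itself is left unchanged.
with_ratio <- cars %>%
  mutate(price_per_km = selling_price / km_driven)

print(names(cars))        # original columns only
print(names(with_ratio))  # original columns plus price_per_km
```

Assigning the result to the same name, as the lesson does with df, simply overwrites the old data frame with the extended one.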
Converting and Transforming Text-Based Numeric Data
In real-world datasets, numeric information is often stored as text combined with non-numeric characters. For example, engine power values might appear as "68 bhp", which must be cleaned and converted before analysis.
Base R
You can use gsub() to remove unwanted text and then apply as.numeric() to convert the result into numbers. After conversion, additional transformations can be performed, such as converting brake horsepower (bhp) into kilowatts.
df$max_power <- as.numeric(gsub(" bhp", "", df$max_power))
df$max_power_kw <- df$max_power * 0.7457 # convert to kilowatts
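Real columns rarely contain only well-formed values. The sketch below uses a hypothetical vector `raw_power` (the empty string and the "null bhp" entry are invented for illustration) to show that as.numeric() turns anything it cannot parse into NA, which is worth counting after the conversion:

```r
# Hypothetical raw values, including an empty string and an entry without a number.
raw_power <- c("68 bhp", "103.52 bhp", "", "null bhp")

# Remove the unit literally (fixed = TRUE disables regex matching), then convert.
# Entries that are not numbers become NA; suppressWarnings() hides the
# "NAs introduced by coercion" warning for this demonstration only.
cleaned   <- gsub(" bhp", "", raw_power, fixed = TRUE)
power_bhp <- suppressWarnings(as.numeric(cleaned))

# Convert brake horsepower to kilowatts with the lesson's factor.
power_kw <- power_bhp * 0.7457

print(data.frame(raw_power, power_bhp, power_kw))

# Number of values that could not be converted.
print(sum(is.na(power_bhp)))
```

In practice you would usually leave the warning visible, since it is the first hint that some values did not convert.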
dplyr
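The same process can be streamlined inside a mutate() call. You can combine text replacement, type conversion, and new column creation in a single step, which makes the code cleaner and easier to read. Because mutate() evaluates its arguments in order, max_power_kw is calculated from the already converted max_power.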
df <- df %>%
  mutate(
    max_power = as.numeric(gsub(" bhp", "", max_power)),
    max_power_kw = max_power * 0.7457
  )
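Overwriting max_power discards the original text. If you want to keep it for checking, one option is to copy it first within the same mutate() call. This is a sketch on hypothetical data, and `max_power_raw` is an illustrative column name, not one from the lesson's dataset:

```r
library(dplyr)

# Hypothetical sample data with power stored as text.
cars <- data.frame(
  max_power = c("68 bhp", "103.52 bhp"),
  stringsAsFactors = FALSE
)

cars <- cars %>%
  mutate(
    max_power_raw = max_power,                               # keep the original text
    max_power     = as.numeric(gsub(" bhp", "", max_power)), # overwrite with numbers
    max_power_kw  = max_power * 0.7457                       # uses the numeric column above
  )

str(cars)
```

The arguments are evaluated top to bottom, so the copy is taken before the column is overwritten and the kilowatt column sees the numeric values.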
Categorizing Data
You can create new categorical variables by grouping continuous values into meaningful categories. For example, cars can be classified into Low, Medium, or High price ranges based on their selling price.
Base R
You can do this with nested ifelse() calls. For each row, the conditions are checked in order, and the first one that is true determines the assigned value.
df$price_category <- ifelse(df$selling_price < 300000, "Low",
                            ifelse(df$selling_price < 700000, "Medium", "High"))
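When there are more than a few ranges, nesting becomes hard to read. Base R's cut() is a different technique that gives the same grouping here. The sketch below compares the two on a hypothetical vector `prices`; note that cut() returns a factor rather than a character vector, and that right = FALSE is needed so a price of exactly 300000 falls into "Medium", as it does with the < comparison:

```r
# Hypothetical prices, including both boundary values and a missing value.
prices <- c(150000, 300000, 699999, 700000, NA)

# The lesson's nested ifelse() approach.
nested <- ifelse(prices < 300000, "Low",
                 ifelse(prices < 700000, "Medium", "High"))

# The same ranges with cut(): intervals are [lower, upper) because right = FALSE.
binned <- cut(prices,
              breaks = c(-Inf, 300000, 700000, Inf),
              labels = c("Low", "Medium", "High"),
              right  = FALSE)

print(data.frame(prices, nested, binned))
```

Both approaches leave the missing price as NA.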
dplyr
You can use the case_when() function as a replacement for nested ifelse() calls. Each condition is written as condition ~ value, the conditions are checked from top to bottom, and the final TRUE acts as a catch-all for rows that matched nothing above it.
df <- df %>%
  mutate(price_category = case_when(
    selling_price < 300000 ~ "Low",
    selling_price < 700000 ~ "Medium",
    TRUE ~ "High"
  ))
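One difference from nested ifelse() is worth knowing: a missing selling_price makes both comparisons NA, so the row falls through to the TRUE catch-all and is labelled "High". The sketch below, on hypothetical data with illustrative column names, shows this and one way to keep missing prices missing:

```r
library(dplyr)

# Hypothetical sample data with one missing price.
cars <- data.frame(selling_price = c(150000, 500000, 900000, NA))

cars <- cars %>%
  mutate(
    # Catch-all only: the NA row falls through to "High".
    category_simple = case_when(
      selling_price < 300000 ~ "Low",
      selling_price < 700000 ~ "Medium",
      TRUE                   ~ "High"
    ),
    # Explicit NA branch first: the missing price stays missing.
    category_na_aware = case_when(
      is.na(selling_price)   ~ NA_character_,
      selling_price < 300000 ~ "Low",
      selling_price < 700000 ~ "Medium",
      TRUE                   ~ "High"
    )
  )

print(cars)
```

Which behaviour is appropriate depends on the dataset; if selling_price has no missing values, the two versions give identical results.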