Summarizing Data
Summarizing data is essential for getting a quick understanding of its structure and patterns.
Quick Summary of the Dataset
Before performing a detailed analysis, it is useful to generate a quick overview of the dataset. This helps you understand the ranges, distributions, and presence of categorical values at a glance. You can use the summary() function for this.
summary(df)
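To see what summary() reports for different column types, here is a minimal sketch on a small made-up data frame. The object name toy and all of its values are invented for illustration; only the column names mirror the ones used in this chapter.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  selling_price = c(450000, 370000, NA, 225000),
  fuel = factor(c("Diesel", "Petrol", "Petrol", "CNG"))
)

# Numeric columns are described by min, quartiles, median, mean, and max
# (plus a count of NA values); factor columns get a count per level
print(summary(toy))

# summary() on a single numeric vector returns a named object you can index
price_summary <- summary(toy$selling_price)
price_summary[["Median"]]
price_summary[["NA's"]]
```

Storing fuel as a factor is a deliberate choice here: for a plain character column, summary() only reports the length and type rather than counts per category.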
Summary Statistics for a Single Column
You can calculate basic descriptive statistics such as the mean, median, and standard deviation for individual columns. For example, here's how to summarize the selling_price column.
Base R
Base R provides the dedicated functions mean(), median(), and sd(). The argument na.rm = TRUE removes missing values (NA) before the calculation; without it, a single NA in the column makes the result NA.
mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
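The effect of na.rm is easiest to see on a short vector. The following is a small sketch with invented prices (the vector prices is hypothetical, not part of the dataset):

```r
# Hypothetical prices with one missing value
prices <- c(450000, 370000, NA, 225000)

mean(prices)                # NA: the missing value propagates into the result
mean(prices, na.rm = TRUE)  # mean of the three observed values only
sum(is.na(prices))          # number of values that na.rm = TRUE skips
```

Checking the count of missing values alongside the statistic shows how many observations the result is actually based on.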
dplyr
You can compute all three statistics in a single step with the summarise() function.
library(dplyr)  # provides %>% and summarise()

df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )
Summarizing Multiple Columns by Group
Often, you'll want to compare summary statistics across different groups in your dataset. For example, you might calculate the average selling price and average mileage for each type of fuel.
Before summarizing, make sure that the mileage column is numeric:
df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
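To check what this conversion does, it can help to run it on a few sample strings first. The sketch below assumes the raw values look like a number followed by a unit starting with " km"; the sample strings are invented, and the real column may contain other formats.

```r
# Hypothetical raw values in the assumed "<number> <unit>" style
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA)

# Remove everything from " km" onward, then convert the remainder to numeric
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))

clean_mileage              # numeric values; missing input stays NA
is.numeric(clean_mileage)  # confirms the type after conversion
sum(is.na(clean_mileage))  # how many values could not be used
```

Comparing the NA count before and after the conversion is a quick way to spot values that did not match the expected pattern.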
Base R
The aggregate() function can be used to compute grouped statistics. Wrapping the columns in cbind() on the left-hand side of the formula lets you summarize several numeric columns at once.
aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
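One detail worth knowing about the formula interface: its default na.action = na.omit drops every row that has a missing value in any of the listed columns before FUN is applied. The sketch below, on an invented data frame named toy, contrasts that default with na.action = na.pass, which keeps all rows so that na.rm = TRUE handles missing values column by column.

```r
# Hypothetical toy data; one Diesel car has no recorded mileage
toy <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 600000, 200000, 300000),
  mileage = c(20, NA, 18, 16)
)

# Default (na.action = na.omit): the row with missing mileage is dropped
# entirely, so its selling_price does not contribute to the Diesel mean
default_means <- aggregate(cbind(selling_price, mileage) ~ fuel,
                           data = toy, FUN = mean, na.rm = TRUE)

# na.pass keeps every row; na.rm = TRUE then skips NAs within each column
per_column_means <- aggregate(cbind(selling_price, mileage) ~ fuel,
                              data = toy, FUN = mean, na.rm = TRUE,
                              na.action = na.pass)

print(default_means)
print(per_column_means)
```

Which behavior is appropriate depends on the analysis: dropping incomplete rows keeps both means based on the same cars, while na.pass uses every available value per column.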
dplyr
Grouping and summarizing can also be done using group_by() and summarise(). This approach is generally more readable and easier to extend.
df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
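As an illustration of how this pattern extends, the sketch below adds a group size with n() and applies the same function to several columns with across(). It assumes dplyr 1.0 or later is installed; the data frame toy and the output names (n_cars, mean_selling_price, mean_mileage) are invented for this example.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  fuel = c("Diesel", "Petrol", "Petrol", "Diesel", "CNG"),
  selling_price = c(450000, 370000, 225000, NA, 300000),
  mileage = c(23.4, 17.8, NA, 21.1, 26.6)
)

by_fuel <- toy %>%
  group_by(fuel) %>%
  summarise(
    n_cars = n(),  # number of rows in each group
    # apply the same mean to both columns, naming results mean_<column>
    across(c(selling_price, mileage),
           ~ mean(.x, na.rm = TRUE),
           .names = "mean_{.col}"),
    .groups = "drop"  # return an ungrouped result
  )

print(by_fuel)
```

Reporting the group size next to the means makes it easier to judge how much weight each group's average deserves.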