Learn Summarizing Data | Data Manipulation and Cleaning
Data Analysis with R

Summarizing Data

Summarizing data is essential for getting a quick understanding of its structure and patterns.

Quick Summary of the Dataset

Before performing a detailed analysis, it is useful to generate a quick overview of the dataset. This helps you understand the ranges, distributions, and presence of categorical values at a glance. You can use the summary() function for this.

summary(df)
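As a rough illustration of what that output contains, the sketch below runs summary() on a small made-up data frame. The name toy_df and all of its values are hypothetical and are not taken from the course dataset. Numeric columns get the minimum, quartiles, median, mean, maximum, and a count of missing values; factor columns get a count per level.

```r
# Hypothetical toy data; column names mirror the lesson, values are invented
toy_df <- data.frame(
  selling_price = c(450000, 370000, NA, 225000),
  fuel = factor(c("Diesel", "Petrol", "Petrol", "CNG"))
)

# Numeric column: min, quartiles, median, mean, max, and NA count
# Factor column: number of rows per level
overview <- summary(toy_df)
print(overview)

# summary() also works on a single column and returns a named numeric object
price_summary <- summary(toy_df$selling_price)
print(names(price_summary))
```

Character columns are only reported by length and class, so converting them to factors first usually gives a more informative overview.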

Summary Statistics for a Single Column

You can calculate basic descriptive statistics such as the mean, median, and standard deviation for individual columns. For example, here's how to summarize the selling_price column.

Base R

Base R provides mean(), median(), and sd() for this. Setting na.rm = TRUE drops missing values before the calculation; without it, a single NA in the column makes the result NA.

mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
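To see what na.rm = TRUE changes, here is a minimal sketch on a short vector with one missing value. The vector prices and its values are invented for illustration.

```r
# Hypothetical prices with one missing value
prices <- c(450000, 370000, NA, 225000)

# Without na.rm, the missing value propagates and the result is NA
mean_with_na <- mean(prices)

# With na.rm = TRUE, only the three observed values are used
mean_price   <- mean(prices, na.rm = TRUE)
median_price <- median(prices, na.rm = TRUE)
sd_price     <- sd(prices, na.rm = TRUE)

print(c(mean = mean_price, median = median_price, sd = sd_price))
```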

dplyr

You can compute all three statistics in a single step with the summarise() function.

library(dplyr)  # provides %>% and summarise()

df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )

Summarizing Multiple Columns by Group

Often, you'll want to compare summary statistics across different groups in your dataset. For example, you might calculate the average selling price and average mileage for each type of fuel.

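Before summarizing, make sure that the mileage column is numeric. The code below removes everything from " km" onward (the unit suffix), converts what remains to a number, and then uses str() to confirm the new type: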

df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
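The conversion can be checked on a few sample strings before applying it to the whole column. The values in raw_mileage below are assumptions about how the text might look; the real column may be formatted differently.

```r
# Hypothetical raw values; the real column may be formatted differently
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA, "21.14 kmpl")

# Drop everything from " km" onward, then convert the remaining text
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))

str(clean_mileage)

# Count values that were missing or could not be parsed as numbers
n_missing <- sum(is.na(clean_mileage))
print(n_missing)
```

Any string that does not reduce to a plain number becomes NA (with a warning), so comparing the NA count before and after the conversion is a quick way to spot unexpected formats.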

Base R

The aggregate() function can be used to compute grouped statistics. The cbind() function allows summarizing multiple numeric columns at once.

aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
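One detail worth knowing: with the formula interface, aggregate() removes every row that has a missing value in any listed column before FUN is called (its default is na.action = na.omit). The sketch below, on invented data, contrasts that default with na.action = na.pass, which keeps all rows and leaves the handling of missing values to na.rm = TRUE.

```r
# Hypothetical toy data; values are invented for illustration
toy_df <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 600000, 300000, 500000),
  mileage = c(20, NA, 18, 16)
)

# Default: the second Diesel row is dropped entirely because mileage is NA,
# so its selling_price does not contribute to the Diesel mean either
by_formula <- aggregate(cbind(selling_price, mileage) ~ fuel,
                        data = toy_df, FUN = mean, na.rm = TRUE)

# na.pass keeps all rows; na.rm = TRUE then skips NAs column by column
by_column <- aggregate(cbind(selling_price, mileage) ~ fuel,
                       data = toy_df, FUN = mean, na.rm = TRUE,
                       na.action = na.pass)

print(by_formula)
print(by_column)
```

The second form matches what a per-column mean with na.rm = TRUE would give, which makes it easier to compare against a dplyr result.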

dplyr

Grouping and summarizing can also be done using group_by() and summarise(). This approach is generally more readable and easier to extend.

df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
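As a sketch of what "easier to extend" can look like, the grouped summary below adds a row count and a median alongside the means. It assumes the dplyr package is installed; toy_df and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration
toy_df <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 600000, 300000, 500000),
  mileage = c(20, NA, 18, 16)
)

by_fuel <- toy_df %>%
  group_by(fuel) %>%
  summarise(
    n_cars = n(),                                   # rows per group
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE),
    .groups = "drop"                                # return an ungrouped result
  )

print(by_fuel)
```

Each new statistic is one more named argument to summarise(), so the call grows line by line without changing its structure.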
