Summarizing Data | Data Manipulation and Cleaning

Summarizing data is essential for getting a quick understanding of its structure and patterns.

Quick Summary of the Dataset

Before performing a detailed analysis, generate a quick overview of the dataset with the summary() function. For each numeric column it reports the minimum, quartiles, median, mean, and maximum, plus the number of missing values if there are any. Factor columns get a count per level, while character columns show only their length and type.

summary(df)
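To see what this overview looks like without loading the full dataset, here is a minimal sketch on a toy data frame. The name df_demo and its values are hypothetical; only the column names follow the ones used in this chapter.

```r
# Hypothetical toy data, not the chapter's real dataset.
df_demo <- data.frame(
  selling_price = c(450000, 370000, NA, 225000),        # numeric, one missing value
  fuel = factor(c("Diesel", "Petrol", "Petrol", "Diesel"))  # categorical
)

# Numeric column: min, quartiles, median, mean, max, and an NA count.
# Factor column: number of rows per level.
overview <- summary(df_demo)
print(overview)
```

Storing fuel as a factor is what makes summary() count the levels; a plain character column would only show its length and class.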

Summary Statistics for a Single Column

You can calculate basic descriptive statistics such as the mean, median, and standard deviation for individual columns. For example, here's how to summarize the selling_price column.

Base R

There are dedicated functions like mean(), median(), and sd() at your disposal. The argument na.rm = TRUE ensures that missing values are ignored during calculation.

mean(df$selling_price, na.rm = TRUE)
median(df$selling_price, na.rm = TRUE)
sd(df$selling_price, na.rm = TRUE)
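The effect of na.rm = TRUE is easiest to see on a short vector. This is a small sketch with made-up prices, not values from the dataset:

```r
# Hypothetical prices with one missing value.
prices <- c(450000, 370000, NA, 225000)

mean_with_na <- mean(prices)                 # NA: a single missing value poisons the result
mean_no_na   <- mean(prices, na.rm = TRUE)   # 348333.3: mean of the three known values
median_no_na <- median(prices, na.rm = TRUE) # 370000: middle of the three known values

print(c(mean_with_na, mean_no_na, median_no_na))
```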

dplyr

You can compute all three statistics in a single step with the summarise() function.

library(dplyr)  # provides %>% and summarise()

df %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    median_price = median(selling_price, na.rm = TRUE),
    sd_price = sd(selling_price, na.rm = TRUE)
  )

Summarizing Multiple Columns by Group

Often, you'll want to compare summary statistics across different groups in your dataset. For example, you might calculate the average selling price and average mileage for each type of fuel.

Before summarizing, make sure that the mileage column is numeric:

df$mileage <- as.numeric(gsub(" km.*", "", df$mileage))
str(df$mileage)
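The conversion above removes everything from " km" onward and then parses what is left as a number. The sketch below runs the same pattern on a few hypothetical raw values (the unit suffixes shown are assumptions about how the column is stored) and counts how many entries ended up as NA, which is a cheap check that nothing was silently lost.

```r
# Hypothetical raw values; the real column may use other unit suffixes.
raw_mileage <- c("23.4 kmpl", "17.3 km/kg", NA)

# Drop the unit (" km" and anything after it), then convert to numeric.
clean_mileage <- as.numeric(gsub(" km.*", "", raw_mileage))

print(clean_mileage)             # 23.4 17.3 NA
n_missing <- sum(is.na(clean_mileage))
print(n_missing)                 # 1: only the value that was already missing
```

If n_missing is larger than the number of missing values in the raw column, some entries did not match the pattern and would need a closer look.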

Base R

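The aggregate() function computes grouped statistics. In its formula interface, the left side names the columns to summarize and the right side names the grouping variable. Wrapping several columns in cbind() summarizes them all in one call.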

aggregate(cbind(selling_price, mileage) ~ fuel, data = df, FUN = mean, na.rm = TRUE)
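One detail worth knowing: the formula interface of aggregate() uses na.action = na.omit by default, so any row with a missing value in one of the listed columns is dropped before FUN runs. The sketch below uses a hypothetical toy data frame (cars_demo and its values are made up) to show the difference from passing na.action = na.pass, which keeps such rows and leaves the NA handling to na.rm = TRUE.

```r
# Hypothetical data: the last Petrol car has no price but does have a mileage.
cars_demo <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 500000, 300000, NA),
  mileage = c(20, 24, 18, 16)
)

# Default: the incomplete row is removed entirely, so Petrol mileage is 18.
res_default <- aggregate(cbind(selling_price, mileage) ~ fuel,
                         data = cars_demo, FUN = mean, na.rm = TRUE)

# na.pass keeps the row; na.rm = TRUE then skips NAs per column,
# so Petrol mileage becomes (18 + 16) / 2 = 17.
res_pass <- aggregate(cbind(selling_price, mileage) ~ fuel,
                      data = cars_demo, FUN = mean, na.rm = TRUE,
                      na.action = na.pass)

print(res_default)
print(res_pass)
```

Which behavior is preferable depends on the analysis; the second form matches what a per-column mean(..., na.rm = TRUE) would return.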

dplyr

Grouping and summarizing can also be done using group_by() and summarise(). This approach is generally more readable and easier to extend.

df %>%
  group_by(fuel) %>%
  summarise(
    mean_price = mean(selling_price, na.rm = TRUE),
    mean_mileage = mean(mileage, na.rm = TRUE)
  )
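As one illustration of how this pattern extends, the sketch below applies the same function to several columns with across() and adds a row count per group. It assumes dplyr 1.0.0 or later; cars_demo and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing price.
cars_demo <- data.frame(
  fuel = c("Diesel", "Diesel", "Petrol", "Petrol"),
  selling_price = c(400000, 500000, 300000, NA),
  mileage = c(20, 24, 18, 16)
)

by_fuel <- cars_demo %>%
  group_by(fuel) %>%
  summarise(
    # Same mean for every listed column; .names controls the output names.
    across(c(selling_price, mileage),
           ~ mean(.x, na.rm = TRUE),
           .names = "mean_{.col}"),
    n_cars = n()  # number of rows in each group
  )

print(by_fuel)
```

Adding another statistic or another column then only means changing the arguments to across(), rather than writing one line per column.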

Section 1. Chapter 11
