Learn Selecting and Filtering Data | Data Preparation and Cleaning
R for Data Scientists

Selecting and Filtering Data

When working with real-world datasets, you are often faced with more information than you actually need. Datasets can include dozens or even hundreds of columns, and many rows may not be relevant to your analysis goals. Focusing on the variables and observations that matter most allows you to streamline your workflow, improve performance, and make your results clearer and more reliable. Selecting only the necessary columns and filtering for specific rows are essential steps in preparing your data for meaningful analysis.

library(dplyr)

# Sample data frame
data <- tibble::tibble(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  city = c("New York", "Los Angeles", "Chicago", "Houston"),
  score = c(88, 92, 95, 85)
)

# Select only the name and score columns, and filter for rows where score > 90
filtered_data <- data %>%
  select(name, score) %>%
  filter(score > 90)

print(as.data.frame(filtered_data))

In the code above, select() first keeps only the columns you want, in this case name and score. This reduces clutter and keeps the data focused on the variables of interest. Next, filter() keeps only the rows where score is greater than 90. The order of these operations matters: filter() can only reference columns that are still present, so a condition on a column that select() has already dropped (such as age here) fails. In this example either order gives the same result, because score is one of the selected columns. Logical conditions in filter() use comparison operators such as >, <, >=, <=, ==, and !=, and you must reference column names exactly as they appear in your data.
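As a rough sketch of how operation order interacts with the columns that are available, the example below filters on age, a column that is not kept in the final result. It reuses the lesson's sample data; the names over_30 and reversed are illustrative only, and tryCatch() is used here simply so the failing pipeline does not stop the script.

```r
library(dplyr)

# Same sample data as in the lesson
data <- tibble::tibble(
  name = c("Alice", "Bob", "Charlie", "David"),
  age = c(25, 30, 35, 40),
  city = c("New York", "Los Angeles", "Chicago", "Houston"),
  score = c(88, 92, 95, 85)
)

# Filter on age while the column still exists, then drop it.
# Conditions separated by commas must all be TRUE for a row to be kept.
over_30 <- data %>%
  filter(age >= 30, score != 85) %>%
  select(name, score)

# Reversing the order fails: select() has already removed age,
# so filter() can no longer find it.
reversed <- tryCatch(
  data %>% select(name, score) %>% filter(age >= 30),
  error = function(e) "error: age is no longer available"
)

print(as.data.frame(over_30))
print(reversed)
```

When a condition depends on a column you do not need in the output, filtering before selecting is the safer order.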

Note

Be careful not to confuse = with == when writing logical conditions inside filter(). Use == to test for equality (for example, filter(city == "Chicago")). A single = is read as a named argument rather than a comparison, and filter() stops with an error instead of filtering. Also double-check your column names for typos: select() and filter() match names exactly, including upper and lower case, so a misspelled column name will usually cause an error.
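The two pitfalls in the note can be reproduced in a few lines. This is a minimal sketch assuming a recent dplyr version; the variable names (chicago, wrong_operator, wrong_name) are made up for illustration, and tryCatch() only captures the errors so the script keeps running.

```r
library(dplyr)

# Reduced version of the lesson's sample data
data <- tibble::tibble(
  name = c("Alice", "Bob", "Charlie", "David"),
  city = c("New York", "Los Angeles", "Chicago", "Houston")
)

# Correct: == compares each row's city with "Chicago"
chicago <- data %>% filter(city == "Chicago")

# Mistake: = is parsed as a named argument, which filter() rejects
wrong_operator <- tryCatch(
  data %>% filter(city = "Chicago"),
  error = function(e) "error"
)

# Mistake: "City" does not match the column "city" (names are case-sensitive)
wrong_name <- tryCatch(
  data %>% select(name, City),
  error = function(e) "error"
)

print(as.data.frame(chicago))
print(wrong_operator)
print(wrong_name)
```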

What is the main difference between select() and filter() in dplyr?


Section 1. Chapter 2
