Challenge: Flag Duplicate Entries
In many real-world data cleaning scenarios, you may want to flag duplicate entries rather than remove them right away. Flagging gives you the flexibility to review duplicates, analyze their patterns, and make informed decisions about which ones to keep or discard. For instance, in customer databases, you may want to investigate why duplicates occur before deletion, or in transactional data, you might need to audit the records before any removal. By marking duplicates, you can also generate reports, track data quality issues, and collaborate with others on resolution strategies without losing potentially valuable information.
123456789import pandas as pd data = { "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"], "age": [25, 30, 25, 35, 30] } df = pd.DataFrame(data) print(df)
Swipe to start coding
Write a function that adds a new column called is_duplicate to the DataFrame. Each row in this column should be True if the row is a duplicate of a previous row (based on all columns), and False otherwise. The function must return the modified DataFrame.
Løsning
Tak for dine kommentarer!
single
Spørg AI
Spørg AI
Spørg om hvad som helst eller prøv et af de foreslåede spørgsmål for at starte vores chat
Awesome!
Completion rate improved to 5.56
Challenge: Flag Duplicate Entries
Stryg for at vise menuen
In many real-world data cleaning scenarios, you may want to flag duplicate entries rather than remove them right away. Flagging gives you the flexibility to review duplicates, analyze their patterns, and make informed decisions about which ones to keep or discard. For instance, in customer databases, you may want to investigate why duplicates occur before deletion, or in transactional data, you might need to audit the records before any removal. By marking duplicates, you can also generate reports, track data quality issues, and collaborate with others on resolution strategies without losing potentially valuable information.
123456789import pandas as pd data = { "name": ["Alice", "Bob", "Alice", "Charlie", "Bob"], "age": [25, 30, 25, 35, 30] } df = pd.DataFrame(data) print(df)
Swipe to start coding
Write a function that adds a new column called is_duplicate to the DataFrame. Each row in this column should be True if the row is a duplicate of a previous row (based on all columns), and False otherwise. The function must return the modified DataFrame.
Løsning
Tak for dine kommentarer!
single