Data Collection and Cleaning for Journalists
Python for Journalists and Media

Challenge: Clean a List of News Sources

Clean, reliable data is critical for journalists who want to build trustworthy media databases. When working with lists of news sources, data often arrives in a messy state: duplicate entries can inflate counts, missing website links can leave gaps in research, and inconsistent capitalization can make automated analysis difficult. Ensuring your data is clean not only saves time but also prevents errors in your reporting.
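
Before cleaning anything, it can help to measure how widespread each of these problems is. The snippet below is a minimal sketch, not part of the original lesson: the sample data and the variable names are hypothetical, and it assumes the list is held in a pandas DataFrame with `Name` and `Website` columns.

```python
import pandas as pd

# Hypothetical sample showing the three problems described above
sources = pd.DataFrame({
    "Name": ["the daily news", "The Daily News", "Global Times", "global times"],
    "Website": ["www.dailynews.com", None, "www.globaltimes.com", "www.globaltimes.com"],
})

# Rows that are identical to an earlier row in every column
exact_duplicates = int(sources.duplicated().sum())

# Rows with no website recorded
missing_websites = int(sources["Website"].isna().sum())

# Names that differ from another name only by capitalization
distinct_raw = sources["Name"].nunique()
distinct_normalized = sources["Name"].str.lower().nunique()
case_variants = distinct_raw - distinct_normalized

print("Exact duplicate rows:", exact_duplicates)
print("Missing websites:", missing_websites)
print("Names differing only by case:", case_variants)
```

In this sample there are no exact duplicate rows at all, even though two outlets appear twice; the repeats only become visible once capitalization is ignored.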

import pandas as pd

# Example: messy news sources data
data = {
    "Name": [
        "the daily news", "The Daily News", "Global Times",
        "global times", "Metro Herald", "Metro herald",
        "Metro Herald", "The Observer", "The Observer"
    ],
    "Website": [
        "www.dailynews.com", None, "www.globaltimes.com",
        "www.globaltimes.com", "www.metroherald.com", None,
        None, "www.observer.com", None
    ]
}

df = pd.DataFrame(data)

# Standardize news source names to title case first, so that variants
# such as "global times" and "Global Times" compare as equal
df["Name"] = df["Name"].str.title()

# Fill missing website URLs with 'Unknown'
df["Website"] = df["Website"].fillna("Unknown")

# Remove duplicate rows (rows that match in both columns)
df = df.drop_duplicates()

# Output the cleaned DataFrame
print(df)

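Cleaning your data in this way makes your media analysis more reliable. Removing duplicate rows keeps identical records from being counted twice. Note that drop_duplicates() compares entire rows, so it only catches name variants such as "global times" and "Global Times" once the names have been standardized, and a source listed once with its URL and once without still occupies two rows. Filling in missing website URLs with a placeholder like "Unknown" lets you spot gaps without breaking your workflow. Standardizing name capitalization avoids mismatches and makes grouping or filtering sources much easier. Clean data leads to more accurate reporting and helps maintain the credibility of your findings.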
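
To illustrate how the placeholder and the standardized names pay off afterwards, here is a short sketch. The data and variable names are hypothetical; it only assumes a DataFrame with title-cased `Name` values and "Unknown" in place of missing `Website` values.

```python
import pandas as pd

# Hypothetical cleaned data: title-cased names, 'Unknown' for missing URLs
cleaned = pd.DataFrame({
    "Name": ["The Daily News", "The Daily News", "Global Times",
             "Metro Herald", "Metro Herald"],
    "Website": ["www.dailynews.com", "Unknown", "www.globaltimes.com",
                "www.metroherald.com", "Unknown"],
})

# Rows whose website is still a gap to research
gaps = cleaned[cleaned["Website"] == "Unknown"]

# Number of distinct sources, counting each name once
source_count = cleaned["Name"].nunique()

# Known websites per source, with placeholder rows excluded
known = cleaned[cleaned["Website"] != "Unknown"]
websites_per_source = known.groupby("Name")["Website"].nunique()

print(gaps)
print("Distinct sources:", source_count)
print(websites_per_source)
```

Counting distinct names rather than rows avoids overcounting a source that appears both with and without a website.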

Task


Write a function that takes a DataFrame with news source names and website URLs, and returns a cleaned DataFrame by:

  • Removing duplicate rows;
  • Filling missing website URLs with 'Unknown';
  • Capitalizing each word in the news source names.
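
One possible shape for such a function is sketched below. It is not the lesson's reference solution: the name `clean_news_sources` is hypothetical, the column names `Name` and `Website` are assumed from the example data, and the choice to standardize names before removing duplicates is an assumption made so that case variants collapse into one row. Applying the steps in the listed order instead can leave such variants in place.

```python
import pandas as pd


def clean_news_sources(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a DataFrame with 'Name' and 'Website' columns."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched

    # Capitalize each word in the news source names
    cleaned["Name"] = cleaned["Name"].str.title()

    # Fill missing website URLs with 'Unknown'
    cleaned["Website"] = cleaned["Website"].fillna("Unknown")

    # Remove rows that are identical in every column, then renumber the index
    cleaned = cleaned.drop_duplicates().reset_index(drop=True)
    return cleaned


# Usage with a small hypothetical sample
messy = pd.DataFrame({
    "Name": ["global times", "Global Times", "Metro herald", "Metro Herald"],
    "Website": ["www.globaltimes.com", "www.globaltimes.com", None, None],
})
result = clean_news_sources(messy)
print(result)
```

Working on a copy keeps the original DataFrame available for comparison, which is useful when you need to document what the cleaning step changed.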

Section 1. Chapter 5