Challenge: Clean a List of News Sources | Data Collection and Cleaning for Journalists
Python for Journalists and Media

Challenge: Clean a List of News Sources

Clean, reliable data is critical for journalists who want to build trustworthy media databases. When working with lists of news sources, data often arrives in a messy state: duplicate entries can inflate counts, missing website links can leave gaps in research, and inconsistent capitalization can make automated analysis difficult. Ensuring your data is clean not only saves time but also prevents errors in your reporting.
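
Before cleaning anything, a quick audit can put numbers on the three problems described above. The following is a minimal sketch, not part of the original lesson: the sample rows are hypothetical, and the column names "Name" and "Website" are assumed to match the ones used in this chapter's example.

```python
import pandas as pd

# Hypothetical sample data; "Name" and "Website" are assumed column names.
sources = pd.DataFrame({
    "Name": ["City Post", "city post", "City Post", "Harbor Weekly"],
    "Website": [
        "www.citypost.example",
        "www.citypost.example",
        "www.citypost.example",
        None,
    ],
})

# Rows that exactly repeat an earlier row (same name AND same website).
exact_duplicates = int(sources.duplicated().sum())

# Rows with no website recorded.
missing_websites = int(sources["Website"].isna().sum())

# Distinct spellings versus distinct names once letter case is ignored.
raw_names = sources["Name"].nunique()
case_insensitive_names = sources["Name"].str.lower().nunique()

print(f"Exact duplicate rows: {exact_duplicates}")
print(f"Missing websites: {missing_websites}")
print(f"Distinct names as written: {raw_names}")
print(f"Distinct names ignoring case: {case_insensitive_names}")
```

A gap between the last two numbers suggests that some sources are spelled with inconsistent capitalization.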

    import pandas as pd

    # Example: Messy news sources data
    data = {
        "Name": [
            "the daily news", "The Daily News", "Global Times",
            "global times", "Metro Herald", "Metro herald",
            "Metro Herald", "The Observer", "The Observer"
        ],
        "Website": [
            "www.dailynews.com", None, "www.globaltimes.com",
            "www.globaltimes.com", "www.metroherald.com", None,
            None, "www.observer.com", None
        ]
    }

    df = pd.DataFrame(data)

    # Standardize news source names to title case first, so that case
    # variants such as "global times" and "Global Times" become identical
    df["Name"] = df["Name"].str.title()

    # Fill missing website URLs with 'Unknown'
    df["Website"] = df["Website"].fillna("Unknown")

    # Remove duplicate rows based on both columns
    df = df.drop_duplicates()

    # Output the cleaned DataFrame
    print(df)

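Cleaning your data in this way makes your media analysis more reliable. Removing duplicates keeps identical rows from being counted twice. Because drop_duplicates() compares both columns by default, a source listed once with a website and once without still appears on two rows. Filling in missing website URLs with a placeholder like "Unknown" lets you spot gaps without breaking your workflow. Standardizing name capitalization avoids mismatches and makes grouping or filtering sources much easier; pandas treats names that differ only in capitalization as different values, so names need to be standardized before duplicates are dropped. Clean data leads to more accurate reporting and helps maintain the credibility of your findings.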
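
Spotting gaps and grouping by name might look like the sketch below. It is an illustration under stated assumptions rather than part of the lesson: the sample rows are hypothetical, the names are assumed to be title-cased already, and the rule "prefer a known website over the placeholder" is one possible way to reduce the table to a single row per source.

```python
import pandas as pd

# Hypothetical cleaned data: names title-cased, gaps filled with "Unknown".
cleaned = pd.DataFrame({
    "Name": ["The Daily News", "The Daily News", "Metro Herald", "City Post"],
    "Website": ["www.dailynews.com", "Unknown", "www.metroherald.com", "Unknown"],
})

# Spot the gaps: sources with at least one row lacking a website.
gaps = cleaned.loc[cleaned["Website"] == "Unknown", "Name"].unique()

# Count rows per source; a count above 1 means the name is still repeated.
rows_per_source = cleaned.groupby("Name").size()

# Keep one row per name, preferring a known website over the placeholder.
# Sorting on the boolean puts False (known website) before True ("Unknown").
one_per_source = (
    cleaned.assign(is_unknown=cleaned["Website"].eq("Unknown"))
    .sort_values(["Name", "is_unknown"])
    .drop_duplicates(subset="Name", keep="first")
    .drop(columns="is_unknown")
    .reset_index(drop=True)
)

print(gaps)
print(rows_per_source)
print(one_per_source)
```

Whether collapsing rows like this is appropriate depends on the dataset: two outlets can legitimately share a name, so it is worth reviewing the repeated names before dropping any of them.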

Task

Write a function that takes a DataFrame with news source names and website URLs, and returns a cleaned DataFrame by:

  • Removing duplicate rows;
  • Filling missing website URLs with 'Unknown';
  • Capitalizing each word in the news source names.

Solution
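
The page does not include the course's own answer, so what follows is only one possible sketch of the function described in the task. The function name clean_news_sources and the column names "Name" and "Website" are assumptions; the challenge may expect different ones. Names are capitalized before duplicates are removed, so that rows differing only in letter case collapse into one.

```python
import pandas as pd


def clean_news_sources(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the columns "Name" and "Website" are assumed."""
    cleaned = df.copy()
    # Capitalize each word first so case variants become identical rows.
    cleaned["Name"] = cleaned["Name"].str.title()
    # Replace missing website URLs with the placeholder.
    cleaned["Website"] = cleaned["Website"].fillna("Unknown")
    # Drop rows that are now exact repeats and renumber the index.
    return cleaned.drop_duplicates().reset_index(drop=True)


# Hypothetical input for a quick check.
messy = pd.DataFrame({
    "Name": ["global times", "Global Times", "metro herald"],
    "Website": ["www.globaltimes.com", "www.globaltimes.com", None],
})
result = clean_news_sources(messy)
print(result)
```

The function works on a copy, so the DataFrame passed in is left unchanged. One caveat with str.title(): it lowercases every letter after the first in each word, so an acronym such as "BBC News" becomes "Bbc News" and may need to be corrected by hand.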

Section 1. Chapter 5