Learn Challenge: Clean a List of News Sources | Data Collection and Cleaning for Journalists
Python for Journalists and Media

Challenge: Clean a List of News Sources

Clean, reliable data is critical for journalists who want to build trustworthy media databases. When working with lists of news sources, data often arrives in a messy state: duplicate entries can inflate counts, missing website links can leave gaps in research, and inconsistent capitalization can make automated analysis difficult. Ensuring your data is clean not only saves time but also prevents errors in your reporting.

import pandas as pd

# Example: messy news sources data
data = {
    "Name": [
        "the daily news", "The Daily News", "Global Times",
        "global times", "Metro Herald", "Metro herald",
        "Metro Herald", "The Observer", "The Observer"
    ],
    "Website": [
        "www.dailynews.com", None, "www.globaltimes.com",
        "www.globaltimes.com", "www.metroherald.com", None,
        None, "www.observer.com", None
    ]
}

df = pd.DataFrame(data)

# Standardize news source names to title case
df["Name"] = df["Name"].str.title()

# Fill missing website URLs with 'Unknown'
df["Website"] = df["Website"].fillna("Unknown")

# Remove duplicate rows based on both columns
# (done last, so rows that differed only in capitalization are caught)
df = df.drop_duplicates()

# Output the cleaned DataFrame
print(df)

Cleaning your data in this way makes your media analysis more reliable. Removing duplicate rows ensures that identical entries are counted only once. Because the comparison is exact, names should be standardized before duplicates are dropped, so that variants such as "Metro herald" and "Metro Herald" are recognized as the same source. Filling in missing website URLs with a placeholder like "Unknown" lets you spot gaps without breaking your workflow. Standardizing name capitalization avoids mismatches and makes grouping or filtering sources much easier. Clean data leads to more accurate reporting and helps maintain the credibility of your findings.
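As a rough illustration of how the placeholder and consistent capitalization help, the sketch below filters for sources without a website and counts entries per source. The `sources` DataFrame and its values are made up for this example; only the column names follow the example data above.

```python
import pandas as pd

# Hypothetical cleaned data, using the same column names as the example
sources = pd.DataFrame({
    "Name": ["The Daily News", "The Daily News", "Global Times", "Metro Herald"],
    "Website": ["www.dailynews.com", "Unknown", "www.globaltimes.com", "Unknown"],
})

# Rows still missing a website are easy to find via the placeholder
missing = sources[sources["Website"] == "Unknown"]
print(missing["Name"].tolist())

# With consistent capitalization, grouping by name yields one group per source
entries_per_source = sources.groupby("Name").size()
print(entries_per_source)
```

Here "The Daily News" appears twice because one entry has a URL and the other does not. Deciding which of those rows to keep is a separate step that exact duplicate removal does not cover.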

Task

Write a function that takes a DataFrame with news source names and website URLs, and returns a cleaned DataFrame by:

  • Removing duplicate rows;
  • Filling missing website URLs with 'Unknown';
  • Capitalizing each word in the news source names.
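One possible shape for such a function is sketched below. The function name `clean_news_sources` and the column names `Name` and `Website` are assumptions carried over from the example data; the task itself does not fix them. The sketch standardizes names and fills missing websites before dropping duplicates, so that rows differing only in capitalization count as the same entry. It is not the course's reference solution.

```python
import pandas as pd


def clean_news_sources(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a news-source table (hypothetical helper)."""
    cleaned = df.copy()  # leave the caller's DataFrame untouched
    # Capitalize each word in the source names
    cleaned["Name"] = cleaned["Name"].str.title()
    # Replace missing website URLs with a visible placeholder
    cleaned["Website"] = cleaned["Website"].fillna("Unknown")
    # Drop rows that are now identical in every column, then renumber the index
    return cleaned.drop_duplicates().reset_index(drop=True)


# Small made-up input to try the function on
messy = pd.DataFrame({
    "Name": ["global times", "Global Times", "Metro herald"],
    "Website": ["www.globaltimes.com", "www.globaltimes.com", None],
})
result = clean_news_sources(messy)
print(result)
```

Note that `str.title()` also lowercases the rest of each word, so a name such as "BBC News" would become "Bbc News". Names containing acronyms may need separate handling.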

Section 1. Chapter 5