Challenge: Clean a List of News Sources
Clean, reliable data is critical for journalists who want to build trustworthy media databases. When working with lists of news sources, data often arrives in a messy state: duplicate entries can inflate counts, missing website links can leave gaps in research, and inconsistent capitalization can make automated analysis difficult. Ensuring your data is clean not only saves time but also prevents errors in your reporting.
    import pandas as pd

    # Example: Messy news sources data
    data = {
        "Name": [
            "the daily news", "The Daily News", "Global Times", "global times",
            "Metro Herald", "Metro herald", "Metro Herald", "The Observer", "The Observer"
        ],
        "Website": [
            "www.dailynews.com", None, "www.globaltimes.com", "www.globaltimes.com",
            "www.metroherald.com", None, None, "www.observer.com", None
        ]
    }

    df = pd.DataFrame(data)

    # Remove duplicate rows based on both columns
    df = df.drop_duplicates()

    # Fill missing website URLs with 'Unknown'
    df["Website"] = df["Website"].fillna("Unknown")

    # Standardize news source names to title case
    df["Name"] = df["Name"].str.title()

    # Output the cleaned DataFrame
    print(df)
Cleaning your data in this way makes your media analysis more reliable. Removing duplicates keeps a source from being counted more than once. Keep in mind that drop_duplicates() only drops rows that match exactly in every column, so entries that differ in capitalization or in their Website value stay as separate rows until the names are standardized. Filling in missing website URLs with a placeholder like "Unknown" lets you spot gaps without breaking your workflow. Standardizing name capitalization avoids mismatches and makes grouping or filtering sources much easier. Clean data leads to more accurate reporting and helps maintain the credibility of your findings.
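To see why the order of the cleaning steps matters, here is a small sketch on a hypothetical three-row sample (not the full dataset above). It compares dropping duplicates before title-casing the names with dropping them afterwards. The variable names dedupe_first and title_first are only illustrative.

```python
import pandas as pd

# Hypothetical three-row sample, assumed for illustration only
df = pd.DataFrame({
    "Name": ["Global Times", "global times", "Metro Herald"],
    "Website": ["www.globaltimes.com", "www.globaltimes.com", None],
})

# Deduplicate first: the two Global Times rows differ in case,
# so drop_duplicates() treats them as different and keeps both
dedupe_first = df.drop_duplicates()
dedupe_first = dedupe_first.assign(Name=dedupe_first["Name"].str.title())

# Standardize first: both rows become "Global Times" with the same
# website, so drop_duplicates() now removes one of them
title_first = df.assign(Name=df["Name"].str.title()).drop_duplicates()

print(len(dedupe_first))  # 3
print(len(title_first))   # 2
```

If your goal is one row per source, standardizing names before deduplicating is usually the safer order. Whether rows with the same name but different Website values should also be merged is a separate editorial decision.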
Write a function that takes a DataFrame with news source names and website URLs, and returns a cleaned DataFrame by:
- Removing duplicate rows;
- Filling missing website URLs with 'Unknown';
- Capitalizing each word in the news source names.
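One possible shape for such a function is sketched below. It applies the three steps in the order listed. The function name clean_news_sources and the column labels "Name" and "Website" are assumptions taken from the example data, not requirements stated by the task.

```python
import pandas as pd


def clean_news_sources(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of df (assumes 'Name' and 'Website' columns)."""
    # Drop rows that are exact duplicates; copy so the input is not modified
    cleaned = df.drop_duplicates().copy()
    # Replace missing website URLs with a visible placeholder
    cleaned["Website"] = cleaned["Website"].fillna("Unknown")
    # Capitalize each word in the source name
    cleaned["Name"] = cleaned["Name"].str.title()
    return cleaned


# Hypothetical sample input for a quick check
sample = pd.DataFrame({
    "Name": ["the observer", "the observer", "metro herald"],
    "Website": ["www.observer.com", "www.observer.com", None],
})
result = clean_news_sources(sample)
print(result)
```

Working on a copy leaves the caller's DataFrame unchanged, which makes it easier to compare the raw and cleaned versions side by side.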