Learn Cleaning Messy Data | Data Collection and Cleaning for Journalists
Python for Journalists and Media

Cleaning Messy Data

In journalism, data rarely arrives in a perfect, ready-to-use format. You will often encounter messy datasets filled with missing values, inconsistent formatting, and duplicate entries. These issues can distort your findings or even lead to incorrect conclusions if not handled properly. For example, a dataset of press releases from various organizations might include several versions of the same release, entries missing key information such as dates or authors, or inconsistent capitalization and date formats. Understanding how to identify and fix these problems is essential for producing reliable, accurate stories.
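Before fixing anything, it can help to measure how messy a dataset is. The following is a minimal sketch, assuming a small hypothetical table of press releases (the `releases` name and its rows are invented for illustration); it counts missing values per column and rows that exactly repeat an earlier row.

```python
import pandas as pd

# Hypothetical sample: one repeated row and two missing values
releases = pd.DataFrame({
    "title": [
        "Mayor Announces New Park",
        "Mayor Announces New Park",
        None,
        "Library Opens New Branch",
    ],
    "date": ["2024-06-01", "2024-06-01", "2024-06-02", None],
})

# Count missing values in each column
missing_per_column = releases.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_count = int(releases.duplicated().sum())

print("Missing values per column:\n", missing_per_column)
print("Exact duplicate rows:", duplicate_count)
```

Recording these counts before and after cleaning gives you a simple way to document what was changed in the data behind a story.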

    import pandas as pd

    # Sample dataset of press releases with duplicate entries
    data = {
        "title": [
            "Mayor Announces New Park",
            "Mayor Announces New Park",
            "City Council Approves Budget",
            "City Council Approves Budget",
            "City Council Approves Budget",
            "Library Opens New Branch"
        ],
        "date": [
            "2024-06-01",
            "2024-06-01",
            "2024-06-02",
            "2024-06-02",
            "2024-06-02",
            "2024-06-03"
        ]
    }

    df = pd.DataFrame(data)

    # Identify duplicate rows
    duplicates = df.duplicated()
    print("Duplicate rows:\n", df[duplicates])

    # Remove duplicate rows
    df_cleaned = df.drop_duplicates()
    print("\nData after removing duplicates:\n", df_cleaned)

Removing duplicates is crucial for accurate reporting because duplicate entries can inflate counts, misrepresent trends, or cause you to report the same event multiple times. In the code above, duplicated() flags every row that exactly repeats an earlier row, and drop_duplicates() keeps the first copy and removes the rest. Each press release is then counted only once, leading to more trustworthy analysis and storytelling. Rows that differ in any way, such as capitalization or date format, are not treated as duplicates.
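Duplicates that differ only in formatting, such as the inconsistent capitalization and date formats mentioned earlier, need normalizing before they can be matched. The sketch below is one possible approach, not the only one: the sample rows and the helper columns `title_key` and `date_parsed` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample: the same release entered twice with different
# capitalization, a trailing space, and two date formats
raw = pd.DataFrame({
    "title": [
        "Mayor Announces New Park",
        "mayor announces new park ",
        "Library Opens New Branch",
    ],
    "date": ["2024-06-01", "June 1, 2024", "2024-06-03"],
})

normalized = raw.copy()

# Comparison key: trimmed, lowercase title
normalized["title_key"] = normalized["title"].str.strip().str.lower()

# Parse each date string on its own so mixed formats are accepted;
# an unparseable string would raise an error here
normalized["date_parsed"] = normalized["date"].map(pd.Timestamp)

# Compare rows on the normalized columns only, keeping the first copy
deduplicated = normalized.drop_duplicates(
    subset=["title_key", "date_parsed"], keep="first"
)

print("Rows kept by plain drop_duplicates():", len(raw.drop_duplicates()))
print("Rows kept after normalizing:", len(deduplicated))
```

The original `title` and `date` columns are left untouched, so you can still quote the release as it was published while matching on the cleaned values.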

    import pandas as pd

    # Sample dataset with missing values
    data = {
        "title": [
            "Mayor Announces New Park",
            None,
            "City Council Approves Budget",
            "Library Opens New Branch"
        ],
        "date": [
            "2024-06-01",
            "2024-06-02",
            None,
            "2024-06-03"
        ]
    }

    df = pd.DataFrame(data)

    # Fill missing values with a default value
    df_filled = df.fillna("Unknown")
    print("Data with missing values filled:\n", df_filled)

    # Alternatively, drop rows with any missing values
    df_dropped = df.dropna()
    print("\nData with incomplete rows dropped:\n", df_dropped)
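Filling every gap with the same value, or dropping every incomplete row, is not always what a story needs. As a hedged sketch of a more selective approach, the example below fills only the title column and drops rows only when the date is missing. The placeholder text "Untitled release" and the rule that a date is required are assumptions chosen for illustration.

```python
import pandas as pd

# Same shape of data as the example above: one missing title, one missing date
records = pd.DataFrame({
    "title": [
        "Mayor Announces New Park",
        None,
        "City Council Approves Budget",
        "Library Opens New Branch",
    ],
    "date": ["2024-06-01", "2024-06-02", None, "2024-06-03"],
})

# Fill only the title column; missing dates stay missing
labeled = records.fillna({"title": "Untitled release"})

# Drop a row only when its date is missing
dated_only = labeled.dropna(subset=["date"])

print("Titles filled, dates untouched:\n", labeled)
print("\nRows with a known date:\n", dated_only)
```

Passing a dictionary to `fillna` and a `subset` to `dropna` keeps each decision tied to a specific column, which makes the cleaning steps easier to explain to an editor or reader.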

1. What is a common issue when working with real-world data?

2. Which pandas function removes duplicate rows from a DataFrame?

3. Fill in the blank: To fill missing values in a DataFrame, use _ _ _ _.


Section 1. Chapter 4
