Learn Cleaning Messy Data | Data Collection and Cleaning for Journalists
Python for Journalists and Media

Cleaning Messy Data

In journalism, data rarely arrives in a perfect, ready-to-use format. You will often encounter messy datasets filled with missing values, inconsistent formatting, and duplicate entries. These issues can distort your findings or even lead to incorrect conclusions if not handled properly. For example, a dataset of press releases from various organizations might include several versions of the same release, entries missing key information such as dates or authors, or inconsistent capitalization and date formats. Understanding how to identify and fix these problems is essential for producing reliable, accurate stories.
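The inconsistent capitalization and date formats mentioned above can be normalized before any counting or comparison. The following is a minimal sketch, not a definitive recipe: the sample rows, the `title_clean` and `date_clean` column names, the `parse_date` helper, and the list of accepted date formats are all assumptions made for illustration, and a real dataset may need a different list.

```python
from datetime import datetime

import pandas as pd

# Hypothetical sample: the same kind of release typed with different
# casing, stray whitespace, and three different date styles
df = pd.DataFrame({
    "title": [
        "mayor announces new park",
        "  Mayor Announces New Park ",
        "LIBRARY OPENS NEW BRANCH",
    ],
    "date": ["2024-06-01", "06/01/2024", "2024/06/03"],
})

# Assumed set of formats present in this sample (month-first for slashes)
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d"]


def parse_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Trim whitespace and apply one capitalization style to every title
df["title_clean"] = df["title"].str.strip().str.title()

# Convert every date string to a single ISO format
df["date_clean"] = df["date"].apply(parse_date)

print(df[["title_clean", "date_clean"]])
```

Writing the cleaned values to new columns keeps the original text available, which makes it easier to check each change against the source document.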

import pandas as pd

# Sample dataset of press releases with duplicate entries
data = {
    "title": [
        "Mayor Announces New Park",
        "Mayor Announces New Park",
        "City Council Approves Budget",
        "City Council Approves Budget",
        "City Council Approves Budget",
        "Library Opens New Branch"
    ],
    "date": [
        "2024-06-01",
        "2024-06-01",
        "2024-06-02",
        "2024-06-02",
        "2024-06-02",
        "2024-06-03"
    ]
}

df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()
print("Duplicate rows:\n", df[duplicates])

# Remove duplicate rows
df_cleaned = df.drop_duplicates()
print("\nData after removing duplicates:\n", df_cleaned)

Removing duplicates is crucial for accurate reporting because duplicate entries can inflate counts, misrepresent trends, or cause you to report the same event multiple times. In the code above, duplicated() flags every row that repeats an earlier one, and drop_duplicates() returns a new DataFrame that keeps only the first occurrence of each row. Each press release is then counted only once, leading to more trustworthy analysis and storytelling.
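The code above compares entire rows, so two copies of a release that differ in any column are not treated as duplicates. As a sketch under assumptions (the `source` column and its values are hypothetical, added only to illustrate the point), the `subset` parameter of `duplicated()` and `drop_duplicates()` limits the comparison to the columns that identify a release:

```python
import pandas as pd

# Hypothetical sample: the same release collected from two channels
df = pd.DataFrame({
    "title": [
        "Mayor Announces New Park",
        "Mayor Announces New Park",
        "City Council Approves Budget",
    ],
    "date": ["2024-06-01", "2024-06-01", "2024-06-02"],
    "source": ["city website", "email newsletter", "city website"],
})

# Comparing whole rows finds no duplicates because "source" differs
exact_duplicates = int(df.duplicated().sum())

# Comparing only title and date flags the second copy of the park release
key_duplicates = int(df.duplicated(subset=["title", "date"]).sum())

# Keep the first occurrence of each (title, date) pair
df_unique = df.drop_duplicates(subset=["title", "date"], keep="first")

print("Exact duplicate rows:", exact_duplicates)
print("Duplicates by title and date:", key_duplicates)
print(df_unique)
```

Which columns define "the same release" is an editorial decision; title and date are used here only as an example.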

123456789101112131415161718192021222324252627
import pandas as pd # Sample dataset with missing values data = { "title": [ "Mayor Announces New Park", None, "City Council Approves Budget", "Library Opens New Branch" ], "date": [ "2024-06-01", "2024-06-02", None, "2024-06-03" ] } df = pd.DataFrame(data) # Fill missing values with a default value df_filled = df.fillna("Unknown") print("Data with missing values filled:\n", df_filled) # Alternatively, drop rows with any missing values df_dropped = df.dropna() print("\nData with incomplete rows dropped:\n", df_dropped)
copy
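Before choosing between filling and dropping, it can help to count how many values are missing in each column, and the two approaches can also be combined column by column. The sketch below reuses the sample data from the example above; treating a missing title as fillable and a missing date as grounds for dropping the row is an assumption made for illustration, not a rule.

```python
import pandas as pd

# Same sample data as the missing-values example above
df = pd.DataFrame({
    "title": [
        "Mayor Announces New Park",
        None,
        "City Council Approves Budget",
        "Library Opens New Branch",
    ],
    "date": ["2024-06-01", "2024-06-02", None, "2024-06-03"],
})

# Count missing values per column before changing anything
missing_per_column = df.isna().sum()
print("Missing values per column:\n", missing_per_column)

# Fill only the title column; a dict maps column names to fill values
df_titles_filled = df.fillna({"title": "Unknown"})

# Drop only the rows where the date is missing
df_dated = df_titles_filled.dropna(subset=["date"])
print("\nTitles filled, undated rows dropped:\n", df_dated)
```

Handling each column separately avoids placing a text label such as "Unknown" in a date column, where it would later interfere with sorting or date parsing.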

1. What is a common issue when working with real-world data?

2. Which pandas function removes duplicate rows from a DataFrame?

3. Fill in the blank: To fill missing values in a DataFrame, use _ _ _ _.


