Cleaning Messy Data
In journalism, data rarely arrives in a perfect, ready-to-use format. You will often encounter messy datasets filled with missing values, inconsistent formatting, and duplicate entries. These issues can distort your findings or even lead to incorrect conclusions if not handled properly. For example, a dataset of press releases from various organizations might include several versions of the same release, entries missing key information such as dates or authors, or inconsistent capitalization and date formats. Understanding how to identify and fix these problems is essential for producing reliable, accurate stories.
import pandas as pd

# Sample dataset of press releases with duplicate entries
data = {
    "title": [
        "Mayor Announces New Park",
        "Mayor Announces New Park",
        "City Council Approves Budget",
        "City Council Approves Budget",
        "City Council Approves Budget",
        "Library Opens New Branch"
    ],
    "date": [
        "2024-06-01",
        "2024-06-01",
        "2024-06-02",
        "2024-06-02",
        "2024-06-02",
        "2024-06-03"
    ]
}

df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()
print("Duplicate rows:\n", df[duplicates])

# Remove duplicate rows
df_cleaned = df.drop_duplicates()
print("\nData after removing duplicates:\n", df_cleaned)
Removing duplicates is crucial for accurate reporting because duplicate entries can inflate counts, misrepresent trends, or cause you to report the same event multiple times. By identifying and deleting repeated rows, as shown in the code above, you ensure that each press release is counted only once, leading to more trustworthy analysis and storytelling.
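In real reporting work, duplicates are not always exact row-for-row copies: the same release may appear with a different source or a reworded footer. One common approach is to deduplicate on the columns that identify the event. A minimal sketch, using the standard `subset` and `keep` arguments of `drop_duplicates` (the sample rows and the choice of identifying columns are assumptions for illustration):

```python
import pandas as pd

# Hypothetical example: the same release appears twice with different sources
df = pd.DataFrame({
    "title": ["Mayor Announces New Park", "Mayor Announces New Park"],
    "date": ["2024-06-01", "2024-06-01"],
    "source": ["City Hall website", "Newswire copy"],
})

# Treat rows as duplicates when title and date match, keeping the first copy
deduped = df.drop_duplicates(subset=["title", "date"], keep="first")
print(deduped)
```

Choosing the subset is an editorial judgment: matching on title and date catches near-duplicates, but too loose a subset can merge genuinely distinct events, so spot-check what gets dropped.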
import pandas as pd

# Sample dataset with missing values
data = {
    "title": [
        "Mayor Announces New Park",
        None,
        "City Council Approves Budget",
        "Library Opens New Branch"
    ],
    "date": [
        "2024-06-01",
        "2024-06-02",
        None,
        "2024-06-03"
    ]
}

df = pd.DataFrame(data)

# Fill missing values with a default value
df_filled = df.fillna("Unknown")
print("Data with missing values filled:\n", df_filled)

# Alternatively, drop rows with any missing values
df_dropped = df.dropna()
print("\nData with incomplete rows dropped:\n", df_dropped)
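The examples so far handle duplicates and missing values; the inconsistent capitalization and date formats mentioned at the start of the lesson can be normalized with pandas string methods and `pd.to_datetime`. A minimal sketch, with invented sample values (note that `format="mixed"` requires pandas 2.0 or newer):

```python
import pandas as pd

# Hypothetical dataset with inconsistent capitalization and date formats
df = pd.DataFrame({
    "title": ["MAYOR ANNOUNCES NEW PARK", "library opens new branch"],
    "date": ["06/01/2024", "2024-06-03"],
})

# Standardize capitalization to title case
df["title"] = df["title"].str.title()

# Parse differently formatted date strings into a single datetime column
# (format="mixed" asks pandas to infer the format for each value; pandas >= 2.0)
df["date"] = pd.to_datetime(df["date"], format="mixed")

print(df)
```

Normalizing text and dates before counting or sorting prevents "2024-06-01" and "06/01/2024" from being treated as different days, or the same title in different cases from being counted twice.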
1. What is a common issue when working with real-world data?
2. Which pandas function removes duplicate rows from a DataFrame?
3. Fill in the blank: To fill missing values in a DataFrame, use _ _ _ _.