Automating Data Cleaning for Consistency | Automating Government Workflows with Python

Python for Government Analysts

Government datasets often arrive with data quality issues that complicate analysis and reporting. Two of the most common are inconsistent capitalization in text fields, such as region or department names, and missing values in key columns. Left uncorrected, these inconsistencies can distort summary statistics, produce apparent duplicates (for example, "North" and "north" counted as separate regions), and make policy insights unreliable. Recognizing and addressing these issues is a critical first step in any data-driven government project.


# Example dataset: a list of dictionaries representing records with inconsistent region names and missing values
data = [
    {'region': 'North', 'population': 120000, 'budget': 3000000},
    {'region': 'north', 'population': None, 'budget': 2800000},
    {'region': 'SOUTH', 'population': 95000, 'budget': None},
    {'region': 'East', 'population': 80000, 'budget': 2200000},
    {'region': '', 'population': 110000, 'budget': 2500000},
    {'region': 'south', 'population': 91000, 'budget': 2100000},
    {'region': None, 'population': 105000, 'budget': 2400000}
]
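
To make the problem concrete, the short sketch below counts the raw region labels before any cleaning. The `raw_regions` list is a hypothetical copy of the region values from the records above, and `Counter` is just one convenient way to tally them.

```python
from collections import Counter

# Hypothetical copy of the region values from the example records above
raw_regions = ['North', 'north', 'SOUTH', 'East', '', 'south', None]

# Each distinct spelling (and each kind of blank) is tallied as its own label
raw_counts = Counter(raw_regions)

print(raw_counts)
print(len(raw_counts))  # number of distinct raw labels
```

Every spelling variant, the empty string, and `None` each form a separate group, so a summary built on these labels would report more regions than actually exist.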

To prepare your data for analysis, you can use Python to automate the process of cleaning and standardizing fields. For text fields like region, converting all entries to the same case (for example, all lowercase or title case) helps ensure consistency. For missing values, you can choose to fill them with a default value, such as the mean or median for numerical fields, or a placeholder like "Unknown" for text fields. This standardization makes it easier to group, summarize, and analyze your data accurately.


# Cleaning and standardizing the dataset

import pandas as pd

# Load the records from the `data` list defined in the earlier example into a DataFrame
df = pd.DataFrame(data)

# Standardize the 'region' column: strip whitespace, replace missing/empty values with 'Unknown', and convert to title case
df['region'] = df['region'].fillna('').str.strip()
df['region'] = df['region'].replace('', 'Unknown')
df['region'] = df['region'].str.title()

# Fill missing numerical values with the column mean
df['population'] = df['population'].fillna(df['population'].mean())
df['budget'] = df['budget'].fillna(df['budget'].mean())

print(df)
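
As a variation on the step above, the sketch below fills gaps with the median instead of the mean and then totals the figures per region. The `records` list and the names `summary_df` and `region_totals` are illustrative assumptions, not part of the lesson's dataset. The median is generally less sensitive to extreme values than the mean, which may matter for skewed figures such as budgets.

```python
import pandas as pd

# Hypothetical records whose 'region' values are already standardized
records = [
    {'region': 'North', 'population': 120000, 'budget': 3000000},
    {'region': 'North', 'population': None, 'budget': 2800000},
    {'region': 'South', 'population': 95000, 'budget': None},
    {'region': 'South', 'population': 91000, 'budget': 2100000},
]
summary_df = pd.DataFrame(records)

# Fill missing numeric values with each column's median
for column in ['population', 'budget']:
    summary_df[column] = summary_df[column].fillna(summary_df[column].median())

# Group by the standardized region name and total each numeric column
region_totals = summary_df.groupby('region')[['population', 'budget']].sum()
print(region_totals)
```

Because the region names are consistent, each region appears once in the summary. Whether the mean, the median, or leaving values missing is appropriate depends on the dataset and on how the figures will be reported.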

Section 3. Chapter 6

Automating Data Cleaning for Consistency

1. Why is data cleaning important before analysis?

2. What is a common method for handling missing values in Python?

3. How can you standardize text fields in a dataset?