Data Pipelines with Python

Data Cleaning and Normalization

import pandas as pd

# Sample data with missing values, duplicates, and inconsistent column names
data = {
    "First Name": ["Alice", "Bob", "Charlie", "Bob", None],
    "last name": ["Smith", "Jones", "Brown", "Jones", "Williams"],
    "Age": [25, None, 35, None, 28],
    "Email": ["alice@example.com", "bob@example.com", "charlie@example.com", "bob@example.com", None]
}

df = pd.DataFrame(data)

# 1. Standardize column names: lowercase and replace spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# 2. Handle missing values: fill missing 'age' with median, drop rows missing 'first_name' or 'email'
df['age'] = df['age'].fillna(df['age'].median())
df = df.dropna(subset=['first_name', 'email'])

# 3. Remove duplicates based on all columns
df = df.drop_duplicates()

print(df)

When preparing data for analysis or loading, you must ensure the data is clean, consistent, and ready for transformation. Data cleaning involves identifying and handling missing values, removing duplicate records, and standardizing column names. Using the pandas library, you can efficiently perform these tasks to improve data quality and reliability.

Normalization is the process of adjusting values measured on different scales to a common scale. In data pipelines, normalization can also refer to making sure data types are correct and values are consistent. For example, you may convert all column names to lowercase and replace spaces with underscores for uniformity. This helps prevent errors when merging or querying data later.
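
As a minimal sketch of scaling values to a common range, min-max normalization can be done directly in pandas. The numeric 'age' column below mirrors the example above; the 'age_scaled' column name is our own, used purely for illustration:

import pandas as pd

df = pd.DataFrame({"age": [25, 31, 35, 28]})

# Min-max scaling: map 'age' onto the common [0, 1] range
age_min, age_max = df["age"].min(), df["age"].max()
df["age_scaled"] = (df["age"] - age_min) / (age_max - age_min)

print(df)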

Data types are also critical for clean data. Always check that columns have appropriate types (such as numeric, string, or datetime) before loading data into a downstream system. Use pandas methods like astype() to enforce correct types as needed.
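
Here is a minimal sketch of type enforcement; the 'signup_date' column is hypothetical, added only to demonstrate datetime conversion alongside astype():

import pandas as pd

df = pd.DataFrame({
    "age": ["25", "31", "35"],  # numeric values stored as strings
    "signup_date": ["2024-01-05", "2024-02-17", "2024-03-02"]
})

# Enforce a numeric type for 'age' and a datetime type for 'signup_date'
df["age"] = df["age"].astype(int)
df["signup_date"] = pd.to_datetime(df["signup_date"])

print(df.dtypes)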

Best practices for clean data include:

  • Standardizing column names for consistency;
  • Filling or dropping missing values based on the context and analysis requirements;
  • Removing duplicate records to avoid skewed results;
  • Ensuring all columns have the correct data types;
  • Documenting any cleaning steps for reproducibility.

By following these steps, you set a strong foundation for reliable analysis and smooth data loading in your pipeline.
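
One possible way to keep those steps documented and reproducible is to bundle them into a single function. The sketch below (the name clean_dataframe and the required parameter are illustrative choices, not part of the lesson's code) simply collects the techniques from this chapter:

import pandas as pd

def clean_dataframe(df: pd.DataFrame, required: list) -> pd.DataFrame:
    """Apply the chapter's cleaning steps: rename, fill, drop, dedupe."""
    df = df.copy()
    # Standardize column names for consistency
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Fill missing values in numeric columns with the column median
    for col in df.select_dtypes(include="number").columns:
        df[col] = df[col].fillna(df[col].median())
    # Drop rows missing required fields, then remove exact duplicates
    df = df.dropna(subset=required)
    return df.drop_duplicates()

Calling clean_dataframe(df, required=["first_name", "email"]) on the sample data above should reproduce the result of the original script.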

