Summary  
This chapter covers how to inspect and summarize a tabular dataset by printing sample records, column types, non-null counts, and descriptive statistics to surface issues like missing values, duplicates, and outliers.

General domain of usage  
Machine learning data preprocessing

When you work with machine learning, the quality of your data is one of the most important factors in building effective models. **High-quality data** allows algorithms to learn accurate patterns, while **poor data** can lead to misleading results, wasted resources, and unreliable predictions. Raw datasets almost always contain issues that must be addressed before you can trust the outcomes of your analysis.

Definition: Data Quality

**Data quality** is a measure of how accurately and completely a dataset reflects the real world. It matters because machine learning models depend on accurate, consistent, and relevant information to make reliable predictions.

Common data quality problems include **missing values**, where some entries in a dataset are empty; **duplicate records**, which can bias results or inflate the importance of certain data points; and **outliers**, which are values that are unusually high or low compared to the rest of the data. Other issues may involve **inconsistent formatting**, **incorrect data types**, or errors introduced during data collection. Each of these problems can distort the patterns that machine learning models try to learn, leading to poor performance or unexpected behavior.
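The three most common problems named above can each be checked with a line or two of pandas. The following is a minimal sketch on a hypothetical toy table (the `toy` DataFrame and its values are invented for illustration), and it assumes the 1.5 × IQR rule as the outlier criterion, which is one common rule of thumb rather than the only choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one exact duplicate row, one extreme fare
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 35.0, 29.0],
    "fare": [7.25, 71.28, 7.92, 8.05, 8.05, 512.33],
})

# Missing values: number of empty entries in each column
missing_per_column = toy.isna().sum()

# Duplicate records: rows that are identical to an earlier row
duplicate_count = int(toy.duplicated().sum())

# Outliers: values more than 1.5 * IQR beyond the quartiles (assumed rule of thumb)
q1, q3 = toy["fare"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["fare"] < q1 - 1.5 * iqr) | (toy["fare"] > q3 + 1.5 * iqr)
outlier_fares = toy.loc[is_outlier, "fare"].tolist()

print(missing_per_column)
print("Duplicate rows:", duplicate_count)
print("Outlier fares:", outlier_fares)
```

Whether a flagged value is a genuine error or a legitimate extreme depends on the data, so checks like these are best treated as prompts for a closer look rather than automatic deletions.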

```python
import pandas as pd
import seaborn as sns

# Load a sample dataset from seaborn (downloaded on first use, then cached)
df = sns.load_dataset('titanic')

# Display the first few rows
print("Head of dataset:")
print(df.head())

# Show column types and non-null counts.
# df.info() prints its report itself and returns None, so it is not wrapped in print().
print("\nInfo:")
df.info()

# Show summary statistics for numerical columns
print("\nDescribe:")
print(df.describe())
```

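Interpreting Summary Statistics

When reviewing `df.describe()`, focus on the **count**, the **minimum and maximum values**, and the **standard deviation** of each column. A count lower than the number of rows signals **missing data** in that column; a minimum or maximum far from the quartiles, or an unusually large standard deviation, points to possible **outliers** or **inconsistent entries** that need cleaning.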
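As a rough illustration of reading these statistics programmatically, the sketch below compares each column's count with the number of rows and measures how far the maximum sits above the 75th percentile. The `toy` DataFrame and its values are hypothetical, and the "gap above the third quartile" is only an informal indicator chosen here for illustration, not a standard test.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.92, 8.05, 512.33],
})

stats = toy.describe()

# A count below the number of rows means the column has missing values
missing_from_count = len(toy) - stats.loc["count"]

# A maximum far above the 75th percentile hints at a high outlier
gap_above_q3 = stats.loc["max"] - stats.loc["75%"]

print(missing_from_count)
print(gap_above_q3)
```

Here the `age` column comes up one value short of the row count, and the gap for `fare` is far larger than the gap for `age`, which marks `fare` as the column to inspect first.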




Understanding Data Quality