Summary
This chapter covers data preprocessing techniques using pandas, including handling missing values, encoding categorical variables, and preparing features for consistent scaling.

General domain of usage
Machine learning

Begin preprocessing by exploring the dataset. Throughout this course, the **penguin dataset** will be used, with the task of predicting the species of a penguin.


There are three possible species, often referred to as **classes** in machine learning: Adelie, Chinstrap, and Gentoo.

The features are: `'island'`, `'culmen_depth_mm'`, `'flipper_length_mm'`, `'body_mass_g'`, and `'sex'`.

The dataset is stored in the `penguins.csv` file. It can be loaded from a link with the `pd.read_csv()` function to examine its contents:


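```python
import pandas as pd

# Load the penguin dataset directly from its URL
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')

# Show the first 10 rows to get a feel for the columns
print(df.head(10))
```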

This dataset presents several issues that need to be addressed:

* Missing data;
* Categorical variables;
* Different feature scales.
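One way to check a dataset for these three issues is sketched below. The helper `summarize_issues` is a hypothetical name introduced only for this illustration, and the rows in `toy` are invented rather than taken from `penguins.csv`, so the numbers printed are not the real dataset's.

```python
import numpy as np
import pandas as pd


def summarize_issues(df):
    """Hypothetical helper: report missing values, non-numeric columns, and numeric ranges."""
    numeric = df.select_dtypes(include='number')
    return {
        # Number of NaN cells in each column
        'missing': df.isna().sum().to_dict(),
        # Columns that are not numeric and would need encoding
        'categorical': [c for c in df.columns if c not in numeric.columns],
        # (min, max) of each numeric column, to compare scales
        'ranges': {c: (numeric[c].min(), numeric[c].max()) for c in numeric.columns},
    }


# Invented rows for illustration only
toy = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream'],
    'culmen_depth_mm': [18.7, np.nan, 15.2],
    'body_mass_g': [3750.0, 3800.0, 5200.0],
})

report = summarize_issues(toy)
print(report)
```

The same function could be called on the `df` loaded above; `df.info()` and `df.describe()` give a similar overview with built-in pandas methods.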


## Missing Data

Most ML algorithms cannot process missing values directly, so these must be addressed before training. Missing values can either be **removed** or **imputed** (replaced with substitute values).

In `pandas`, empty cells are represented as `NaN`. Many ML models will raise an error if the dataset contains even a single `NaN`.
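The two options, removing and imputing, can be sketched with plain pandas as follows. The rows are invented for illustration, and using the column median as the substitute value is an assumption made here; other strategies (mean, most frequent value) are equally possible.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration only; not taken from penguins.csv
toy = pd.DataFrame({
    'culmen_depth_mm': [18.7, np.nan, 18.0, 15.2],
    'body_mass_g': [3750.0, 3800.0, np.nan, 5200.0],
    'sex': ['MALE', 'FEMALE', None, 'FEMALE'],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: remove every row that contains at least one missing value
dropped = toy.dropna()

# Option 2: impute numeric columns with the column median (assumed strategy)
imputed = toy.copy()
numeric_cols = ['culmen_depth_mm', 'body_mass_g']
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())
print(imputed)
```

Removing rows is simple but discards data (here half of the toy rows), while imputing keeps every row at the cost of introducing estimated values. Note that the `'sex'` column is left untouched by the numeric imputation and would need its own strategy.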


## Categorical Data

The dataset includes categorical variables, such as `'island'` and `'sex'`, which most machine learning models cannot process directly.

Categorical data must therefore be **encoded** into numerical form.
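As a minimal sketch of what encoding can look like, the example below uses only pandas: one-hot encoding for a multi-valued column and a 0/1 mapping for a two-valued one. The rows and the category spellings (`'MALE'`, `'FEMALE'`) are assumptions made for illustration, and the course itself may use different tools for this step.

```python
import pandas as pd

# Invented rows for illustration only; category spellings are assumed
toy = pd.DataFrame({
    'island': ['Torgersen', 'Biscoe', 'Dream', 'Biscoe'],
    'sex': ['MALE', 'FEMALE', 'FEMALE', 'MALE'],
})

# One-hot encode 'island': one 0/1 column per category
island_encoded = pd.get_dummies(toy['island'], prefix='island', dtype=int)

# Map the two-valued 'sex' column to 0/1
sex_encoded = toy['sex'].map({'FEMALE': 0, 'MALE': 1})

# Combine the encoded columns into a fully numeric table
encoded = pd.concat([island_encoded, sex_encoded], axis=1)
print(encoded)
```

One-hot encoding avoids implying an order between islands, which a simple 0/1/2 numbering would do.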


## Different Scales

`'culmen_depth_mm'` values range from 13.1 to 21.5, while `'body_mass_g'` values range from 2700 to 6300. Because of this difference in magnitude, some ML models may treat the `'body_mass_g'` feature as **much more important** than `'culmen_depth_mm'`.

**Scaling** solves this problem. It will be covered in later chapters.
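To give a rough idea of the effect, here is a minimal sketch of one common approach, min-max scaling, written with plain pandas. The three rows are invented so that they span the ranges quoted above, and the choice of min-max scaling is an assumption; the later chapters may use a different method.

```python
import pandas as pd

# Invented values for illustration, chosen to span the ranges quoted above
toy = pd.DataFrame({
    'culmen_depth_mm': [13.1, 17.0, 21.5],
    'body_mass_g': [2700.0, 4200.0, 6300.0],
})

# Min-max scaling: (x - min) / (max - min) maps each column to the range [0, 1]
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

After scaling, both columns lie between 0 and 1, so neither dominates simply because of its units.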

