Machine Learning Foundations with Scikit-Learn

Getting Familiar with Dataset

Begin preprocessing by exploring the dataset. Throughout this course, the penguin dataset will be used, with the task of predicting the species of a penguin.

There are three possible species, referred to as classes in machine learning: Adelie, Chinstrap, and Gentoo.

The features are: 'island', 'culmen_depth_mm', 'flipper_length_mm', 'body_mass_g', and 'sex'.

The dataset is stored in the penguins.csv file. It can be loaded from a link with the pd.read_csv() function to examine its contents:

    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv')
    print(df.head(10))
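Since the task is to predict the species from the listed features, a common next step is to separate the feature columns from the target. The sketch below is illustrative only: it uses a tiny made-up DataFrame instead of the real file, and it assumes the target column is named 'species', which the text does not state explicitly.

```python
import pandas as pd

# Tiny hypothetical stand-in for penguins.csv; all values are made up for illustration.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],  # assumed name of the target column
    "island": ["Torgersen", "Biscoe", "Dream"],
    "culmen_depth_mm": [18.7, 14.3, 17.9],
    "flipper_length_mm": [181.0, 215.0, 195.0],
    "body_mass_g": [3750.0, 5000.0, 3650.0],
    "sex": ["MALE", "FEMALE", "FEMALE"],
})

# The five features named in the lesson.
feature_columns = ["island", "culmen_depth_mm", "flipper_length_mm", "body_mass_g", "sex"]

X = df[feature_columns]  # feature matrix: one row per penguin
y = df["species"]        # target: the class to predict

print(X.shape)
print(y.unique())
```

Keeping the column list explicit makes it obvious which columns the model will see and prevents the target from leaking into the features.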

This dataset presents several issues that need to be addressed:

  • Missing data;
  • Categorical variables;
  • Different feature scales.

Missing Data

Most ML algorithms cannot process missing values directly, so these must be addressed before training. Missing values can either be removed or imputed (replaced with substitute values).

In pandas, empty cells are represented as NaN. Many ML models will raise an error if the dataset contains even a single NaN.
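The two options mentioned above, removing and imputing, can be sketched with standard pandas calls. The small DataFrame here is hypothetical and only mimics the shape of the penguin data. Mean imputation is just one possible strategy, chosen for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a few gaps; the values are made up for illustration.
df = pd.DataFrame({
    "culmen_depth_mm": [18.7, np.nan, 17.9, 14.3],
    "body_mass_g": [3750.0, 3800.0, np.nan, 5000.0],
    "sex": ["MALE", None, "FEMALE", "FEMALE"],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Option 1: remove every row that contains at least one missing value.
df_dropped = df.dropna()

# Option 2: impute numeric columns with the column mean (one simple strategy).
numeric_columns = ["culmen_depth_mm", "body_mass_g"]
df_imputed = df.copy()
df_imputed[numeric_columns] = df_imputed[numeric_columns].fillna(
    df_imputed[numeric_columns].mean()
)

print(len(df_dropped))
print(df_imputed)
```

Dropping rows is the simplest fix but discards data. Imputation keeps every row at the cost of introducing estimated values. Note that the 'sex' column is left untouched here, because a mean is not defined for text values.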

Categorical Data

The dataset includes categorical variables, such as 'island' and 'sex', which most machine learning models cannot process directly.

Categorical data must be encoded into numerical form.
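As a rough illustration of what encoding means, the sketch below turns each category into its own 0/1 column using pd.get_dummies(). This is one common approach (one-hot encoding) and not necessarily the encoder used later in the course. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration.
df = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "body_mass_g": [3750.0, 5000.0, 3650.0, 5400.0],
})

# Columns holding categories rather than numbers.
categorical_columns = ["island", "sex"]

# One-hot encoding: each category becomes a separate 0/1 column.
encoded = pd.get_dummies(df, columns=categorical_columns, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

After encoding, every column is numeric, so the data can be passed to models that only accept numbers.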

Different Scales

'culmen_depth_mm' values range from 13.1 to 21.5, while 'body_mass_g' values range from 2700 to 6300. Because the magnitudes differ so much, some ML models (for example, distance-based ones such as k-nearest neighbors) may treat 'body_mass_g' as far more important than 'culmen_depth_mm'.

Scaling solves this problem. It will be covered in later chapters.
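To make the scale difference concrete, the sketch below compares column ranges before and after standardization (subtracting the mean and dividing by the standard deviation). Standardization is only one scaling technique and is used here as an assumption, not as the method the course will teach. The sample reuses the minimum and maximum values quoted above, and the middle values are made up.

```python
import pandas as pd

# Hypothetical sample: the extremes match the ranges quoted in the text,
# the middle values are made up for illustration.
df = pd.DataFrame({
    "culmen_depth_mm": [13.1, 17.0, 18.5, 21.5],
    "body_mass_g": [2700.0, 3800.0, 4500.0, 6300.0],
})

# Ranges before scaling: the two columns differ by orders of magnitude.
print(df.agg(["min", "max"]))

# Standardize each column: subtract its mean, divide by its standard deviation.
standardized = (df - df.mean()) / df.std()

# Ranges after scaling: both columns now live on a comparable scale.
print(standardized.agg(["min", "max"]))
```

After standardization each column has a mean of 0 and a standard deviation of 1, so no feature dominates merely because of its units.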

Match the problem with a way to solve it.

Missing values –
Categorical data –
Different scales –

Section 1. Chapter 7
