Learn Categorical Features Encoding | Core Concepts
Cluster Analysis

Categorical Features Encoding

Clustering algorithms like K-means compute distances between data points, so they need numerical data. Categorical features must first be converted to numerical form through encoding. This chapter covers two approaches: ordinal encoding and one-hot encoding.

Ordinal Encoding

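Ordinal encoding converts ordered categories to numerical values, preserving their rank. For example, ordinal encoding of the 'education_level' column transforms the values "High School", "Bachelor's", "Master's", and "PhD" into 0, 1, 2, and 3.

The encoding implies that adjacent categories are equally spaced: the step from "High School" to "Bachelor's" counts the same as the step from "Master's" to "PhD". This may not reflect the real difference between them, yet a distance-based algorithm such as K-means will treat the gaps as meaningful.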

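from sklearn.preprocessing import OrdinalEncoder

# Categories listed from lowest to highest rank
education_levels = [['High School',
                     "Bachelor's",
                     "Master's",
                     "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)

# df is assumed to be an existing pandas DataFrame with an 'education_level' column
df[['education_encoded']] = encoder.fit_transform(df[['education_level']])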

Note

Such encoding should only be used for ordinal features where category order matters.
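
The snippet above can be tried end to end with a small DataFrame. The following is a minimal sketch, assuming made-up sample rows (the data is hypothetical; only the 'education_level' column name and the category order come from the text above):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({
    "education_level": ["Master's", "High School", "PhD", "Bachelor's", "High School"]
})

# Categories listed from lowest to highest rank; the codes follow this order.
education_levels = [["High School", "Bachelor's", "Master's", "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)

# fit_transform returns a 2D float array; ravel() flattens it into one column.
df["education_encoded"] = encoder.fit_transform(df[["education_level"]]).ravel()
print(df)

# The encoded gap between "High School" (0) and "PhD" (3) is three times the gap
# between "High School" (0) and "Bachelor's" (1). A distance-based algorithm
# takes that ratio at face value.
```

Passing the categories explicitly matters here: without it, the encoder would assign codes in sorted (alphabetical) order, which does not match the educational ranking.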

One-Hot Encoding

One-hot encoding converts nominal (unordered) categories into binary columns, where each category becomes a new column. For a feature with n categories, this typically creates n columns: in each row, the column for that row's category is 1 and all the others are 0. However, only n-1 columns are needed to represent the information without redundancy.

For example, a 'color' column with values 'red', 'blue', and 'green' can be encoded with just two columns, such as 'color_red' and 'color_blue'. A row with 0 in both implies the color is 'green'. Dropping one column removes the redundancy.

In scikit-learn, drop='first' removes the column for the first category. Categories are sorted by default, so for this example 'blue' is dropped and the remaining columns are 'color_green' and 'color_red':

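from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense NumPy array (scikit-learn 1.2+;
# older versions used sparse=False, which has since been removed)
encoder = OneHotEncoder(drop='first', sparse_output=False)

# df is assumed to be an existing pandas DataFrame with a 'color' column
encoded = encoder.fit_transform(df[['color']])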

Note

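While one-hot encoding avoids imposing an order and suits nominal features, it adds a column for each category (or one fewer when a column is dropped), so features with many distinct values can increase data dimensionality considerably.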

Which encoding method is best suited for a categorical feature like 'country' with values such as "USA", "Canada", and "Germany", where there is no natural order?

Section 2. Chapter 2
