Categorical Features Encoding
Clustering algorithms such as K-means operate on numerical data, so categorical features must first be converted to numbers through encoding. This chapter covers two encodings: ordinal and one-hot.
Ordinal Encoding
Ordinal encoding converts ordered categories to numerical values, preserving their rank. For example, ordinal encoding of the 'education_level' column transforms its values "High School", "Bachelor's", "Master's", and "PhD" into 0, 1, 2, and 3, respectively.
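This encoding implicitly treats the gap between consecutive categories as equal (the step from 0 to 1 counts the same as the step from 2 to 3), which may not reflect the real differences between the categories.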
from sklearn.preprocessing import OrdinalEncoder

# Categories listed from lowest to highest rank
education_levels = [["High School", "Bachelor's", "Master's", "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)
# fit_transform expects and returns 2D data, hence the double brackets
df[['education_encoded']] = encoder.fit_transform(df[['education_level']])
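To see the resulting codes on concrete data, here is a minimal self-contained sketch. The four-row DataFrame is hypothetical toy data invented for illustration; only the column name and the category order come from the example above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data (an assumption, not part of the lesson's dataset)
df = pd.DataFrame(
    {"education_level": ["Master's", "High School", "PhD", "Bachelor's"]}
)

# Categories listed from lowest to highest rank so the codes preserve the order
education_levels = [["High School", "Bachelor's", "Master's", "PhD"]]
encoder = OrdinalEncoder(categories=education_levels)

# fit_transform returns a 2D float array; ravel() flattens its single column
df["education_encoded"] = encoder.fit_transform(df[["education_level"]]).ravel()

print(df)
```

Each code is the position of the category in the list passed to the encoder, so "High School" becomes 0.0 and "PhD" becomes 3.0 regardless of the order in which the rows appear.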
One-Hot Encoding
One-hot encoding converts nominal (unordered) categories into binary columns, where each category becomes a new column. For a feature with n categories, this typically creates n columns: in each row, the column for that row's category is 1 and the others are 0. However, only n-1 columns are actually needed to represent the information without redundancy.
For example, a 'color' column with values 'red', 'blue', and 'green' can be encoded with just two columns: 'color_red' and 'color_blue'. If a row has 0 in both, it implies the color is 'green'. By dropping one column, we avoid redundancy.
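The removal of the redundant column is specified via drop='first', which drops the first category of each feature. With the default category detection, categories are sorted, so for the 'color' column scikit-learn drops 'blue' rather than 'green'; the idea is the same whichever category is left out: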
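from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (the parameter was named sparse before scikit-learn 1.2)
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(df[['color']])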