One-Hot Encoder
When it comes to nominal values, handling them is a bit more complex.
Let's consider a feature containing ordinal data, such as user ratings. Its values range from 'Terrible' to 'Great'. It makes sense to encode these ratings as numbers from 0 to 4 because the ML model will recognize the inherent order.
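As a minimal sketch of the ordinal case, the ratings can be encoded with scikit-learn's OrdinalEncoder by passing the order explicitly. Only 'Terrible' and 'Great' come from the text above; the three middle labels and the `ratings` DataFrame are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ratings column; the middle labels ('Bad', 'OK', 'Good') are assumed
ratings = pd.DataFrame({'rating': ['Great', 'Terrible', 'OK', 'Good', 'Bad']})

# Pass the categories explicitly so the codes follow the real order,
# not the alphabetical one
order = ['Terrible', 'Bad', 'OK', 'Good', 'Great']
encoder = OrdinalEncoder(categories=[order])

encoded = encoder.fit_transform(ratings[['rating']])
print(encoded.ravel())  # 'Terrible' maps to 0.0 and 'Great' to 4.0
```

Without the explicit `categories` argument, the encoder would sort the labels alphabetically, which would not match the intended ranking.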
Now, consider a feature labeled 'city' with five distinct cities. Encoding them as numbers from 0 to 4 would suggest to the ML model a logical order that doesn't actually exist. Therefore, a more suitable approach is one-hot encoding, which avoids implying any false order.
To encode nominal data, use the OneHotEncoder transformer. It creates a column for each unique value. Then, for each row, it puts 1 in the column that matches the row's value and 0 in all the other columns.
What was originally 'NewYork' now has 1 in the 'City_NewYork' column and 0 in the other City_ columns.
Let's use OneHotEncoder on our penguins dataset! There are two nominal features, 'island' and 'sex' (not counting 'species'; we will learn how to deal with target encoding in the next chapter).
    import pandas as pd

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

    # Show the unique values of each nominal feature
    print('island: ', df['island'].unique())
    print('sex: ', df['sex'].unique())
To use OneHotEncoder, you just need to initialize an object and pass the columns to its .fit_transform() method, like with any other transformer.
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins_imputed.csv')

    # Assign X, y variables
    y = df['species']
    X = df.drop('species', axis=1)

    # Initialize a OneHotEncoder object
    one_hot = OneHotEncoder()

    # Print transformed 'sex', 'island' columns
    # (.toarray() converts the sparse result to a dense array)
    print(one_hot.fit_transform(X[['sex', 'island']]).toarray())