Challenge 1
Task
In this challenge, you will work with the 'adult-census.csv' dataset, which contains both categorical and numerical data. Your task is to prepare the data for processing.
- Read the dataset 'adult-census.csv'.
- Explore the dataset. Carefully check which character marks missing data, and replace it with np.nan.
- Remove the rows with missing values.
- Process the categorical columns 'workclass' and 'sex' using one-hot encoding.
- Scale the numeric columns 'age' and 'hours-per-week'.
- Print the processed data.
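The exploration step is easier to reason about on a small example. The sketch below is only an illustration: the inline CSV is hypothetical toy data standing in for 'adult-census.csv', and the ' ?' placeholder (with a leading space) is an assumption about how the marker might appear after parsing. Printing values with repr() makes any surrounding whitespace visible.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for adult-census.csv.
# Note the space after each comma: it ends up inside the parsed text values.
raw = io.StringIO(
    "age,workclass,sex,hours-per-week\n"
    "39, State-gov, Male,40\n"
    "50, ?, Male,13\n"
    "38, Private, Female,40\n"
)
toy = pd.read_csv(raw)

# Show the distinct values of each text column, quoted with repr()
# so that leading or trailing spaces in a placeholder are visible.
for col in ["workclass", "sex"]:
    print(col, [repr(v) for v in toy[col].unique()])

# Replace the placeholder found above and count the NaNs per column.
cleaned = toy.replace(" ?", np.nan)
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

If the placeholder in the real file differs (for example '?' without the space), the replace call has to match it exactly, which is why the inspection comes first.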
Solution
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
import pandas as pd
# Read the dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/9c23bf60-276c-4989-a9d7-3091716b4507/datasets/adult-census.csv')
# Replace the missing-data marker (' ?', note the leading space) with np.nan
df = df.replace(' ?', np.nan)
# Drop all rows with nan values
df = df.dropna()
# Create MinMaxScaler
scaler = MinMaxScaler()
# Make one-hot encoding with pd.get_dummies()
one_hot = pd.get_dummies(df[['workclass', 'sex']])
# Join the encoded columns
df = df.join(one_hot)
# Drop initial columns
df = df.drop(['workclass', 'sex'], axis=1)
# Fit and transform numerical data for the scaler
df[['age', 'hours-per-week']] = scaler.fit_transform(df[['age', 'hours-per-week']])
# Print new data
print(df)
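To see what the encoding and scaling steps produce, here is a minimal sketch on a hypothetical three-row frame (the values are made up; only the column names come from the exercise). The scaling is written out by hand as (x - min) / (max - min), which is the formula MinMaxScaler applies with its default feature_range of (0, 1).

```python
import pandas as pd

# Hypothetical toy frame reusing column names from the exercise.
toy = pd.DataFrame({
    "age": [20, 30, 60],
    "hours-per-week": [10, 40, 50],
    "sex": [" Male", " Female", " Male"],
})

# One-hot encoding: one indicator column per distinct category value,
# named "<column>_<value>".
encoded = pd.get_dummies(toy[["sex"]])
print(list(encoded.columns))

# Min-max scaling by hand, computed independently for each column.
numeric = toy[["age", "hours-per-week"]]
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())
print(scaled)
```

Each scaled column runs from 0 (its minimum) to 1 (its maximum), so 'age' and 'hours-per-week' end up on the same scale despite their different original ranges.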
Starter code (fill in each ___ blank)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import numpy as np
import pandas as pd
# Read the dataset
df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/9c23bf60-276c-4989-a9d7-3091716b4507/datasets/adult-census.csv')
# Replace the missing-data marker with np.nan
df = df.___(___, np.nan)
# Drop all rows with nan values
df = df.___()
# Create MinMaxScaler
scaler = MinMaxScaler()
# Make one-hot encoding with pd.get_dummies()
one_hot = pd.___(df[['workclass', 'sex']])
# Join the encoded columns
df = df.___(one_hot)
# Drop initial columns
df = df.___(['workclass', 'sex'], axis=1)
# Fit and transform numerical data for the scaler
df[['age', 'hours-per-week']] = scaler.___(df[['age', 'hours-per-week']])
# Print new data
print(df)