## Advanced Techniques in pandas

# Managing Categorical Variables

Now you will work with a data set that doesn't contain missing values. The `NaN` values in the `'Age'` column were replaced with the **mean** of the column, and the `NaN` value in the `'Fare'` column was deleted.

So, now it's time to learn how to manage categorical variables. A categorical variable takes one of a limited set of values, called categories. For instance, the column `'Sex'` contains `'male'` and `'female'`, and the column `'Embarked'` contains `'Q'`, `'S'`, and `'C'`.

**What should we do to calculate the number of values in each category or to find out information on them?**

You already know `.loc[]`, `.isin()`, `.between()`, and many other functions, but pandas offers a more convenient way to do this: the function `pd.get_dummies()`. As an example, we will apply it to the column `'Embarked'`. Look at the implementation and the result (we will output five random passengers' names and the new columns that we created).
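The implementation described above might look like the following minimal sketch. The small `data` frame and its passenger names are invented for illustration, since the course data set is not reproduced here. Passing `dtype=int` is also an assumption: it makes the new columns hold 1/0 on every pandas version, because pandas 2.0 and later return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical stand-in for the course data set (names are invented).
data = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C',
             'Passenger D', 'Passenger E', 'Passenger F'],
    'Embarked': ['S', 'C', 'Q', 'S', 'S', 'C'],
})

# Replace 'Embarked' with one 0/1 column per category.
dummies = pd.get_dummies(data, columns=['Embarked'], dtype=int)

# Show five random passengers: their names and the new columns.
print(dummies.sample(5))
```

Because `.sample(5)` picks rows at random, the printed rows can change from run to run.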

Because the rows are chosen at random, each run can produce a different combination of **five randomly selected rows**, so your output may differ from one run to the next.

**Explanation:**

As a result, our function split the column `'Embarked'` into three columns: `'Embarked_C'`, `'Embarked_Q'`, and `'Embarked_S'`. There are three new columns because `'Embarked'` has three categories, and each passenger belongs to exactly one of them. For each passenger, the function puts `1` in the column that matches the passenger's original category and `0` in the other two. Thus, every row has `1` in just one of the new columns.
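The claim that each row gets `1` in exactly one column can be checked with a short sketch. The four-row `data` frame below is a made-up example, and `dtype=int` is again an assumption used to force 1/0 values regardless of the pandas version.

```python
import pandas as pd

# Hypothetical sample: one row per passenger, no missing values.
data = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
dummies = pd.get_dummies(data, columns=['Embarked'], dtype=int)

# Add up the three dummy columns row by row (axis=1).
row_totals = dummies[['Embarked_C', 'Embarked_Q', 'Embarked_S']].sum(axis=1)

# Every total should be 1: one category per passenger.
print((row_totals == 1).all())
```

This check relies on the column having no missing values; a row with `NaN` in `'Embarked'` would get `0` in all three columns.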

- `pd.get_dummies()`: this function converts **categorical** variables into **dummy** ones (1 or 0; in pandas 2.0 and later the default output is `True`/`False` unless you pass `dtype=int`);
- `data`: the data frame that you want to use;
- `columns = ['Embarked']`: the columns with categorical variables that you want to transform into dummy ones. Pay attention: it is **obligatory** to put the column names into a list.
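To come back to the question of how many values fall into each category: since every dummy column holds a `1` for each matching row, summing a column counts the rows in that category. The sketch below assumes a small invented `data` frame and uses `dtype=int` to keep the values numeric on any pandas version.

```python
import pandas as pd

# Hypothetical sample data, not the course data set.
data = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'S', 'C']})
dummies = pd.get_dummies(data, columns=['Embarked'], dtype=int)

# Summing each dummy column gives the number of rows per category.
counts = dummies.sum()
print(counts)

# Cross-check against a direct count of the original column.
print(data['Embarked'].value_counts())
```

Both printouts should report the same totals per category; only the labels differ (`Embarked_S` versus `S`, and so on).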

Task

Your task here is to transform the column `'Sex'` into dummy variables instead of categorical ones. Then output the **sum** of the values in each category.
