# Managing Categorical Variables

You will now work with a version of the data set that contains no missing values: the `NaN` values in the `'Age'` column were replaced with the column mean, and the `NaN` value in the `'Fare'` column was removed. It's time to learn how to manage categorical variables, that is, variables whose values fall into a limited set of categories. For instance, the `'Sex'` column contains `'male'` and `'female'`, and the `'Embarked'` column contains `'Q'`, `'S'`, and `'C'`.

How can we count the values in each category or otherwise get information about them?

You already know `.loc[]`, `.isin()`, `.between()`, and many other functions, but pandas offers a more convenient way to do this: the function `pd.get_dummies()`. As an example, we will apply it to the `'Embarked'` column. Look at the implementation and the result (we output the names of 5 random passengers together with the new columns that were created).
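A minimal sketch of such a call is shown here. The small `data` frame is a made-up stand-in for the course data set (the passenger names and values are hypothetical), and `dtype=int` is passed so that the new columns contain `1`/`0`; recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical stand-in for the course data set (names and values are made up)
data = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C',
             'Passenger D', 'Passenger E', 'Passenger F'],
    'Embarked': ['S', 'C', 'Q', 'S', 'S', 'C'],
})

# Replace 'Embarked' with one indicator column per category.
# dtype=int forces 1/0 output instead of True/False.
dummies = pd.get_dummies(data, columns=['Embarked'], dtype=int)

# Output 5 random rows: the passenger name plus the new columns
print(dummies[['Name', 'Embarked_C', 'Embarked_Q', 'Embarked_S']].sample(5))
```

Because `.sample(5)` picks rows at random, the printed rows differ from run to run.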

Let's examine one of the possible outputs. Because the five rows are selected at random, each run can show a different combination of passengers.

Explanation:

As a result, the function split the `'Embarked'` column into three columns, `'Embarked_C'`, `'Embarked_Q'`, and `'Embarked_S'`, one for each of the three categories. Each passenger has exactly one category in the original `'Embarked'` column. For every passenger, the function puts `1` in the column that matches that passenger's category and `0` in the other two. Thus, each row contains `1` in just one of the new columns.
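The "exactly one `1` per row" property can be checked directly. The sketch below uses a short made-up `Series` of embarkation codes (not the course data) and assumes there are no missing values, as stated earlier in this chapter.

```python
import pandas as pd

# Made-up embarkation codes, used only for illustration
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category, with 1/0 values
indicators = pd.get_dummies(embarked, prefix='Embarked', dtype=int)
print(indicators)

# Sum across the columns of each row: every row should add up to 1
row_totals = indicators.sum(axis=1)
print(row_totals.eq(1).all())  # True
```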

• `pd.get_dummies()` - the function that converts categorical variables into dummy (indicator) variables with values `1` or `0`; recent pandas versions return `True`/`False` by default unless you pass `dtype=int`;
• `data` - the data frame that you want to transform;
• `columns = ['Embarked']` - the columns that hold the categorical variables you want to convert into dummy ones. Note that the column names must be passed as a list, even if there is only one column.
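To tie the parameters back to the original question of counting values per category, here is a small sketch on made-up data (the frame and its values are hypothetical). It first shows that passing a plain string to `columns` is rejected, then sums each dummy column to get the number of passengers in each category.

```python
import pandas as pd

# Hypothetical stand-in for the course data set
data = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C',
             'Passenger D', 'Passenger E'],
    'Embarked': ['S', 'C', 'Q', 'S', 'S'],
})

# A bare string is not accepted for `columns`; pandas raises a TypeError
try:
    pd.get_dummies(data, columns='Embarked')
    raised = False
except TypeError:
    raised = True
print(raised)

# With a list, the conversion works; dtype=int gives 1/0 values
encoded = pd.get_dummies(data, columns=['Embarked'], dtype=int)

# Summing a dummy column counts the rows that belong to that category
counts = encoded[['Embarked_C', 'Embarked_Q', 'Embarked_S']].sum()
print(counts)
```

The sums should match what `data['Embarked'].value_counts()` reports for the same column, which makes a convenient cross-check.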

Your task here is to convert the categorical column `'Sex'` into dummy variables, and then output the sum of the values in each category.
