Summary  
This chapter explains how to calculate the percentage of missing values in each DataFrame column and remove columns with too many NaNs using pandas’ drop(columns=…, inplace=True) method.  

General domain of usage  
Data preprocessing in data analysis

In this video, you will learn how to handle columns with a large number of missing values in a pandas DataFrame. You will see why it is sometimes better to remove an entire column, like 'Cabin' in the Titanic dataset, rather than trying to fill in missing values. The video demonstrates how to use the `.drop()` method in pandas to delete columns, explains the key arguments like `columns` and `inplace`, and walks through a practical example using real data. By the end, you will know how to identify and efficiently remove columns that are not useful for your analysis due to excessive NaN values.

In the previous chapter, you received the result:

|Column|Missing values|
|---|---|
|PassengerId|0|
|Survived|0|
|Pclass|0|
|Name|0|
|Sex|0|
|Age|86|
|SibSp|0|
|Parch|0|
|Ticket|0|
|Fare|1|
|Cabin|327|
|Embarked|0|


The dataset has 418 rows. Look at the `Cabin` column, which has `327` missing values, roughly 78% of all rows. Filling them in makes little sense: with so little real information left, most of the column would be invented. Deleting only the rows that contain missing values is not an option either, because that would remove 327 of the 418 rows. In this case, the best solution is to delete the whole column. Let's figure out how to do this.

To delete a column, apply the `.drop()` method to the DataFrame. The syntax is as follows:
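Before deciding to drop a column, it can help to express the missing values as a share of all rows rather than as raw counts. The sketch below is only an illustration: the small `data` DataFrame is a made-up stand-in for the Titanic table, and the 50% cut-off stored in `threshold` is an assumed value, not a rule from this course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic table (not the real dataset).
data = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Fare": [7.25, 71.28, 8.05, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

# isna() marks missing cells as True; the mean of a boolean column
# is the fraction of True values, so multiplying by 100 gives a percentage.
missing_share = data.isna().mean() * 100

# Assumed cut-off: flag columns where more than half of the values are missing.
threshold = 50
mostly_missing = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(mostly_missing)
```

In this toy table only `Cabin` exceeds the cut-off, which mirrors the situation in the real data, where 327 of 418 values are missing.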

```python
# If you want to delete one column
data.drop(columns = 'column_name', inplace = True)

# If you want to delete several columns
data.drop(columns = ['column_1', 'column_2'], inplace = True)
```

**Explanation:**
- `.drop()` - a method that removes rows or columns from a DataFrame; with the `columns` argument, it removes columns;
- `columns = 'column_name'` or `columns = ['column_1', 'column_2']` - an argument of the method, where you specify the name or the list of names of the columns that you want to delete;
- `inplace = True` - an argument that applies the change directly to the existing DataFrame instead of returning a modified copy. Some other pandas methods accept it too; we will learn some of them later on.
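To make the effect of `inplace` concrete, here is a minimal sketch that drops the same column with and without it. The tiny `data` DataFrame and the names `without_cabin` and `result` are made up for illustration and are not part of the task.

```python
import pandas as pd

# Hypothetical toy data (not the real Titanic dataset).
data = pd.DataFrame({
    "Name": ["Ann", "Bob", "Eve"],
    "Age": [22, 35, 41],
    "Cabin": [None, "C85", None],
})

# Without inplace, drop() returns a new DataFrame and leaves `data` untouched.
without_cabin = data.drop(columns="Cabin")
print("Cabin" in data.columns)
print("Cabin" in without_cabin.columns)

# With inplace=True, drop() modifies `data` itself and returns None,
# so the result should not be assigned back to `data`.
result = data.drop(columns="Cabin", inplace=True)
print(result)
print(list(data.columns))
```

A common mistake is writing `data = data.drop(columns='Cabin', inplace=True)`, which replaces the DataFrame with `None`. Use either reassignment or `inplace=True`, not both.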

test_code.py

```python
import contextlib
import io
import unittest

import pandas as pd


def _dynamic_test(test_case, condition, success_msg, failure_msg):
    """Pass or fail the test, showing a readable message as the test name."""
    if condition:
        test_case._testMethodName = success_msg
        test_case.assertTrue(True, success_msg)
    else:
        test_case._testMethodName = failure_msg
        test_case.fail(failure_msg)


class TestDropCabin(unittest.TestCase):
    def test_cabin_column_removed(self):
        """
        1. Check that the column 'Cabin' was removed from the DataFrame.
        """
        import user_code

        # load original dataset (kept for reference; not used in the check below)
        url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/4bf24830-59ba-4418-969b-aaf8117d522e/titanic_0.csv"
        df_original = pd.read_csv(url)

        # after user modifications
        assert hasattr(user_code, "data"), "Variable 'data' not found."
        df_user = user_code.data

        # Check the type first so that .columns is only read on a DataFrame.
        condition = isinstance(df_user, pd.DataFrame) and "Cabin" not in df_user.columns
        _dynamic_test(
            self,
            condition,
            "The column 'Cabin' has been successfully removed from the DataFrame.",
            "The 'Cabin' column was not removed. Use data.drop(columns='Cabin', inplace=True)."
        )


class TestOutput(unittest.TestCase):
    def test_sample_output(self):
        """
        2. Check that 5 random rows of the modified DataFrame are printed.
        """
        import user_code

        # Capture stdout; redirect_stdout restores it even if printing raises.
        captured_output = io.StringIO()
        with contextlib.redirect_stdout(captured_output):
            # run the print statement again if needed
            if hasattr(user_code, "data"):
                print(user_code.data.sample(5))

        output_text = captured_output.getvalue().strip()
        condition = len(output_text) > 0 and len(output_text.splitlines()) > 1
        _dynamic_test(
            self,
            condition,
            "The output displays 5 random rows of the modified DataFrame.",
            "The output is missing or incorrect. Use print(data.sample(5))."
        )


if __name__ == "__main__":
    unittest.main()
```

This course covers many functions that are useful for a future data analyst. You will learn different ways of extracting data and how to set conditions on it. After that, you will become familiar with methods of grouping data and learn how to preprocess it. Each section uses its own dataset, which keeps the course engaging.

This section will teach you how to output specific columns by their names or indices. You will also get acquainted with the ways to select rows by indices.

Here, you will learn how to extract data that meets specific conditions. You will also learn how to combine conditions and even create your own.

In this section, you will expand your knowledge on setting different data conditions. You will learn to check if your data is in a defined list of values or between two values. You will also learn how to find the largest and smallest values.

This section is one of the most fascinating of the course. Here, you will learn how to group data in different ways. It will help you work as a data analyst to find out information on specific data groups.

This section is one of the most significant for a data analyst, because data that contains missing values or values in an incorrect format is very hard to work with. Here, you will learn how to deal with such values.

What Will We Do With the NaN Values?
