Summary  
This chapter explains how to calculate the percentage of missing values in each DataFrame column and remove columns with too many NaNs using pandas’ drop(columns=…, inplace=True) method.  

General domain of usage  
Data preprocessing in data analysis

In this video, you will learn how to handle columns with a large number of missing values in a pandas DataFrame. You will see why it is sometimes better to remove an entire column, like 'Cabin' in the Titanic dataset, rather than trying to fill in missing values. The video demonstrates how to use the `.drop()` method in pandas to delete columns, explains the key arguments like `columns` and `inplace`, and walks through a practical example using real data. By the end, you will know how to identify and efficiently remove columns that are not useful for your analysis due to excessive NaN values.

In the previous chapter, we obtained the following result:

|Column|Missing values|
|---|---|
|PassengerId|0|
|Survived|0|
|Pclass|0|
|Name|0|
|Sex|0|
|Age|86|
|SibSp|0|
|Parch|0|
|Ticket|0|
|Fare|1|
|Cabin|327|
|Embarked|0|


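This dataset has 418 rows, and the `Cabin` column has `327` missing values, roughly 78% of the rows. With so little information left in the column, filling in the gaps would not be meaningful. The better solution here is to delete the column, since it is of no use to us. Deleting only the rows that contain missing values is not a realistic alternative, because that would remove 327 of the 418 rows. Let's see how to delete a column.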
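To decide whether a column is too sparse to keep, it can help to look at the share of missing values rather than the raw count. The following is a minimal sketch on a small made-up DataFrame; the name `toy`, its values, and the 0.5 cutoff are assumptions chosen for illustration, not part of the Titanic data or the lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 41.0],
    "Fare": [7.25, 8.05, 13.0, 26.55],
    "Cabin": [np.nan, np.nan, "C85", np.nan],
})

# Number of missing values per column
missing_count = toy.isna().sum()

# Share of missing values per column (mean of a boolean mask)
missing_share = toy.isna().mean()

# Assumed cutoff: flag columns where more than half the rows are missing
threshold = 0.5
too_sparse = missing_share[missing_share > threshold].index.tolist()

print(missing_count)
print(missing_share)
print(too_sparse)
```

Applied to the real data, the same calculation gives 327 / 418, about 0.78, for `Cabin`, which is why the column is a candidate for removal.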

To delete a column, apply the `.drop()` method to the dataset. The syntax is as follows:

```python
# If you want to delete one column
data.drop(columns = 'column_name', inplace = True)

# If you want to delete several columns
data.drop(columns = ['column_1', 'column_2'], inplace = True)
```

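**Explanation:**
- `.drop()` - the method that removes the specified columns;
- `columns = 'column_name'` or `columns = ['column_1', 'column_2']` - the argument that names the column or columns to delete;
- `inplace = True` - a pandas argument that applies the change directly to the DataFrame instead of returning a modified copy. Several other methods accept it as well, and you will meet some of them later.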
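The effect of `inplace` is easiest to see side by side. This is a small sketch on a made-up DataFrame (the name `toy` and its values are hypothetical): without `inplace`, `.drop()` returns a new DataFrame and leaves the original alone; with `inplace=True`, the original is changed and the call returns `None`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Fare": [7.83, 7.0, 9.69],
    "Cabin": [None, None, "B45"],
})

# Without inplace: a new DataFrame is returned, `toy` keeps all its columns
trimmed = toy.drop(columns="Cabin")
cols_before = toy.columns.tolist()

# With inplace=True: `toy` itself loses the column and the call returns None
result = toy.drop(columns="Cabin", inplace=True)
cols_after = toy.columns.tolist()

print(trimmed.columns.tolist())
print(cols_before)
print(cols_after)
print(result)
```

Because the in-place call returns `None`, writing `data = data.drop(columns='Cabin', inplace=True)` would replace the DataFrame with `None`; use either the assignment form or `inplace=True`, not both.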

```python
# test_code.py
import io
import unittest
from contextlib import redirect_stdout

import pandas as pd

# Original dataset for the exercise (kept for reference; these checks do not download it)
DATASET_URL = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/4bf24830-59ba-4418-969b-aaf8117d522e/titanic_0.csv"


def _dynamic_test(test_case, condition, success_msg, failure_msg):
    """Pass or fail the test, using the matching message as the reported test name."""
    if condition:
        test_case._testMethodName = success_msg
        test_case.assertTrue(True, success_msg)
    else:
        test_case._testMethodName = failure_msg
        test_case.fail(failure_msg)


class TestDropCabin(unittest.TestCase):
    def test_cabin_column_removed(self):
        """
        1. Check that the column 'Cabin' was removed from the DataFrame.
        """
        import user_code

        # The learner's script must expose the modified DataFrame as `data`
        self.assertTrue(hasattr(user_code, "data"), "Variable 'data' not found.")
        df_user = user_code.data

        # Check the type first so `.columns` is only read on a DataFrame
        condition = isinstance(df_user, pd.DataFrame) and "Cabin" not in df_user.columns
        _dynamic_test(
            self,
            condition,
            "The column 'Cabin' has been successfully removed from the DataFrame.",
            "The 'Cabin' column was not removed. Use data.drop(columns='Cabin', inplace=True)."
        )


class TestOutput(unittest.TestCase):
    def test_sample_output(self):
        """
        2. Check that 5 random rows of the modified DataFrame are printed.
        """
        import user_code

        # Re-run the expected print statement and capture what it writes;
        # redirect_stdout restores sys.stdout even if sample() raises
        captured_output = io.StringIO()
        with redirect_stdout(captured_output):
            if hasattr(user_code, "data"):
                print(user_code.data.sample(5))

        output_text = captured_output.getvalue().strip()
        condition = len(output_text) > 0 and len(output_text.splitlines()) > 1
        _dynamic_test(
            self,
            condition,
            "The output displays 5 random rows of the modified DataFrame.",
            "The output is missing or incorrect. Use print(data.sample(5))."
        )


if __name__ == "__main__":
    unittest.main()
```


