Start from the original `penguins.csv` and first remove the two rows with insufficient data, that is, the rows with two or more missing values. Then build one **preprocessing pipeline** that performs encoding, imputing, and scaling.
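
The row filter can be sketched as follows. This is a minimal illustration on a made-up DataFrame (the column values are hypothetical stand-ins for `penguins.csv`); the condition itself mirrors the one used in the test file for this challenge.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data comes from penguins.csv.
df = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Dream", "Torgersen"],
    "sex": ["MALE", np.nan, np.nan, "FEMALE"],
    "body_mass_g": [3750.0, 3800.0, np.nan, np.nan],
})

# Count missing values per row and keep only rows with fewer than two.
df_clean = df[df.isna().sum(axis=1) < 2]
print(df_clean)
```

Rows with a single missing value are kept on purpose, since the imputer in the pipeline is meant to fill those in later.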

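Only `'sex'` and `'island'` need encoding, so encode them with a `ColumnTransformer` and pass the remaining columns through unchanged (`remainder='passthrough'`). After that, apply `SimpleImputer` and `StandardScaler` to all features. The tests look for the names `ct` (the `ColumnTransformer`), `pipe` (the pipeline), and `X_transformed` (the pipeline's output).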

Here is a reminder of the `make_column_transformer()` and `make_pipeline()` functions you will use.
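
As a rough sketch of how the two helper functions fit together, the example below builds a transformer and a pipeline on a small made-up DataFrame. The toy data, the choice of `OneHotEncoder`, and the default `SimpleImputer()` settings are assumptions for illustration, not the challenge's reference solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for penguins.csv (values are made up).
df = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "body_mass_g": [3750.0, np.nan, 3250.0, 4200.0],
    "flipper_length_mm": [181.0, 186.0, 195.0, np.nan],
})

# make_column_transformer takes (transformer, columns) tuples.
# remainder="passthrough" keeps the columns that are not listed.
ct = make_column_transformer(
    (OneHotEncoder(), ["sex", "island"]),  # encoder choice is an assumption
    remainder="passthrough",
)

# make_pipeline chains the steps in order and names each step
# after its lowercased class name.
pipe = make_pipeline(ct, SimpleImputer(), StandardScaler())

X_transformed = pipe.fit_transform(df)
print([name for name, _ in pipe.steps])
print(X_transformed.shape)
```

Because `make_pipeline` names the steps automatically, no explicit step names are needed; the imputer and scaler see every column that leaves the `ColumnTransformer`, encoded or passed through.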

import unittest
import pandas as pd
import numpy as np

def _dynamic_test(test_case, condition, success_message, failure_message):
    if condition:
        test_case._testMethodName = success_message
        test_case.assertTrue(True, success_message)
    else:
        test_case._testMethodName = failure_message
        test_case.fail(failure_message)

class TestPipelineWithColumnTransformer(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        cls.df_raw = pd.read_csv(
            'https://codefinity-content-media.s3.eu-west-1.amazonaws.com/a65bbc96-309e-4df9-a790-a1eb8c815a1c/penguins.csv'
        )
        # Keep only rows with fewer than two missing values (the task's first step).
        cls.df_raw = cls.df_raw[cls.df_raw.isna().sum(axis=1) < 2]
        import user_code
        cls.user_code = user_code

    def test_import_pipeline(self):
        from sklearn.pipeline import Pipeline
        uc = self.user_code
        cond = isinstance(getattr(uc, 'pipe', None), (Pipeline,))
        _dynamic_test(
            self,
            cond,
            "Pipeline was created using make_pipeline",
            "`pipe` is missing or is not a Pipeline; create it with make_pipeline"
        )

    def test_columntransformer_used(self):
        from sklearn.compose import ColumnTransformer
        uc = self.user_code
        cond = isinstance(getattr(uc, 'ct', None), ColumnTransformer)
        _dynamic_test(
            self,
            cond,
            "ColumnTransformer was used for 'sex' and 'island' with remainder passthrough",
            "`ct` is missing or is not a ColumnTransformer"
        )

    def test_pipeline_steps(self):
        uc = self.user_code
        step_names = [name for name, _ in uc.pipe.steps]
        cond = ("columntransformer" in step_names
                and any("simpleimputer" in n for n in step_names)
                and any("standardscaler" in n for n in step_names))
        _dynamic_test(
            self,
            cond,
            "Pipeline contains ColumnTransformer, SimpleImputer, and StandardScaler",
            "Pipeline is missing a ColumnTransformer, SimpleImputer, or StandardScaler step"
        )

    def test_X_transformed_shape(self):
        uc = self.user_code
        X_arr = np.asarray(uc.X_transformed)
        cond = X_arr.shape[0] == self.df_raw.shape[0] and X_arr.ndim == 2
        _dynamic_test(
            self,
            cond,
            "X_transformed has correct number of rows and is 2D",
            "X_transformed has the wrong number of rows or is not 2D"
        )

if __name__ == "__main__":
    unittest.main()


test_code.py

Challenge: Creating a Pipeline

Solution
