Challenge: Classifying Inseparable Data
You will use the following dataset with two features:
import pandas as pd

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
print(df.head())
If you run the code below and take a look at the resulting scatter plot, you'll see that the dataset is not linearly separable:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')
plt.scatter(df['X1'], df['X2'], c=df['y'])
plt.show()
Let's use cross-validation to evaluate a simple logistic regression on this data:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv('https://codefinity-content-media.s3.eu-west-1.amazonaws.com/b71ff7ac-3932-41d2-a4d8-060e24b00129/circles.csv')

X = df[['X1', 'X2']]
y = df['y']

# Scale the features and fit a plain logistic regression on the full dataset
X = StandardScaler().fit_transform(X)
lr = LogisticRegression().fit(X, y)
y_pred = lr.predict(X)

# Color the points by the model's predictions to see where it goes wrong
plt.scatter(df['X1'], df['X2'], c=y_pred)
plt.show()

# Mean accuracy over the default 5 cross-validation folds
print(f'Cross-validation accuracy: {cross_val_score(lr, X, y).mean():.2f}')
As you can see, plain logistic regression is not suited for this task. Adding polynomial features can improve the model's performance, and GridSearchCV lets you find the optimal value of the C hyperparameter for better accuracy.
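For illustration, here is a minimal sketch of how GridSearchCV searches over candidate C values. It uses a small synthetic dataset from make_classification rather than the circles data, and the variable names and grid values are just placeholders.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Toy data (not the circles dataset) purely to show the API
X_toy, y_toy = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0)

# Each candidate C is evaluated with cross-validation; the best one is refit on all the data
grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.01, 0.1, 1, 10]})
grid.fit(X_toy, y_toy)

print(grid.best_params_)  # e.g. {'C': 1}
print(f'Best cross-validation accuracy: {grid.best_score_:.2f}')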
This task also uses the Pipeline class. You can think of it as a sequence of preprocessing steps applied one after another: its .fit_transform() method calls .fit_transform() on each step in order, feeding the output of one step into the next.
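To make that concrete, here is a minimal sketch of a transformer-only pipeline built from PolynomialFeatures and StandardScaler; the step names and the tiny array are purely illustrative.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

pipe_demo = Pipeline([
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),  # X1, X2, X1^2, X1*X2, X2^2
    ('scaler', StandardScaler()),                                # then standardize every column
])

X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# .fit_transform() runs fit_transform of 'poly' first, then of 'scaler'
print(pipe_demo.fit_transform(X_demo).shape)  # (3, 5)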
Swipe to start coding
You are given the dataset loaded as a DataFrame in the df variable.
- Create a pipeline that builds polynomial features of degree 2 from X and then scales them, and store the resulting pipeline in the pipe variable.
- Create a param_grid dictionary with the values [0.01, 0.1, 1, 10, 100] for the C hyperparameter.
- Initialize and train a GridSearchCV object and store the trained object in the grid_cv variable.
Solution