Random Forest

Random Forest is a bagging ensemble algorithm that is used for both classification and regression tasks. The basic idea behind Random Forest is to create a "forest" of decision trees, where each tree is trained on a different subset of the data and provides its own prediction.

How Does Random Forest Work?

Bootstrapping and Data Subset: Each tree in the forest is trained on a random sample drawn from the original dataset via bootstrapping. Data points are selected with replacement, so each tree sees a different sample in which some points repeat and others are left out;
Decision Tree Construction: Each bootstrap sample is used to build an individual decision tree. The data is recursively divided using features and thresholds, forming binary splits that lead to leaf nodes containing predictions;
Random Feature Selection: At each split within a tree, only a random subset of features is considered. This randomness prevents any single feature from dominating the predictions and increases the diversity of the trees;
Prediction Aggregation: After training, each tree makes its own prediction for a data point. For classification, the trees' predictions are combined by hard or soft voting; for regression, they are averaged to produce the final outcome.
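
The four steps above can be sketched by hand with plain decision trees. This is only an illustrative sketch, not how scikit-learn implements Random Forest internally: the number of trees (25), the seeds, and the variable names are assumptions chosen for the example, and hard (majority) voting is used for the aggregation step.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)   # assumed seed, for reproducibility only
n_trees = 25                     # assumed value for illustration
n_samples = X_train.shape[0]

trees = []
for i in range(n_trees):
    # Step 1: bootstrap sample (row indices drawn with replacement)
    idx = rng.integers(0, n_samples, size=n_samples)
    # Steps 2-3: grow a tree that considers a random subset of
    # features (sqrt of the feature count) at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 4: hard voting, i.e. the most frequent class among the trees
all_preds = np.stack([t.predict(X_test) for t in trees])  # (n_trees, n_test)
y_pred = np.array(
    [np.bincount(col, minlength=3).argmax() for col in all_preds.T]
)

accuracy = (y_pred == y_test).mean()
print(f"Manual forest accuracy: {accuracy:.2f}")
```

In practice RandomForestClassifier does all of this in one call; the loop is only meant to make each step visible.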

Note an interesting property of Random Forest: each base model is trained on a random subset of the training set and also considers only a random subset of features when splitting. This makes the base models less correlated with one another, which typically leads to more accurate final predictions.

Example

Let's solve a classification task using Random Forest on the Iris dataset:


# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Load the Iris dataset
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Target variable

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

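# Create a Random Forest classifier with 100 trees, using all CPU cores
# (random_state is fixed so that the result is reproducible)
rf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)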

# Train the classifier on the training data
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

# Calculate the F1 score of the classifier
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1 Score: {f1:.2f}')
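
To see the aggregation step in action, the sketch below rebuilds the forest's output from its individual trees. It relies on scikit-learn's documented behavior that RandomForestClassifier averages the class probabilities predicted by its trees (soft voting); the fitted trees are available through the estimators_ attribute. The variable names and random_state=42 are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Collect class probabilities from every tree: (n_trees, n_test, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])

# Soft voting: average the probabilities over the trees,
# then pick the class with the highest mean probability
mean_proba = tree_probas.mean(axis=0)
manual_pred = rf.classes_[mean_proba.argmax(axis=1)]

# Compare the manual aggregation with the forest's own output
print(np.allclose(mean_proba, rf.predict_proba(X_test)))
print(np.array_equal(manual_pred, rf.predict(X_test)))
```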
