Ensemble Learning
1. Basic Principles of Building Ensemble Models
Random Forest
Random Forest is a bagging ensemble algorithm that is used for both classification and regression tasks. The basic idea behind Random Forest is to create a "forest" of decision trees, where each tree is trained on a different subset of the data and provides its own prediction.
How does Random Forest work?
- Bootstrapping and Data Subset: Each tree in the forest is trained using a random subset drawn from the original dataset via bootstrapping. This process involves selecting data points with replacement, creating diverse subsets for each tree.
- Decision Tree Construction: Each bootstrap subset is used to build an individual decision tree. The data is recursively divided using features and thresholds, forming binary splits that lead to leaf nodes containing predictions.
- Random Feature Selection: Within each tree, only a random subset of features is considered for creating splits. This randomness prevents single features from overpowering predictions and enhances tree diversity.
- Prediction Aggregation: After training, each tree makes its own prediction for a data point. For classification, the trees' predictions are combined by hard or soft voting; for regression, they are averaged to give the final outcome.
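The bootstrapping, feature-selection, and aggregation steps above can be sketched in plain Python. This is only an illustration of the sampling and voting mechanics, not a full Random Forest: no trees are built, and the helper names (bootstrap_sample, random_feature_subset, hard_vote, average) as well as the fixed seed are assumptions made for this example.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: some rows repeat, some are left out."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider when creating a split."""
    return sorted(rng.sample(range(n_features), k))


def hard_vote(labels):
    """Classification: return the label predicted by the most trees."""
    return Counter(labels).most_common(1)[0][0]


def average(values):
    """Regression: return the mean of the trees' predictions."""
    return sum(values) / len(values)


rng = random.Random(0)   # fixed seed (an assumption, for repeatability only)
rows = list(range(10))   # stand-in for the indices of ten training rows

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=4, k=2, rng=rng)

print(sample)    # ten indices, usually with repeats
print(features)  # two of the four feature indices
print(hard_vote(["setosa", "virginica", "setosa"]))  # setosa
print(average([2.0, 3.0, 4.0]))                      # 3.0
```

In a real forest, each tree would be fitted on its own bootstrap sample, and a fresh feature subset would be drawn for the splits inside that tree.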
This reveals an interesting property of Random Forest: each base model is trained not only on a random subset of the training set, but also uses only a random subset of features when choosing its splits. This makes the base models more independent of one another, which tends to make the final predictions more accurate.
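As a rough illustration of these two sources of randomness, the sketch below fits a small forest with scikit-learn and inspects its base models. The parameter values (n_estimators=10, max_features="sqrt", random_state=0) are assumptions chosen for the example, not settings taken from the lesson.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(
    n_estimators=10,      # a small forest, just for inspection
    bootstrap=True,       # each tree is fitted on a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of the features
    random_state=0,       # assumption: fixed seed for repeatability
)
forest.fit(X, y)

# The fitted base models are ordinary decision trees.
first_tree = forest.estimators_[0]
print(type(first_tree).__name__)  # DecisionTreeClassifier
print(len(forest.estimators_))    # 10

# The forest's class probabilities are the average of the trees' probabilities.
mean_tree_proba = np.mean([t.predict_proba(X) for t in forest.estimators_], axis=0)
print(np.allclose(mean_tree_proba, forest.predict_proba(X)))
```

The last check shows that scikit-learn's classifier aggregates by averaging the per-tree class probabilities, which corresponds to soft voting.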
Example
Let's solve a classification task using Random Forest on the Iris dataset:
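The listing itself is not included in this text, so what follows is a minimal sketch that follows the steps given in the code description. The random_state=42 arguments are an assumption added for reproducibility; everything else uses the parameter values named in the description.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset and separate the features from the target.
iris = load_iris()
X = iris.data
y = iris.target

# Hold out 20% of the data for testing (random_state is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Build a forest of 100 trees, training on all available processors.
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
clf.fit(X_train, y_train)

# Predict labels for the test set.
y_pred = clf.predict(X_test)

# Evaluate with the weighted F1 score.
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1 score: {f1:.3f}')
```

The exact score depends on the train/test split, so it will change if a different random_state is used or if it is left unset.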
Code Description
- Import the necessary modules from scikit-learn:
  - load_iris: Used to load the Iris dataset.
  - train_test_split: Used to split the dataset into training and testing sets.
  - RandomForestClassifier: The classifier we'll be using, which is part of the ensemble module.
  - f1_score: The function to calculate the F1 score for model evaluation.
- Load the Iris dataset using load_iris.
  - Extract the features into X and the target variable into y.
- Split the data into training and testing sets using train_test_split.
  - test_size=0.2 specifies that 20% of the data will be used for testing.
- Create a RandomForestClassifier with n_estimators=100 (number of trees in the forest) and n_jobs=-1 (to train the model using all processors in parallel).
  - Train the classifier using the training data (features and target) with the .fit() method.
- Predict labels for the test data (X_test).
  - Store the predicted labels in y_pred.
- Calculate the F1 score using the f1_score() function.
  - The average='weighted' parameter indicates that the F1 score sho
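To make the average='weighted' option more concrete, here is a small pure-Python sketch of how a weighted F1 score can be computed: the F1 score of each class is weighted by that class's share of the true labels. The helper names (f1_per_class, weighted_f1) and the toy labels are made up for this illustration. In practice, scikit-learn's f1_score does this work.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating that class as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    return sum(
        f1_per_class(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


# Toy labels (made up): one sample of class 0 is misclassified as class 1.
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]

print(round(weighted_f1(toy_true, toy_pred), 3))  # 0.833
```

Here the per-class scores are 0.8, 0.8, and 1.0 with supports 3, 2, and 1, so the weighted result is (0.8 * 3 + 0.8 * 2 + 1.0 * 1) / 6.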
What model is used as a base model in Random Forest?