Ensemble Learning
Random Forest
Random Forest is a bagging ensemble algorithm that is used for both classification and regression tasks. The basic idea behind Random Forest is to create a "forest" of decision trees, where each tree is trained on a different subset of the data and provides its own prediction.
How does Random Forest work?
- Bootstrapping and Data Subset: Each tree in the forest is trained on a random subset drawn from the original dataset via bootstrapping. Data points are selected with replacement, so every tree sees a slightly different sample;
- Decision Tree Construction: Each subset is used to build an individual decision tree. The data is recursively divided using features and thresholds, forming binary splits that lead to leaf nodes containing predictions;
- Random Feature Selection: At each split within a tree, only a random subset of features is considered. This randomness prevents a single strong feature from dominating every tree and increases tree diversity;
- Prediction Aggregation: After training, each tree makes its own prediction for a data point. For classification, the predictions are combined by hard voting (majority class) or soft voting (averaged class probabilities); for regression, the predictions are averaged to give the final outcome.
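The four steps above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn's `RandomForestClassifier` is implemented internally; the helper names `fit_forest`, `hard_vote` and `soft_vote`, the choice of 25 trees and the `"sqrt"` feature rule are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees decision trees, each on its own bootstrap sample (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrapping: draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        # Random feature selection: each split considers only sqrt(n_features) candidates
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(0, 2**31 - 1)),
        )
        # Decision tree construction on the bootstrap sample
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def hard_vote(trees, X, classes):
    """Hard voting: every tree casts one vote, the majority class wins."""
    votes = np.zeros((X.shape[0], len(classes)), dtype=int)
    rows = np.arange(X.shape[0])
    for tree in trees:
        cols = np.searchsorted(classes, tree.predict(X))
        votes[rows, cols] += 1
    return classes[np.argmax(votes, axis=1)]


def soft_vote(trees, X, classes):
    """Soft voting: sum the class probabilities, pick the most probable class."""
    proba = np.zeros((X.shape[0], len(classes)))
    for tree in trees:
        # A tree only knows the classes present in its own bootstrap sample
        cols = np.searchsorted(classes, tree.classes_)
        proba[:, cols] += tree.predict_proba(X)
    # Summing instead of averaging does not change the argmax
    return classes[np.argmax(proba, axis=1)]


X, y = load_iris(return_X_y=True)
classes = np.unique(y)
trees = fit_forest(X, y)
hard_pred = hard_vote(trees, X, classes)
soft_pred = soft_vote(trees, X, classes)
print("Hard-vote training accuracy:", np.mean(hard_pred == y))
print("Soft-vote training accuracy:", np.mean(soft_pred == y))
```

Because the accuracy here is measured on the training data, it only shows that the pieces fit together; it says nothing about how well the forest generalizes.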
Notice an interesting property of Random Forest: each base model is trained not only on a random subset of the training set, but also on a random subset of features. This makes the base models less correlated with one another, which usually leads to more accurate final predictions.
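To see why independence matters, consider an idealized case that goes beyond what the text states: every base model's prediction has the same variance `sigma2`, and every pair of models has the same correlation `rho`. Under these assumptions, the variance of the averaged prediction is `rho * sigma2 + (1 - rho) * sigma2 / n_models`. The function name `ensemble_variance` is hypothetical.

```python
def ensemble_variance(sigma2, rho, n_models):
    """Variance of the mean of n_models predictions, each with variance sigma2
    and pairwise correlation rho (idealized, equal-correlation assumption)."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models


for rho in (0.0, 0.3, 0.9):
    var = ensemble_variance(1.0, rho, 100)
    print(f"rho={rho:.1f}: variance of a 100-model average = {var:.3f}")
# rho=0.0: variance of a 100-model average = 0.010
# rho=0.3: variance of a 100-model average = 0.307
# rho=0.9: variance of a 100-model average = 0.901
```

Adding more models shrinks only the second term, so once the ensemble is large, lowering the correlation between trees is what keeps reducing the variance.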
Example
Let's solve a classification task using Random Forest on the Iris dataset:
    # Import necessary libraries
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score

    # Load the Iris dataset
    iris = load_iris()
    X = iris.data    # Features
    y = iris.target  # Target variable

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create a Random Forest classifier
    rf_classifier = RandomForestClassifier(n_estimators=100, n_jobs=-1)

    # Train the classifier on the training data
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)

    # Calculate the F1 score of the classifier
    f1 = f1_score(y_test, y_pred, average='weighted')
    print(f'F1 Score: {f1:.2f}')
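As a follow-up, we can look inside a fitted forest to see the aggregation step at work. The sketch below averages the class probabilities of the individual trees, exposed through the `estimators_` attribute, and compares the result with the forest's own `predict_proba`. The variable names and `random_state=42` are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Collect the class probabilities predicted by every fitted tree:
# shape (n_trees, n_test_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])

# Soft voting: average the probabilities over the trees
mean_proba = tree_probas.mean(axis=0)

# Compare the manual average with the forest's own probabilities
matches = np.allclose(mean_proba, forest.predict_proba(X_test))
print("Manual average matches forest.predict_proba:", matches)

# The predicted class is the one with the highest averaged probability
manual_pred = forest.classes_[np.argmax(mean_proba, axis=1)]
print("First five manual predictions:", manual_pred[:5])
```

The comparison should print `True`, since scikit-learn's Random Forest classifier combines its trees by averaging their probability estimates, which is soft voting.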