XGBoost | Commonly Used Boosting Models
Ensemble Learning
XGBoost

XGBoost (Extreme Gradient Boosting) is a popular and powerful machine learning algorithm for classification and regression tasks. It's an ensemble learning technique that belongs to the gradient-boosting family of algorithms. XGBoost is known for its efficiency, scalability, and effectiveness in handling various machine-learning problems.

Key features of XGBoost

  1. Gradient Boosting: XGBoost is a variant of gradient boosting with shallow decision trees as base models. These trees are created in a greedy manner by recursively partitioning the data based on the feature that leads to the best split;
  2. Regularization: XGBoost incorporates regularization techniques to prevent overfitting. It includes terms in the objective function that penalize complex models, which helps in better generalization;
  3. Objective Function: XGBoost optimizes an objective function that combines the loss function (e.g., mean squared error for regression, log loss for classification) and regularization terms. The algorithm seeks to find the best model that minimizes this objective function;
  4. Parallel and Distributed Computing: XGBoost is designed to be efficient and scalable. It utilizes parallel and distributed computing techniques to speed up the training process, making it suitable for large datasets.
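
To make the regularization and objective-function points above more concrete, here is a minimal sketch in plain Python. It assumes squared-error loss (gradient `g_i = pred_i - y_i`, hessian `h_i = 1`) and uses the commonly documented XGBoost-style formulas for the optimal leaf value and split gain. The function names `leaf_weight` and `split_gain`, the toy data, and the parameter values are illustrative assumptions, not part of the xgboost API.

```python
# Illustrative sketch: how an L2 penalty (lambda) and a per-leaf penalty
# (gamma) influence tree construction in XGBoost-style boosting.
# Assumption: squared-error loss, so g_i = pred_i - y_i and h_i = 1.

def leaf_weight(grad_sum, hess_sum, reg_lambda):
    """Optimal leaf value: w* = -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma):
    """Reduction of the regularized objective when one leaf is split in two."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    return 0.5 * (children - parent) - gamma


# Toy data: all current predictions are 0, targets form two groups
targets = [1.0, 1.2, 0.8, 3.0, 3.2]
preds = [0.0] * len(targets)
grads = [p - y for p, y in zip(preds, targets)]
hess = [1.0] * len(targets)

# A larger lambda shrinks the leaf value toward zero
w_plain = leaf_weight(sum(grads), sum(hess), reg_lambda=0.0)
w_reg = leaf_weight(sum(grads), sum(hess), reg_lambda=5.0)

# A larger gamma makes the same split unattractive (negative gain)
gain_free = split_gain(sum(grads[:3]), 3.0, sum(grads[3:]), 2.0,
                       reg_lambda=1.0, gamma=0.0)
gain_penalized = split_gain(sum(grads[:3]), 3.0, sum(grads[3:]), 2.0,
                            reg_lambda=1.0, gamma=1.0)

print(w_plain, w_reg)
print(gain_free, gain_penalized)
```

In this sketch, increasing `reg_lambda` pulls the leaf value toward zero, and increasing `gamma` turns a split with a small positive gain into one that would be rejected. Both effects limit model complexity.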

XGBoost's effectiveness lies in its ability to produce accurate predictions while managing issues like overfitting and underfitting. It has gained popularity in various machine-learning competitions and real-world applications due to its strong predictive performance and versatility.

Example

XGBoost is not part of the scikit-learn library, so it has to be installed separately. Run the following command in your terminal:
pip install xgboost
Once the installation is finished, we can use XGBoost in our code.

What is DMatrix?

Before we start working with the XGBoost ensemble model, we need to get familiar with a specific data structure: DMatrix.
In XGBoost, DMatrix is a data structure optimized for memory efficiency and training speed. It stores the dataset during training and prediction. It is a core concept in the xgboost library, designed to handle large datasets, and it serves as the input container for the training and testing data.

DMatrix example

```python
import xgboost as xgb
import numpy as np

# Sample data
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_train = np.array([0, 1, 0])

# Create DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train)
print(type(dtrain))
```

XGBoost usage example

```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create DMatrix objects for XGBoost
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set hyperparameters: multiclass classification that returns class labels
params = {
    'objective': 'multi:softmax',
    'num_class': 3
}

# Train the XGBoost classifier
model = xgb.train(params, dtrain)

# Make predictions (class labels returned as floats)
y_pred = model.predict(dtest)

# Calculate the weighted F1-score
f1 = f1_score(y_test, y_pred, average='weighted')
print(f'F1-score: {f1:.4f}')
```