Learn XGBoost | Framework Deep Dive
Advanced Tree-Based Models

XGBoost

XGBoost is a leading implementation of gradient boosted decision trees, known for its efficiency and scalability. It minimizes a loss function by using both the gradient (first derivative) and hessian (second derivative), enabling more informed tree splits and better optimization.
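To make the gradient/hessian idea concrete, here is a minimal sketch (not part of the lesson's original code) that hand-writes a squared-error objective returning exactly the two quantities XGBoost uses to score splits, and passes it to the native xgb.train API. The function name squared_error_obj and the tiny synthetic dataset are illustrative assumptions.

# Illustrative sketch: a custom squared-error objective returning the
# gradient and hessian that XGBoost uses when evaluating candidate splits.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=42)
dtrain = xgb.DMatrix(X, label=y)

def squared_error_obj(preds, dtrain):
    labels = dtrain.get_label()
    grad = preds - labels          # first derivative of 0.5 * (pred - label)^2
    hess = np.ones_like(preds)     # second derivative is constant (1.0)
    return grad, hess

booster = xgb.train(
    {"max_depth": 3, "eta": 0.1},
    dtrain,
    num_boost_round=20,
    obj=squared_error_obj
)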

XGBoost features strong regularization: lambda (L2 regularization) and alpha (L1 regularization) control model complexity and help prevent overfitting by penalizing large leaf weights.
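In the scikit-learn wrapper these penalties are exposed as reg_lambda (L2) and reg_alpha (L1). The sketch below is only an illustration of where the parameters go; the penalty values are arbitrary and would normally be tuned.

# Hypothetical configuration showing explicit L1/L2 penalties on leaf weights.
from xgboost import XGBClassifier

regularized_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=10.0,   # L2 penalty (lambda): shrinks leaf weights toward zero
    reg_alpha=1.0,     # L1 penalty (alpha): can zero out weak leaf weights
    random_state=42,
    verbosity=0
)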

Its sparsity-aware split finding handles missing values and explicit zeros by learning the optimal path for missing data, making XGBoost robust and efficient with incomplete or sparse datasets.
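A short sketch of what this means in practice: XGBoost can be trained directly on arrays containing NaN, learning a default direction for missing values at each split, so no imputation step is required. The 10% missingness rate below is an arbitrary choice for illustration.

# Sketch: training directly on data with missing values (NaN).
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.1      # randomly blank out ~10% of the entries
X_missing = X.copy()
X_missing[mask] = np.nan

model = XGBClassifier(n_estimators=50, max_depth=3, verbosity=0)
model.fit(X_missing, y)               # NaN entries are routed by learned defaults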

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# 1) Generate a small synthetic dataset
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# 2) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3) Create a simple XGBoost model
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    verbosity=0
)

# 4) Fit the model
model.fit(X_train, y_train)

# 5) Predict and evaluate
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print("Test accuracy:", acc)

In this example, we train an XGBoost classifier using the scikit-learn interface, which provides an intuitive .fit() and .predict() workflow. The key parameters used are: n_estimators=100, which sets how many boosting rounds (trees) the model will build; learning_rate=0.1, which controls how much each new tree contributes to correcting previous errors (smaller values make learning more stable but require more trees); and max_depth=3, which defines how deep each decision tree can grow, influencing model complexity and overfitting. The training process is performed with model.fit(X_train, y_train), where XGBoost iteratively builds trees that minimize predictive error, and predictions are obtained via model.predict(X_test). Finally, we compute accuracy with accuracy_score, which measures how often the model correctly predicts class labels. This small example demonstrates how XGBoost’s core boosting mechanism, combined with just a few essential hyperparameters, can produce a strong baseline model with minimal setup.
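To see the learning_rate / n_estimators trade-off described above, one illustrative sketch (reusing X_train, X_test, y_train, y_test from the example above) compares a larger learning rate with few trees against a smaller learning rate with many trees. The exact accuracies depend on the data; the two configurations are arbitrary examples.

# Illustration of the learning_rate vs. n_estimators trade-off.
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

for lr, n_trees in [(0.3, 50), (0.05, 300)]:
    clf = XGBClassifier(n_estimators=n_trees, learning_rate=lr,
                        max_depth=3, random_state=42, verbosity=0)
    clf.fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    print(f"learning_rate={lr}, n_estimators={n_trees}: accuracy={acc:.3f}")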

Task


You are given a regression dataset. Your task is to:

  1. Load the dataset and split it into train/test sets.
  2. Initialize an XGBRegressor with the following parameters:
    • n_estimators=200.
    • learning_rate=0.05.
    • max_depth=4.
    • subsample=0.8.
    • random_state=42.
  3. Train the model.
  4. Predict on the test set.
  5. Compute Mean Squared Error (MSE) and store it in mse_value.
  6. Print dataset shapes, model parameters, and the final MSE.

Solution
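A minimal sketch of one possible solution. Since the course dataset itself is not shown here, a synthetic dataset from make_regression stands in for it; the variable name mse_value follows the task statement.

# Sketch of the regression task, assuming a synthetic stand-in dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 1) Load the dataset and split it into train/test sets
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Initialize the regressor with the required parameters
model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    random_state=42,
    verbosity=0
)

# 3) Train the model
model.fit(X_train, y_train)

# 4) Predict on the test set
preds = model.predict(X_test)

# 5) Compute MSE and store it in mse_value
mse_value = mean_squared_error(y_test, preds)

# 6) Print dataset shapes, model parameters, and the final MSE
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Model parameters:", model.get_params())
print("MSE:", mse_value)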
