CatBoost
CatBoost is a gradient boosting framework known for three core innovations:
- Native categorical feature support: unlike most frameworks that require you to manually encode categorical variables, CatBoost lets you pass raw string or integer categories directly. The model handles encoding internally, preserving information and reducing the risk of information leakage.
- Ordered boosting to reduce overfitting: traditional boosting can overfit when calculating statistics for categorical features, because it may use information from the entire dataset, including the very row being encoded. CatBoost avoids this by processing the data in random permutations and, for each data point, using only the data that precedes it in the sequence to compute statistics. This mimics how the model encounters unseen data at prediction time and improves generalization (see the sketch after this list).
- Efficient categorical encoding: instead of expanding the feature space with one-hot encoding or using arbitrary label encodings, CatBoost uses target statistics and hash-based techniques. This approach efficiently captures complex interactions between categories and the target, reduces memory usage, and often leads to better predictive performance.
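The effect of ordered statistics is easiest to see in a small sketch. The function below is a simplified illustration, not CatBoost's internal implementation; the function name, the prior, and the smoothing constant `a` are illustrative choices. Each row is encoded using only the rows that precede it in a random permutation, so a row's own label never leaks into its own encoding:

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, a=1.0, seed=0):
    """Toy version of ordered target statistics (illustrative only)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    sums, counts = {}, {}              # running target sum / count per category
    encoded = np.empty(len(categories))
    for idx in order:                  # walk the rows in a random order
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean computed from *earlier* rows only
        encoded[idx] = (s + a * prior) / (c + a)
        # Update the running statistics only after encoding this row
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

colors = ["red", "blue", "red", "blue", "red"]
labels = [1, 0, 1, 1, 0]
print(ordered_target_stats(colors, labels))
```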
The example below shows the end-to-end workflow: raw categorical columns go straight into the model, with no manual encoding.

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Create a synthetic dataset with categorical features
data = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green", "red", "blue"],
    "size": ["S", "M", "L", "S", "M", "L", "M", "S"],
    "price": [10, 15, 7, 12, 14, 8, 13, 11],
    "label": [0, 1, 0, 1, 1, 0, 1, 0]
})

# Specify categorical feature names
cat_features = ["color", "size"]

# Split features and target
X = data[["color", "size", "price"]]
y = data["label"]

# Initialize CatBoostClassifier with minimal preprocessing
model = CatBoostClassifier(iterations=50, learning_rate=0.1, verbose=0)
model.fit(X, y, cat_features=cat_features)

# Predict on the training data
preds = model.predict(X)
print("Predictions:", preds)
```
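The `Pool` class imported above offers an equivalent way to bundle the features, labels, and categorical-feature specification into a single object. Reusing `X`, `y`, and `cat_features` from the example, the same fit can be written as:

```python
# A Pool bundles data, labels, and categorical features in one object
train_pool = Pool(X, label=y, cat_features=cat_features)

model = CatBoostClassifier(iterations=50, learning_rate=0.1, verbose=0)
model.fit(train_pool)
```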
CatBoost's native handling of categorical features provides several advantages over traditional one-hot encoding. One-hot encoding can dramatically increase the dimensionality of your dataset, especially when categorical features have many unique values, leading to slower training and higher memory usage. It also treats every category as an independent binary column, so any relationship between a category and the target has to be rediscovered by the model from scratch. In contrast, CatBoost's approach leverages target statistics and sophisticated encoding schemes that keep the feature space compact and can surface subtle patterns in categorical data, often resulting in better predictive performance with less feature engineering.
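To make the dimensionality point concrete, here is a small sketch (a toy illustration; the column with roughly 1,000 distinct city codes is an invented example) comparing the single raw column CatBoost consumes with the columns produced by pandas one-hot encoding:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# One categorical column with ~1,000 distinct values, e.g. a city code
cities = pd.DataFrame({"city": rng.integers(0, 1000, size=10_000).astype(str)})

one_hot = pd.get_dummies(cities, columns=["city"])
print("Raw columns passed to CatBoost:", cities.shape[1])   # 1
print("Columns after one-hot encoding:", one_hot.shape[1])  # ~1000
```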
Coding exercise
You are given a synthetic binary classification dataset. Your goal is to train a CatBoostClassifier using the scikit-learn–style API and evaluate its performance.
Follow the steps:
- Generate a classification dataset and split it into train/test sets.
- Initialize a `CatBoostClassifier` with the following parameters: `iterations=150`, `learning_rate=0.1`, `depth=6`, `random_state=42`, `verbose=False`.
- Train the model on the training data.
- Predict labels for the test data.
- Compute accuracy and store it in `accuracy_value`.
- Print the dataset shapes, model depth, and accuracy score.
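A minimal sketch of one way to complete these steps, assuming the dataset comes from scikit-learn's `make_classification` and an 80/20 split (the generator settings, sample counts, and split ratio here are illustrative assumptions, not specified by the exercise):

```python
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Generate a synthetic binary classification dataset and split it
# (n_samples, n_features, and test_size are assumed values)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize the classifier with the required parameters
model = CatBoostClassifier(
    iterations=150, learning_rate=0.1, depth=6,
    random_state=42, verbose=False
)

# Train on the training data and predict labels for the test data
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compute accuracy and store it in accuracy_value
accuracy_value = accuracy_score(y_test, y_pred)

# Print the dataset shapes, model depth, and accuracy score
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Model depth:", model.get_params()["depth"])
print("Accuracy:", accuracy_value)
```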