Blending & Hybrid Models
Blending and stacking are two ensemble techniques that can further boost the predictive power of tree-based models such as CatBoost, XGBoost, and LightGBM. In blending, you combine the predictions of several different models—often trained with different algorithms or hyperparameters—by averaging or weighting their outputs. This approach leverages the strengths of each individual model and can help reduce overfitting by smoothing out their unique errors.
Simple stacking takes this idea a step further by training a new model (often called a meta-learner) on the outputs of the base models. This meta-learner tries to learn the best way to combine the base predictions. While stacking can be more powerful than blending, it is also more complex to set up and tune.
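Below is a minimal sketch of simple stacking, assuming CatBoost and LightGBM as base models and a logistic-regression meta-learner trained on a separate holdout split. The split sizes and model choices here are illustrative, not prescribed by the lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)
# Three-way split: fit base models, fit the meta-learner, then evaluate
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
X_meta, X_test, y_meta, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

base_models = [
    CatBoostClassifier(verbose=0, random_seed=42),
    LGBMClassifier(random_state=42),
]
for model in base_models:
    model.fit(X_train, y_train)

# Base-model probabilities become the meta-learner's input features
meta_features = np.column_stack([m.predict_proba(X_meta)[:, 1] for m in base_models])
test_features = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

meta_learner = LogisticRegression()
meta_learner.fit(meta_features, y_meta)

stack_pred = meta_learner.predict_proba(test_features)[:, 1]
print("Stacked AUC:", roc_auc_score(y_test, stack_pred))
```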
For tree-based models, blending is particularly attractive because CatBoost, LightGBM, and XGBoost each have unique strengths and may capture different aspects of the data. By blending their predictions, you can often achieve more robust and accurate results, especially on challenging datasets or in competitions.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Generate synthetic binary classification data
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Train CatBoost
catboost_model = CatBoostClassifier(verbose=0, random_seed=42)
catboost_model.fit(X_train, y_train)
catboost_pred = catboost_model.predict_proba(X_val)[:, 1]

# Train LightGBM
lgbm_model = LGBMClassifier(random_state=42)
lgbm_model.fit(X_train, y_train)
lgbm_pred = lgbm_model.predict_proba(X_val)[:, 1]

# Simple blending: average the predictions
blend_pred = (catboost_pred + lgbm_pred) / 2

# Compute AUC for individual and blended predictions
catboost_auc = roc_auc_score(y_val, catboost_pred)
lgbm_auc = roc_auc_score(y_val, lgbm_pred)
blend_auc = roc_auc_score(y_val, blend_pred)

print("CatBoost AUC:", catboost_auc)
print("LightGBM AUC:", lgbm_auc)
print("Blended AUC:", blend_auc)
```
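The same pattern extends to all three libraries and to weighted averages rather than a plain mean. The sketch below trains CatBoost, LightGBM, and XGBoost on the same split and blends them with hand-picked weights; the weights are illustrative assumptions, not tuned values.

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

# Same synthetic data and split as in the example above
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit all three libraries on the same training split
models = {
    "catboost": CatBoostClassifier(verbose=0, random_seed=42),
    "lightgbm": LGBMClassifier(random_state=42),
    "xgboost": XGBClassifier(random_state=42),
}
preds = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds[name] = model.predict_proba(X_val)[:, 1]

# Weighted blend; the weights are arbitrary illustrations, not tuned values
weights = {"catboost": 0.4, "lightgbm": 0.3, "xgboost": 0.3}
weighted_pred = sum(weights[name] * preds[name] for name in models)

print("Weighted blend AUC:", roc_auc_score(y_val, weighted_pred))
```

In practice the weights are usually chosen on a validation set, but even an equal-weight average is often a strong baseline.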
Blending and stacking are most beneficial when your base models are diverse and make different types of errors. This diversity can come from using different algorithms, hyperparameters, or even training on different data subsets. However, blending or stacking similar models can sometimes provide little to no improvement.
Overfitting is a potential pitfall, especially if you blend on the same data used to train your base models or if your meta-learner is too complex. Always evaluate ensemble approaches on a separate validation set to ensure genuine improvement.
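One common way to avoid this leakage when stacking is to build the meta-learner's training features from out-of-fold predictions, for example with scikit-learn's cross_val_predict. A minimal sketch, with the model and fold choices as assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)

base_models = [
    CatBoostClassifier(verbose=0, random_seed=42),
    LGBMClassifier(random_state=42),
]

# Out-of-fold probabilities: each row is predicted by a model that never saw it
oof_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-learner is trained on leakage-free features
meta_learner = LogisticRegression()
meta_learner.fit(oof_features, y)

# For test-time use, refit each base model on the full training data
# and feed their predictions through the fitted meta-learner.
```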
Swipe to start coding
You are given a binary classification dataset. Your goal is to:
- Train three different gradient boosting models: CatBoostClassifier, XGBClassifier, and LGBMClassifier.
- Predict probabilities on the test set.
- Blend all three models using simple averaging of probabilities.
- Compute accuracy and store it in accuracy_value.
- Print dataset shapes, model types, and blended accuracy.
Use CatBoost, XGBoost, LightGBM, and only sklearn-compatible APIs. No tuning, no loops besides blending logic.
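One possible approach is sketched below. The dataset loading, split, and variable names (X_train, X_test, y_train, y_test) are assumptions about the exercise environment; the exercise supplies its own data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Placeholder data; replace with the dataset provided by the exercise
X, y = make_classification(n_samples=2000, n_features=20, n_informative=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

cat_model = CatBoostClassifier(verbose=0, random_seed=42)
xgb_model = XGBClassifier(random_state=42)
lgbm_model = LGBMClassifier(random_state=42)

cat_model.fit(X_train, y_train)
xgb_model.fit(X_train, y_train)
lgbm_model.fit(X_train, y_train)
print("Models:", type(cat_model).__name__, type(xgb_model).__name__, type(lgbm_model).__name__)

# Predict positive-class probabilities on the test set
cat_proba = cat_model.predict_proba(X_test)[:, 1]
xgb_proba = xgb_model.predict_proba(X_test)[:, 1]
lgbm_proba = lgbm_model.predict_proba(X_test)[:, 1]

# Blend by simple averaging, then threshold at 0.5 for class labels
blend_proba = (cat_proba + xgb_proba + lgbm_proba) / 3
blend_pred = (blend_proba >= 0.5).astype(int)

accuracy_value = accuracy_score(y_test, blend_pred)
print("Blended accuracy:", accuracy_value)
```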