Learn XGBoost | Framework Deep Dive
Advanced Tree-Based Models

XGBoost

XGBoost is a leading implementation of gradient boosted decision trees, known for its efficiency and scalability. It minimizes a loss function by using both the gradient (first derivative) and hessian (second derivative), enabling more informed tree splits and better optimization.
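To make this concrete, here is a minimal numerical sketch (illustrative only, not XGBoost's internal code) of how per-sample gradients and hessians of the logistic loss combine into the closed-form leaf weight -G / (H + lambda) that drives split scoring:

import numpy as np

# Logistic loss: for a raw score s and label y in {0, 1},
# p = sigmoid(s), gradient g = p - y, hessian h = p * (1 - p).
def grad_hess_logistic(raw_score, y):
    p = 1.0 / (1.0 + np.exp(-raw_score))
    g = p - y            # first derivative of the loss w.r.t. the score
    h = p * (1.0 - p)    # second derivative of the loss w.r.t. the score
    return g, h

# Toy leaf: current raw scores and labels of the samples falling into it
raw_scores = np.array([0.2, -0.5, 0.1, 0.8])
labels = np.array([1, 0, 0, 1])

g, h = grad_hess_logistic(raw_scores, labels)
G, H = g.sum(), h.sum()

lam = 1.0  # L2 regularization (lambda)
optimal_leaf_weight = -G / (H + lam)          # optimum of the second-order objective
score_contribution = 0.5 * G**2 / (H + lam)   # term used when scoring candidate splits

print("G =", G, "H =", H)
print("optimal leaf weight:", optimal_leaf_weight)
print("score contribution:", score_contribution)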

XGBoost features strong regularization: lambda (L2 regularization) and alpha (L1 regularization) control model complexity and help prevent overfitting by penalizing large leaf weights.
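In the scikit-learn interface these penalties are exposed as reg_lambda and reg_alpha. Below is a hedged sketch of setting them on a classifier; the concrete values are placeholders, and good settings depend on the dataset:

from xgboost import XGBClassifier

# reg_lambda (L2) and reg_alpha (L1) are the scikit-learn-API names for the
# native lambda / alpha parameters; larger values shrink leaf weights harder.
regularized_model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    reg_lambda=5.0,   # stronger L2 penalty on leaf weights
    reg_alpha=0.5,    # mild L1 penalty, can push some leaf weights to zero
    random_state=42,
    verbosity=0
)
# regularized_model.fit(X_train, y_train)  # reuses the train split from the example below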

Its sparsity-aware split finding handles missing values and explicit zeros by learning the optimal path for missing data, making XGBoost robust and efficient with incomplete or sparse datasets.
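As a quick illustration (a sketch on synthetic data, not part of the original lesson), XGBoost can be fit directly on an array containing np.nan, with each split learning a default direction for missing values:

import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Knock out ~20% of the entries to simulate missing data
rng = np.random.default_rng(42)
mask = rng.random(X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# No imputation needed: each split learns where to send missing values
model = XGBClassifier(n_estimators=100, max_depth=3, random_state=42, verbosity=0)
model.fit(X_missing, y)
print("Training accuracy with 20% missing entries:", model.score(X_missing, y))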

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# 1) Generate a small synthetic dataset
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# 2) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3) Create a simple XGBoost model
model = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    verbosity=0
)

# 4) Fit the model
model.fit(X_train, y_train)

# 5) Predict and evaluate
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print("Test accuracy:", acc)

In this example, we train an XGBoost classifier using the scikit-learn interface, which provides an intuitive .fit() and .predict() workflow. The key parameters are:

  • n_estimators=100: how many boosting rounds (trees) the model will build.
  • learning_rate=0.1: how much each new tree contributes to correcting previous errors; smaller values make learning more stable but require more trees.
  • max_depth=3: how deep each decision tree can grow, which influences model complexity and overfitting.

Training is performed with model.fit(X_train, y_train), where XGBoost iteratively builds trees that minimize predictive error, and predictions are obtained via model.predict(X_test). Finally, we compute accuracy with accuracy_score, which measures how often the model correctly predicts class labels. This small example demonstrates how XGBoost's core boosting mechanism, combined with just a few essential hyperparameters, can produce a strong baseline model with minimal setup.
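As a small follow-up (assuming the model and X_test objects from the snippet above), the same estimator also exposes class probabilities and per-feature importance scores:

# Class probabilities instead of hard labels
proba = model.predict_proba(X_test)
print("Predicted probability of class 1 for the first sample:", proba[0, 1])

# Importance score of each input feature (one value per column of X)
print("Feature importances:", model.feature_importances_)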

Task


You are given a regression dataset. Your task is to:

  1. Load the dataset and split it into train/test sets.
  2. Initialize an XGBRegressor with the following parameters:
    • n_estimators=200.
    • learning_rate=0.05.
    • max_depth=4.
    • subsample=0.8.
    • random_state=42.
  3. Train the model.
  4. Predict on the test set.
  5. Compute Mean Squared Error (MSE) and store it in mse_value.
  6. Print dataset shapes, model parameters, and the final MSE.

Solution
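The exercise's dataset loader is not shown in this excerpt, so the sketch below stands in with a synthetic make_regression dataset; apart from that assumption, it is one possible solution that follows the listed steps with the required XGBRegressor parameters:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 1) Load (here: generate) the dataset and split it into train/test sets
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2) Initialize the regressor with the required parameters
model = XGBRegressor(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    random_state=42,
    verbosity=0
)

# 3) Train the model
model.fit(X_train, y_train)

# 4) Predict on the test set
preds = model.predict(X_test)

# 5) Compute Mean Squared Error and store it in mse_value
mse_value = mean_squared_error(y_test, preds)

# 6) Print dataset shapes, model parameters, and the final MSE
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)
print("Model parameters:", model.get_params())
print("MSE:", mse_value)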
