Wrapper and Embedded Methods: RFE and SelectFromModel
Understanding feature selection is crucial to building robust and interpretable machine learning models. Two important categories of feature selection techniques are wrapper methods and embedded methods. Wrapper methods, such as Recursive Feature Elimination (RFE), use a predictive model to evaluate subsets of features and keep the subset that yields the best model performance. In contrast, embedded methods incorporate feature selection into the model training process itself; SelectFromModel paired with Lasso regression is a common example. The key difference is that wrapper methods repeatedly train models on different feature subsets, while embedded methods select features from attributes learned during a single training run, such as coefficients or feature importances.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- Load dataset ---
data = fetch_california_housing()
X = data.data          # shape (20640, 8)
y = data.target
feature_names = np.array(data.feature_names)

# --- Standardize once: RFE ranks features by coefficient magnitude,
# so scaling is needed for fair comparisons, and Lasso converges faster ---
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

# --- RFE (step=2 drops two features per round, so fewer refits) ---
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=5, step=2)
rfe.fit(Xs, y)
rfe_features = feature_names[rfe.support_]

# --- Lasso-based selection ---
lasso = Lasso(alpha=0.1, random_state=42, max_iter=1000)
lasso.fit(Xs, y)
sfm = SelectFromModel(lasso, prefit=True)
lasso_features = feature_names[sfm.get_support()]

# --- Compare ---
overlap = set(rfe_features) & set(lasso_features)
print("RFE selected features:", list(rfe_features))
print("SelectFromModel (Lasso) selected features:", list(lasso_features))
print("Overlap between RFE and Lasso-selected features:", list(overlap))
Both wrapper and embedded methods have distinct advantages and limitations. Wrapper methods like RFE are flexible because they can work with any model and optimize directly for the predictive task. However, they are computationally expensive, especially with large datasets or many features, since they refit the model once per elimination round. Embedded methods such as SelectFromModel with Lasso are typically faster and scale better because feature selection happens during a single model training run. Their effectiveness, however, depends on the model's assumptions; for instance, Lasso may arbitrarily select one feature among several highly correlated ones, potentially missing important predictors. As you saw in the code, the features selected by RFE and SelectFromModel with Lasso can overlap, but they may also differ because the two approaches rank features in different ways.
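To make the cost difference concrete, here is a minimal timing sketch on the same California housing data, reusing the illustrative alpha=0.1 and n_features_to_select=5 from above. The absolute numbers will vary by machine; the point is only that the wrapper method pays for its repeated refits while the embedded method needs a single fit.

import time
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Wrapper: RFE refits the model after every elimination round (step=1 here).
start = time.perf_counter()
RFE(LinearRegression(), n_features_to_select=5, step=1).fit(Xs, y)
rfe_seconds = time.perf_counter() - start

# Embedded: SelectFromModel needs only one Lasso fit.
start = time.perf_counter()
SelectFromModel(Lasso(alpha=0.1, random_state=42)).fit(Xs, y)
sfm_seconds = time.perf_counter() - start

print(f"RFE: {rfe_seconds:.3f}s, SelectFromModel: {sfm_seconds:.3f}s")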
Multicollinearity — when two or more features are highly correlated — can impact feature selection. In such cases, methods like Lasso may select one correlated feature and ignore others, which can make interpretation tricky and sometimes lead to instability in the selected feature set.
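To see this instability concretely, here is a small synthetic sketch (the generated data and alpha=0.1 are illustrative assumptions, not part of the lesson's dataset). Two near-duplicate features compete for the same signal, and Lasso typically keeps one and zeroes out the other; which one survives can change with the noise seed.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
# Two almost perfectly correlated features plus one independent feature.
x1 = base + 0.01 * rng.normal(size=n)
x2 = base + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * base + 2 * x3 + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)
# Typically only one of the first two coefficients is nonzero: the L1
# penalty has no incentive to split weight between near-duplicates.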