Wrapper and Embedded Methods: RFE and SelectFromModel
Understanding feature selection is crucial to building robust and interpretable machine learning models. Two important categories of feature selection techniques are wrapper methods and embedded methods. Wrapper methods, such as Recursive Feature Elimination (RFE), use a predictive model to evaluate subsets of features and keep the subset that yields the best model performance. In contrast, embedded methods incorporate feature selection into the model training process itself; SelectFromModel paired with Lasso regression is a common example. The key difference is that wrapper methods repeatedly train models on different feature subsets, while embedded methods select features from attributes learned during a single training run, such as coefficients or feature importances.
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.preprocessing import StandardScaler
import numpy as np

# --- Load dataset ---
data = fetch_california_housing()
X = data.data          # shape (20640, 8)
y = data.target
feature_names = np.array(data.feature_names)

# --- Standardize once: RFE ranks features by coefficient magnitude,
# so scaling is needed for fair comparisons, and Lasso converges faster ---
scaler = StandardScaler()
Xs = scaler.fit_transform(X)

# --- RFE (step=2 drops two features per round, so fewer refits) ---
lr = LinearRegression()
rfe = RFE(estimator=lr, n_features_to_select=5, step=2)
rfe.fit(Xs, y)
rfe_features = feature_names[rfe.support_]

# --- Lasso-based selection ---
lasso = Lasso(alpha=0.1, random_state=42, max_iter=1000)
lasso.fit(Xs, y)
sfm = SelectFromModel(lasso, prefit=True)
lasso_features = feature_names[sfm.get_support()]

# --- Compare ---
overlap = set(rfe_features) & set(lasso_features)
print("RFE selected features:", list(rfe_features))
print("SelectFromModel (Lasso) selected features:", list(lasso_features))
print("Overlap between RFE and Lasso-selected features:", list(overlap))
Both wrapper and embedded methods have distinct advantages and limitations. Wrapper methods like RFE are flexible because they can work with any model and optimize directly for the predictive task. However, they are computationally expensive, especially with large datasets or many features, since they refit the model once per elimination round. Embedded methods such as SelectFromModel with Lasso are typically faster and scale better because feature selection happens during a single model training run. Their effectiveness, however, depends on the model's assumptions; for instance, Lasso may arbitrarily select one feature among several highly correlated ones, potentially missing important predictors. As you saw in the code, the features selected by RFE and SelectFromModel with Lasso can overlap, but they may also differ because the two approaches rank features in different ways.
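To make the cost difference concrete, here is a minimal timing sketch on the same California housing data, reusing the illustrative alpha=0.1 and n_features_to_select=5 from above. The absolute numbers will vary by machine; the point is only that the wrapper method pays for its repeated refits while the embedded method needs a single fit.

import time
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = fetch_california_housing(return_X_y=True)
Xs = StandardScaler().fit_transform(X)

# Wrapper: RFE refits the model after every elimination round (step=1 here).
start = time.perf_counter()
RFE(LinearRegression(), n_features_to_select=5, step=1).fit(Xs, y)
rfe_seconds = time.perf_counter() - start

# Embedded: SelectFromModel needs only one Lasso fit.
start = time.perf_counter()
SelectFromModel(Lasso(alpha=0.1, random_state=42)).fit(Xs, y)
sfm_seconds = time.perf_counter() - start

print(f"RFE: {rfe_seconds:.3f}s, SelectFromModel: {sfm_seconds:.3f}s")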
Multicollinearity — when two or more features are highly correlated — can impact feature selection. In such cases, methods like Lasso may select one correlated feature and ignore others, which can make interpretation tricky and sometimes lead to instability in the selected feature set.
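To see this instability concretely, here is a small synthetic sketch (the generated data and alpha=0.1 are illustrative assumptions, not part of the lesson's dataset). Two near-duplicate features compete for the same signal, and Lasso typically keeps one and zeroes out the other; which one survives can change with the noise seed.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
# Two almost perfectly correlated features plus one independent feature.
x1 = base + 0.01 * rng.normal(size=n)
x2 = base + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 3 * base + 2 * x3 + rng.normal(scale=0.1, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
print("Lasso coefficients:", lasso.coef_)
# Typically only one of the first two coefficients is nonzero: the L1
# penalty has no incentive to split weight between near-duplicates.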