SelectBySingleFeaturePerformance

SelectBySingleFeaturePerformance() selects features based on the performance of machine learning models trained on individual features. That is, each feature is judged by how well a model trained on that feature alone performs. In short, the selection algorithm works as follows:

  1. Train one machine learning model per feature, using only that feature

  2. Evaluate each model with the performance metric of choice

  3. Retain the features whose performance is above a given threshold

If the parameter threshold is left as None, the transformer will select the features whose performance is above the mean performance of all features.
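To make the logic concrete, below is a minimal sketch of the same procedure written with plain scikit-learn. The helper single_feature_performance() is hypothetical, not part of Feature-engine; it assumes a regression problem where X is a DataFrame and y is the target.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def single_feature_performance(X, y, threshold=None):
    # 1. train one model per feature and record its cross-validated r2
    performance = {
        col: cross_val_score(
            LinearRegression(), X[[col]], y, cv=3, scoring="r2"
        ).mean()
        for col in X.columns
    }
    # 2. if no threshold is given, use the mean performance of all features
    if threshold is None:
        threshold = np.mean(list(performance.values()))
    # 3. retain the features whose performance is above the threshold
    selected = [col for col, score in performance.items() if score > threshold]
    return selected, performance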

Example

Let’s see how to use this transformer with the diabetes dataset that comes with Scikit-learn. First, we load the data:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectBySingleFeaturePerformance

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.Series(diabetes_y)
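The dataset contains 442 observations of 10 numerical predictors:

# inspect the shape of the feature matrix
print(X.shape)
(442, 10)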

Now, we set up SelectBySingleFeaturePerformance() to select features based on the r2 returned by a linear regression, using 3-fold cross-validation. We want to select the features whose r2 is greater than 0.01.

# initialize feature selector
sel = SelectBySingleFeaturePerformance(
    estimator=LinearRegression(),
    scoring="r2",
    cv=3,
    threshold=0.01,
)

With fit(), the transformer trains one model per feature, determines each model's performance, and selects the important features:

# fit transformer
sel.fit(X, y)

The features that will be dropped are stored in the features_to_drop_ attribute:

sel.features_to_drop_
[1]

SelectBySingleFeaturePerformance() also stores the performance of each of the models, in case we want to study them further. Note that feature 1 obtained a negative r2, below the 0.01 threshold, which is why it is the one being dropped:

sel.feature_performance_
{0: 0.029231969375784466,
 1: -0.003738551760264386,
 2: 0.336620809987693,
 3: 0.19219056680145055,
 4: 0.037115559827549806,
 5: 0.017854228256932614,
 6: 0.15153886177526896,
 7: 0.17721609966501747,
 8: 0.3149462084418813,
 9: 0.13876602125792703}
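If we want to examine these values more closely, we could, for example, load them into a pandas Series and rank the features by their individual performance:

# rank features by their individual r2
performance = pd.Series(sel.feature_performance_).sort_values(ascending=False)
print(performance)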

With transform() we go ahead and remove the features from the dataset:

# drop variables
Xt = sel.transform(X)

If we now print the transformed data, we see that feature 1, the only feature in features_to_drop_, was removed.

print(Xt.head())
          0         2         3         4         5         6         7  \
0  0.038076  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592
1 -0.001882 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493
2  0.085299  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592
3 -0.089063 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309
4  0.005383 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592

          8         9
0  0.019907 -0.017646
1 -0.068332 -0.092204
2  0.002861 -0.025930
3  0.022688 -0.009362
4 -0.031988 -0.046641
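Like any Scikit-learn-compatible transformer, the selector also offers fit_transform(), which fits and transforms in a single step:

# fit the selector and drop the underperforming features in one go
Xt = sel.fit_transform(X, y)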

More details


All notebooks can be found in a dedicated repository.

For more details about this and other feature selection methods, check out the resources in the Feature-engine documentation.