SelectBySingleFeaturePerformance

SelectBySingleFeaturePerformance() selects features based on the performance of machine learning models trained on individual features. That is, each feature is judged by how well a model trained on that feature alone performs. In short, the selection algorithm works as follows:

  1. Train one machine learning model per feature, using only that feature

  2. Evaluate each model with the performance metric of choice

  3. Retain the features whose performance is above a given threshold

If the parameter threshold is left as None, the transformer will select the features whose performance is above the mean performance of all features.
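To make the logic concrete, below is a minimal sketch of the same procedure written with plain scikit-learn. The helper single_feature_performance() is hypothetical, not part of Feature-engine; it assumes a regression problem where X is a DataFrame and y is the target.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def single_feature_performance(X, y, threshold=None):
    # 1. train one model per feature and record its cross-validated r2
    performance = {
        col: cross_val_score(
            LinearRegression(), X[[col]], y, cv=3, scoring="r2"
        ).mean()
        for col in X.columns
    }
    # 2. if no threshold is given, use the mean performance of all features
    if threshold is None:
        threshold = np.mean(list(performance.values()))
    # 3. retain the features whose performance is above the threshold
    selected = [col for col, score in performance.items() if score > threshold]
    return selected, performance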

Example

Let’s see how to use this transformer with the diabetes dataset that comes with Scikit-learn. First, we load the data:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectBySingleFeaturePerformance

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.Series(diabetes_y)
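The dataset contains 442 observations of 10 numerical predictors:

# inspect the shape of the feature matrix
print(X.shape)
(442, 10)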

Now, we set up SelectBySingleFeaturePerformance() to select features based on the r2 returned by a linear regression, using 3-fold cross-validation. We want to select the features whose r2 is greater than 0.01.

# initialize feature selector
sel = SelectBySingleFeaturePerformance(
    estimator=LinearRegression(),
    scoring="r2",
    cv=3,
    threshold=0.01,
)

With fit(), the transformer trains one model per feature, determines each model's performance, and selects the important features:

# fit transformer
sel.fit(X, y)

The features that will be dropped are stored in the features_to_drop_ attribute:

sel.features_to_drop_
[1]

SelectBySingleFeaturePerformance() also stores the performance of each of the models, in case we want to study them further. Note that feature 1 obtained a negative r2, below the 0.01 threshold, which is why it is the one being dropped:

sel.feature_performance_
{0: 0.029231969375784466,
 1: -0.003738551760264386,
 2: 0.336620809987693,
 3: 0.19219056680145055,
 4: 0.037115559827549806,
 5: 0.017854228256932614,
 6: 0.15153886177526896,
 7: 0.17721609966501747,
 8: 0.3149462084418813,
 9: 0.13876602125792703}
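If we want to examine these values more closely, we could, for example, load them into a pandas Series and rank the features by their individual performance:

# rank features by their individual r2
performance = pd.Series(sel.feature_performance_).sort_values(ascending=False)
print(performance)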

With transform() we go ahead and remove the features from the dataset:

# drop variables
Xt = sel.transform(X)

If we now print the transformed data, we see that feature 1, the only feature in features_to_drop_, was removed.

print(Xt.head())
          0         2         3         4         5         6         7  \
0  0.038076  0.061696  0.021872 -0.044223 -0.034821 -0.043401 -0.002592
1 -0.001882 -0.051474 -0.026328 -0.008449 -0.019163  0.074412 -0.039493
2  0.085299  0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592
3 -0.089063 -0.011595 -0.036656  0.012191  0.024991 -0.036038  0.034309
4  0.005383 -0.036385  0.021872  0.003935  0.015596  0.008142 -0.002592

          8         9
0  0.019907 -0.017646
1 -0.068332 -0.092204
2  0.002861 -0.025930
3  0.022688 -0.009362
4 -0.031988 -0.046641
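Like any Scikit-learn-compatible transformer, the selector also offers fit_transform(), which fits and transforms in a single step:

# fit the selector and drop the underperforming features in one go
Xt = sel.fit_transform(X, y)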

More details


All notebooks can be found in a dedicated repository.

For more details about this and other feature selection methods, check out the resources in the Feature-engine documentation.