SelectByShuffling#

SelectByShuffling() selects important features, that is, those whose random permutation decreases the model performance. If a feature is predictive, randomly shuffling its values across the rows should return predictions that are off the mark. If the feature is not predictive, shuffling its values should have minimal impact on the predictions.

Procedure#

The algorithm works as follows (see the sketch after the list):

  1. Train a machine learning model using all features

  2. Determine a model performance metric of choice

  3. Shuffle the values of 1 feature across the rows

  4. Use the model trained in 1 to obtain new predictions

  5. Determine the model performance using the predictions from 4

  6. If there is a drop in performance beyond a threshold, keep the feature.

  7. Repeat 3-6 until all features are examined.
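
To make the steps concrete, here is a minimal manual sketch of the procedure using scikit-learn. For brevity it scores on the training data without cross-validation, and the 0.01 threshold is arbitrary; SelectByShuffling() handles both more carefully:

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X, y = load_diabetes(return_X_y=True, as_frame=True)

# steps 1-2: train the model and record its baseline performance
model = LinearRegression().fit(X, y)
baseline = r2_score(y, model.predict(X))

rng = np.random.default_rng(0)
selected = []
for feature in X.columns:
    # step 3: shuffle the values of a single feature
    X_shuffled = X.copy()
    X_shuffled[feature] = rng.permutation(X_shuffled[feature].values)
    # steps 4-5: predict with the trained model and re-score
    drift = baseline - r2_score(y, model.predict(X_shuffled))
    # step 6: keep the feature if the performance drop exceeds the threshold
    if drift > 0.01:  # arbitrary illustrative threshold
        selected.append(feature)

print(selected)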

Example#

Let’s see how to use this transformer with the diabetes dataset that comes with Scikit-learn. First, we load the data:

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import SelectByShuffling

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.Series(diabetes_y)

Now, we set up the model for which we want to have the performance drop evaluated:

# initialize linear regression estimator
linear_model = LinearRegression()

Now, we instantiate SelectByShuffling() to select features by shuffling, based on the r2 of the model from the previous cell, using 3-fold cross-validation. The parameter threshold was left as None, which means that a feature will be selected if the performance drop it causes is bigger than the mean drop caused by shuffling all the features.

# initialize feature selector
tr = SelectByShuffling(estimator=linear_model, scoring="r2", cv=3)
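
If you would rather use a fixed cut-off, SelectByShuffling() accepts a numeric threshold, as well as a random_state to make the shuffles reproducible. The values below are illustrative only; the rest of the example continues with the default set-up above:

# alternative set-up with an explicit threshold (illustrative values)
tr_explicit = SelectByShuffling(
    estimator=linear_model,
    scoring="r2",
    cv=3,
    threshold=0.05,  # keep features whose performance drift exceeds 0.05
    random_state=42,  # make the random shuffles reproducible
)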

With fit() the transformer finds the important variables, that is, those whose shuffled values caused a drop in the model performance. With transform() it removes the non-important features from the dataset:

# fit transformer
Xt = tr.fit_transform(X, y)

SelectByShuffling() stores the performance of the model trained with all the features in its initial_model_performance_ attribute:

tr.initial_model_performance_
0.488702767247119

SelectByShuffling() also stores the performance change caused by shuffling each feature. If you are not satisfied with the selected threshold, you can get an idea of where to set it by inspecting these values:

tr.performance_drifts_
{0: -0.0035681361984126747,
 1: 0.041170843574652394,
 2: 0.1920054944393057,
 3: 0.07007527443645178,
 4: 0.49871458125373913,
 5: 0.1802858704499694,
 6: 0.025536233845966705,
 7: 0.024058931694668884,
 8: 0.40901959802129045,
 9: 0.004487448637912506}
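
Because threshold was left as None, the cut-off is the mean of these drifts, which we can compute directly from the stored dictionary:

pd.Series(tr.performance_drifts_).mean()

For the drifts above this comes out to roughly 0.144, so only features 2, 4, 5 and 8, whose drifts exceed it, are retained.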

SelectByShuffling() also stores the features that will be dropped based on the indicated threshold:

tr.features_to_drop_
[0, 1, 3, 6, 7, 9]
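
Conversely, the features that will be kept are the complement of this list, which we can derive from the DataFrame columns:

[f for f in X.columns if f not in tr.features_to_drop_]
[2, 4, 5, 8]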

If we now print the transformed data, we see that the features above were removed:

print(Xt.head())
          2         4         5         8
0  0.061696 -0.044223 -0.034821  0.019907
1 -0.051474 -0.008449 -0.019163 -0.068332
2  0.044451 -0.045599 -0.034194  0.002861
3 -0.011595  0.012191  0.024991  0.022688
4 -0.036385  0.003935  0.015596 -0.031988

For more details about this and other feature selection methods, check out these resources: