RecursiveFeatureAddition

API Reference

class feature_engine.selection.RecursiveFeatureAddition(estimator, scoring='roc_auc', cv=3, threshold=0.01, variables=None)[source]

RecursiveFeatureAddition selects features following a recursive process.

The process is as follows:

  1. Train an estimator using all the features.

  2. Rank the features according to their importance, derived from the estimator.

  3. Train an estimator with the most important feature and determine its performance.

  4. Add the second most important feature and train a new estimator.

  5. Calculate the difference in performance between the last estimator and the previous one.

  6. If the performance increases beyond the threshold, then that feature is important and will be kept. Otherwise, that feature is removed.

  7. Repeat steps 4-6 until all features have been evaluated.

Model training and performance calculation are done with cross-validation.
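To make the procedure concrete, here is a minimal sketch of the selection logic using plain scikit-learn. This is illustrative only, not feature_engine's actual implementation; the estimator, metric, cv and threshold match the defaults used in the example at the bottom of this page.

import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
estimator, scoring, cv, threshold = LinearRegression(), "r2", 3, 0.01

# steps 1-2: rank features by importance, from a model trained on all features
importances = pd.Series(
    np.abs(estimator.fit(X, y).coef_), index=X.columns
).sort_values(ascending=False)

# step 3: a model with the single most important feature sets the baseline
selected = [importances.index[0]]
baseline = cross_val_score(estimator, X[selected], y, cv=cv, scoring=scoring).mean()

# steps 4-7: add each remaining feature; keep it only if the performance
# increase exceeds the threshold
for feature in importances.index[1:]:
    score = cross_val_score(
        estimator, X[selected + [feature]], y, cv=cv, scoring=scoring
    ).mean()
    if score - baseline > threshold:
        selected.append(feature)
        baseline = score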

Parameters
estimator: object

A Scikit-learn estimator for regression or classification. The estimator must have either a feature_importances_ or a coef_ attribute after fitting.
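For illustration, either of the following scikit-learn estimators would satisfy this requirement (a hedged sketch, not an exhaustive list):

from sklearn.ensemble import RandomForestClassifier  # exposes feature_importances_
from sklearn.linear_model import LogisticRegression  # exposes coef_
from feature_engine.selection import RecursiveFeatureAddition

sel = RecursiveFeatureAddition(estimator=RandomForestClassifier(random_state=0))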

variables: str or list, default=None

The list of variables to evaluate. If None, the transformer will evaluate all numerical variables in the dataset.
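For example, to restrict the search to a subset of columns (the column names below are hypothetical):

from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureAddition

sel = RecursiveFeatureAddition(
    estimator=LinearRegression(),
    scoring="r2",
    variables=["age", "bmi", "bp"],  # hypothetical column names
)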

scoring: str, default='roc_auc'

Desired metric to optimise the performance of the estimator. Comes from sklearn.metrics. See the model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

threshold: float, int, default=0.01

The value that defines whether a feature will be kept or removed. Note that for metrics like roc-auc, r2_score and accuracy, the threshold will be a float between 0 and 1. For error metrics like the mean_squared_error and the root_mean_squared_error, the threshold will be a much bigger number, on the scale of the error itself. The threshold must be defined by the user. Bigger thresholds will select fewer features.
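The snippet below illustrates how the threshold scale follows the metric. The exact values are assumptions to tune for your data, not recommendations:

from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureAddition

# r2 lives between 0 and 1, so a small float threshold is appropriate
sel_r2 = RecursiveFeatureAddition(
    estimator=LinearRegression(), scoring="r2", threshold=0.01
)

# neg_mean_squared_error is on the scale of the squared target, so the
# threshold must be correspondingly larger
sel_mse = RecursiveFeatureAddition(
    estimator=LinearRegression(), scoring="neg_mean_squared_error", threshold=100
)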

cv: int, cross-validation generator or an iterable, default=3

Determines the cross-validation splitting strategy. Possible inputs for cv are: an integer, to specify the number of folds in a (Stratified)KFold; a cross-validation splitter object; or an iterable yielding (train, test) splits as arrays of indices.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

For more details check Scikit-learn’s cross_validate documentation
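For example, to pass an explicit splitter instead of an integer (a sketch using scikit-learn's KFold):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from feature_engine.selection import RecursiveFeatureAddition

splitter = KFold(n_splits=5, shuffle=True, random_state=0)
sel = RecursiveFeatureAddition(estimator=LinearRegression(), scoring="r2", cv=splitter)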

Attributes

initial_model_performance_ :

Performance of the model trained using the original dataset.

feature_importances_ :

Pandas Series with the feature importance, derived in step 2 of the selection process.

performance_drifts_:

Dictionary with the performance drift per examined feature.

features_to_drop_:

List with the features to remove from the dataset.

variables_:

The variables to consider for the feature selection.

n_features_in_:

The number of features in the train set used in fit.

Methods

fit:

Find the important features.

transform:

Reduce X to the selected features.

fit_transform:

Fit to data, then transform it.

fit(X, y)[source]

Find the important features. Note that the selector trains various models at each round of selection, so it might take a while.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

y: array-like of shape (n_samples)

Target variable. Required to train the estimator.

Returns
self
transform(X)[source]

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features].

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.


Example

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureAddition

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)

# initialize linear regression estimator
linear_model = LinearRegression()

# initialize feature selector
tr = RecursiveFeatureAddition(estimator=linear_model, scoring="r2", cv=3)

# fit transformer
Xt = tr.fit_transform(X, y)

# get the initial linear model performance, using all features
tr.initial_model_performance_
0.488702767247119
# Get the performance drift of each feature
tr.performance_drifts_
{4: 0,
 8: 0.2837159006046677,
 2: 0.1377700238871593,
 5: 0.0023329006089969906,
 3: 0.0187608758643259,
 1: 0.0027994385024313617,
 7: 0.0026951300105543807,
 6: 0.002683967832484757,
 9: 0.0003040126429713075,
 0: -0.007386876030245182}
# the features to drop
tr.features_to_drop_
[0, 1, 5, 6, 7, 9]
print(Xt.head())
          4         8         2         3
0 -0.044223  0.019908  0.061696  0.021872
1 -0.008449 -0.068330 -0.051474 -0.026328
2 -0.045599  0.002864  0.044451 -0.005671
3  0.012191  0.022692 -0.011595 -0.036656
4  0.003935 -0.031991 -0.036385  0.021872
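Since the selector follows the scikit-learn transformer API, it can also be placed inside a Pipeline, so that feature selection and model fitting happen in one step. A usage sketch, building on the example above:

from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("selection", RecursiveFeatureAddition(estimator=linear_model, scoring="r2", cv=3)),
    ("regression", LinearRegression()),
])
pipe.fit(X, y)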