RecursiveFeatureAddition

API Reference

class feature_engine.selection.RecursiveFeatureAddition(estimator=RandomForestClassifier(), scoring='roc_auc', cv=3, threshold=0.01, variables=None)[source]

RecursiveFeatureAddition selects features following a recursive process.

The process is as follows:

  1. Train an estimator using all the features.

  2. Rank the features according to their importance, derived from the estimator.

  3. Train an estimator with the most important feature and determine its performance.

  4. Add the second most important feature and train a new estimator.

  5. Calculate the difference in performance between the last estimator and the previous one.

  6. If the performance increases beyond the threshold, then that feature is important and will be kept. Otherwise, that feature is removed.

  7. Repeat steps 4-6 until all features have been evaluated.

Model training and performance calculation are done with cross-validation. A simplified sketch of this selection loop is shown below.
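The sketch below is an illustration of the logic only, not the library's implementation: it ranks features with a single fit instead of the cross-validated fits that RecursiveFeatureAddition performs, and it reads coef_ because the stand-in estimator is linear (tree-based estimators expose feature_importances_ instead).

import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True, as_frame=True)
model, scoring, cv, threshold = LinearRegression(), "r2", 3, 0.01

# steps 1-2: train on all features and rank them by importance
full_model = clone(model).fit(X, y)
ranked = X.columns[np.argsort(-np.abs(full_model.coef_))]

# step 3: start from the single most important feature
selected = [ranked[0]]
baseline = cross_val_score(
    clone(model), X[selected], y, cv=cv, scoring=scoring).mean()

# steps 4-7: add one feature at a time; keep it only if the
# performance gain exceeds the threshold
for feature in ranked[1:]:
    score = cross_val_score(
        clone(model), X[selected + [feature]], y, cv=cv, scoring=scoring).mean()
    if score - baseline > threshold:
        selected.append(feature)
        baseline = score

print(selected)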

Parameters
variables: str or list, default=None

The list of variables to be evaluated. If None, the transformer will evaluate all numerical features in the dataset.

estimator: object, default=RandomForestClassifier()

A Scikit-learn estimator for regression or classification. The estimator must have either a feature_importances_ or a coef_ attribute after fitting.

scoring: str, default='roc_auc'

Metric used to optimise the performance of the estimator. It comes from sklearn.metrics. See the model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

threshold: float or int, default=0.01

The value that defines whether a feature will be kept or removed. Note that for metrics like roc-auc, r2_score and accuracy, the threshold will be a float between 0 and 1. For error metrics like mean_squared_error and root_mean_squared_error, the threshold can take much bigger values, on the scale of the metric itself. The threshold must be defined by the user. Bigger thresholds will select fewer features (see the configuration sketch after this parameter list).

cv: int, default=3

Number of cross-validation folds used to fit and evaluate the estimator.
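As an illustration of how these parameters interact, the hypothetical configuration below pairs a Lasso estimator (which exposes coef_) with an error metric; note that sklearn exposes error metrics as negated scorers, so the threshold lives on that scale rather than between 0 and 1. The alpha and threshold values are placeholders chosen for illustration.

from sklearn.linear_model import Lasso
from feature_engine.selection import RecursiveFeatureAddition

sel = RecursiveFeatureAddition(
    estimator=Lasso(alpha=0.01),       # linear model: exposes coef_ after fit
    scoring="neg_mean_squared_error",  # sklearn scorer for an error metric
    cv=5,
    threshold=10,    # in (negated) MSE units, not a 0-1 fraction
    variables=None,  # evaluate all numerical variables
)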

Attributes

initial_model_performance_:

Performance of the model trained using the original dataset.

feature_importances_:

Pandas Series with the feature importances.

performance_drifts_:

Dictionary with the performance drift per examined feature.

features_to_drop_:

List with the features to remove from the dataset.
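These attributes make the selection auditable after fitting. For example, assuming a fitted selector named sel (such as the one sketched after the parameter list), the performance drifts can be sorted and the retained features recovered as the complement of features_to_drop_:

import pandas as pd

# drift per examined feature, largest improvement first
drifts = pd.Series(sel.performance_drifts_).sort_values(ascending=False)
print(drifts)

# retained features are those the selector did not mark for removal
kept = [f for f in drifts.index if f not in sel.features_to_drop_]
print(kept)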

Methods

fit:

Find the important features.

transform:

Reduce X to the selected features.

fit_transform:

Fit to data, then transform it.

fit(X, y)[source]

Find the important features. Note that the selector trains one model per feature examined, each with cross-validation, so it might take a while.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

y: array-like of shape (n_samples)

Target variable. Required to train the estimator.

Returns
self
transform(X)[source]

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.

Return type:

DataFrame

Example

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from feature_engine.selection import RecursiveFeatureAddition

# load dataset
diabetes_X, diabetes_y = load_diabetes(return_X_y=True)
X = pd.DataFrame(diabetes_X)
y = pd.DataFrame(diabetes_y)

# initialize linear regression estimator
linear_model = LinearRegression()

# initialize feature selector
tr = RecursiveFeatureAddition(estimator=linear_model, scoring="r2", cv=3)

# fit transformer
Xt = tr.fit_transform(X, y)

# get the initial linear model performance, using all features
tr.initial_model_performance_
0.488702767247119
# Get the performance drift of each feature
tr.performance_drifts_
{4: 0,
 8: 0.2837159006046677,
 2: 0.1377700238871593,
 5: 0.0023329006089969906,
 3: 0.0187608758643259,
 1: 0.0027994385024313617,
 7: 0.0026951300105543807,
 6: 0.002683967832484757,
 9: 0.0003040126429713075,
 0: -0.007386876030245182}
# get the selected features
tr.selected_features_
[4, 8, 2, 3]
print(Xt.head())
          4         8         2         3
0 -0.044223  0.019908  0.061696  0.021872
1 -0.008449 -0.068330 -0.051474 -0.026328
2 -0.045599  0.002864  0.044451 -0.005671
3  0.012191  0.022692 -0.011595 -0.036656
4  0.003935 -0.031991 -0.036385  0.021872
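
Because the transformer follows the scikit-learn API, it can also sit inside a Pipeline, so that feature selection is re-run within each cross-validation split rather than on the full dataset. A minimal sketch, assuming the same X and y as above:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from feature_engine.selection import RecursiveFeatureAddition

pipe = Pipeline([
    ("selection", RecursiveFeatureAddition(
        estimator=LinearRegression(), scoring="r2", cv=3)),
    ("regression", LinearRegression()),
])

# selection is refitted inside each fold, avoiding selection leakage
scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(scores.mean())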