Feature Selection#

Feature-engine’s feature selection transformers drop subsets of variables with little predictive value. Feature-engine hosts selection algorithms that are, in general, not available in other libraries; they have been gathered from data science competitions or from industry practice.

Feature-engine’s transformers select features through different strategies: some remove constant or quasi-constant features, some remove duplicated or correlated variables, some select features based on the performance of a machine learning model, some implement selection procedures used in finance, and others provide functionality developed in industry or in data science competitions.

The following tables show the algorithms in each category, whether they support categorical variables, and whether they allow missing data (NA).

Selection based on feature characteristics#

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| DropFeatures() | ✓ | ✓ | Drops arbitrary features determined by the user |
| DropConstantFeatures() | ✓ | ✓ | Drops constant and quasi-constant features |
| DropDuplicateFeatures() | ✓ | ✓ | Drops features that are duplicated |
| DropCorrelatedFeatures() | × | ✓ | Drops features that are correlated |
| SmartCorrelatedSelection() | × | ✓ | From a group of correlated features, drops the less useful ones |
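
All of these selectors follow the scikit-learn fit/transform interface and work on pandas DataFrames, so they can be chained in a Pipeline. Below is a minimal sketch combining two of them; the toy DataFrame and the `tol` and `threshold` values are illustrative choices, not recommended settings.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from feature_engine.selection import DropConstantFeatures, DropCorrelatedFeatures

# Toy data: one quasi-constant column and two highly correlated columns.
X = pd.DataFrame({
    "almost_constant": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "x1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "x2": [1.1, 2.0, 3.2, 4.1, 4.9, 6.2, 7.1, 7.9, 9.2, 10.1],  # ~ x1
    "x3": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

pipe = Pipeline([
    # drop variables where a single value covers at least 90% of observations
    ("constant", DropConstantFeatures(tol=0.9)),
    # drop one variable from each pair with Pearson correlation above 0.8
    ("correlated", DropCorrelatedFeatures(method="pearson", threshold=0.8)),
])

X_t = pipe.fit_transform(X)
print(X_t.columns.tolist())  # e.g. ['x1', 'x3']
```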

Selection based on a machine learning model#

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| SelectBySingleFeaturePerformance() | × | × | Selects features based on single feature model performance |
| RecursiveFeatureElimination() | × | × | Removes features recursively by evaluating model performance |
| RecursiveFeatureAddition() | × | × | Adds features recursively by evaluating model performance |
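
These selectors wrap a machine learning model: they take a scikit-learn estimator, a scoring metric and a cross-validation scheme, and keep or remove features depending on how model performance changes. A minimal sketch with RecursiveFeatureElimination follows; the random forest, the scoring metric and the threshold are example choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import RecursiveFeatureElimination

# Numerical-only dataset: these selectors do not accept categorical variables or NA.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

rfe = RecursiveFeatureElimination(
    estimator=RandomForestClassifier(n_estimators=10, random_state=0),
    scoring="roc_auc",   # metric used to judge whether a feature can be removed
    cv=3,                # cross-validation folds
    threshold=0.001,     # maximum performance drop tolerated when removing a feature
)

X_t = rfe.fit_transform(X, y)
print(rfe.features_to_drop_)  # features whose removal did not hurt performance
```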

Selection methods commonly used in finance#

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| DropHighPSIFeatures() | × | ✓ | Drops features with high Population Stability Index |
| SelectByInformationValue() | ✓ | × | Drops features with low information value |
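
DropHighPSIFeatures splits the data into a base and a test portion and flags features whose distribution shifts between the two, while SelectByInformationValue ranks features by their information value with respect to a binary target. The sketch below illustrates the PSI selector on synthetic data; the split fraction, threshold and number of bins are example settings.

```python
import numpy as np
import pandas as pd
from feature_engine.selection import DropHighPSIFeatures

rng = np.random.default_rng(0)
n = 1000

# 'stable' keeps the same distribution throughout; 'drifting' shifts in the second half.
X = pd.DataFrame({
    "stable": rng.normal(0, 1, n),
    "drifting": np.r_[rng.normal(0, 1, n // 2), rng.normal(2, 1, n // 2)],
})

psi = DropHighPSIFeatures(
    split_frac=0.5,   # first half of the data is the reference (base) portion
    threshold=0.25,   # PSI above 0.25 is commonly read as a significant shift
    bins=10,
)
X_t = psi.fit_transform(X)

print(psi.psi_values_)        # PSI per evaluated feature
print(psi.features_to_drop_)  # e.g. ['drifting']
```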

Alternative feature selection methods#

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| SelectByShuffling() | × | × | Selects features if shuffling their values causes a drop in model performance |
| SelectByTargetMeanPerformance() | ✓ | × | Selects high-performing features, using the target mean as a performance proxy |
| ProbeFeatureSelection() | × | × | Selects features whose importance is greater than that of random variables |
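
ProbeFeatureSelection adds one or more random "probe" variables to the dataset, derives feature importance from the fitted estimator, and drops the real features whose importance does not exceed that of the probes; SelectByShuffling instead shuffles each feature and checks whether model performance drops. A minimal sketch of the probe approach follows; the estimator, number of probes and probe distribution are example choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from feature_engine.selection import ProbeFeatureSelection

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

sel = ProbeFeatureSelection(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    scoring="roc_auc",
    n_probes=3,              # number of random variables added as a baseline
    distribution="normal",   # distribution used to generate the probes
    cv=3,
)

X_t = sel.fit_transform(X, y)
print(sel.features_to_drop_)  # features less important than the random probes
```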

Other Feature Selection Libraries#

For additional feature selection algorithms visit the following open-source libraries:

- Scikit-learn hosts multiple filter and embedded methods that select features based on statistical tests or on importance derived from machine learning models.
- MLXtend hosts greedy (wrapper) feature selection methods.