Feature Selection

Feature-engine’s feature selection transformers drop subsets of variables with low predictive value. Feature-engine hosts selection algorithms that are, in general, not available in other libraries. These algorithms were gathered from data science competitions or are used in industry.

Feature-engine’s transformers select features based on different strategies. Some remove constant or quasi-constant features; others remove duplicated or correlated variables. Some select features based on the performance of a machine learning model, and some implement selection procedures used in finance. The remaining transformers support functionality developed in industry or in data science competitions.

The following tables list the algorithms that belong to each category.

Selection based on feature characteristics

Transformer	Categorical variables	Allows NA	Description
`DropFeatures()`	√	√	Drops arbitrary features specified by the user
`DropConstantFeatures()`	√	√	Drops constant and quasi-constant features
`DropDuplicateFeatures()`	√	√	Drops features that are duplicated
`DropCorrelatedFeatures()`	×	√	Drops features that are correlated
`SmartCorrelatedSelection()`	×	√	Drops the less useful features from each group of correlated features
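
To make the table above more concrete, here is a minimal plain-Python sketch of the checks behind constant, quasi-constant, and duplicate detection. It illustrates the idea only and is not Feature-engine's implementation: the function names, the dict-of-lists data layout, and the `tol` threshold argument are assumptions made for this example.

```python
from collections import Counter


def quasi_constant_columns(columns, tol=1.0):
    """Return names of columns whose most frequent value covers at least `tol` of the rows.

    `columns` maps a column name to a list of values. With tol=1.0 only strictly
    constant columns are flagged; a lower tol also flags quasi-constant ones.
    """
    flagged = []
    for name, values in columns.items():
        if not values:
            continue  # nothing to measure on an empty column
        # Share of rows taken by the single most common value.
        top_count = Counter(values).most_common(1)[0][1]
        if top_count / len(values) >= tol:
            flagged.append(name)
    return flagged


def duplicated_columns(columns):
    """Return names of columns whose values exactly repeat an earlier column."""
    seen = set()
    duplicates = []
    for name, values in columns.items():
        key = tuple(values)
        if key in seen:
            duplicates.append(name)  # keep the first occurrence, flag the repeat
        else:
            seen.add(key)
    return duplicates


# Hypothetical toy data, one list per column.
data = {
    "city": ["A", "A", "A", "A", "A"],   # constant
    "flag": [0, 0, 0, 0, 1],              # quasi-constant: 80% zeros
    "age": [23, 35, 41, 52, 60],
    "age_copy": [23, 35, 41, 52, 60],     # duplicate of "age"
}

print(quasi_constant_columns(data, tol=1.0))  # ['city']
print(quasi_constant_columns(data, tol=0.8))  # ['city', 'flag']
print(duplicated_columns(data))               # ['age_copy']
```

Lowering the threshold widens the net from strictly constant columns to columns dominated by one value, which is the distinction the table draws between constant and quasi-constant features.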

Selection based on a machine learning model

Transformer	Categorical variables	Allows NA	Description
`SelectBySingleFeaturePerformance()`	×	×	Selects features based on the performance of models trained on individual features
`RecursiveFeatureElimination()`	×	×	Removes features recursively by evaluating model performance
`RecursiveFeatureAddition()`	×	×	Adds features recursively by evaluating model performance
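
The recursive procedures can be sketched as a greedy loop over a performance score. The example below is a simplified illustration of backward elimination, not Feature-engine's code: `recursive_elimination`, `score_fn`, `ranking`, and `threshold` are hypothetical names, and the toy score stands in for what would normally be a cross-validated model metric.

```python
def recursive_elimination(features, score_fn, ranking, threshold=0.01):
    """Greedy backward elimination driven by a performance score.

    features  : list of candidate feature names
    score_fn  : callable taking a list of features, returning a score (higher is better)
    ranking   : the same features ordered from least to most important
    threshold : largest performance drop tolerated when removing a feature
    """
    selected = list(features)
    baseline = score_fn(selected)
    for feature in ranking:
        if len(selected) == 1:
            break  # always keep at least one feature
        candidate = [f for f in selected if f != feature]
        score = score_fn(candidate)
        if baseline - score <= threshold:
            # Removing the feature costs little: drop it and reset the baseline.
            selected = candidate
            baseline = score
        # Otherwise the feature is kept and the baseline is unchanged.
    return selected


# Hypothetical contribution of each feature to the score (assumption for the demo).
CONTRIBUTION = {"income": 0.20, "age": 0.10, "noise_1": 0.0, "noise_2": 0.005}


def toy_score(feature_subset):
    """Stand-in for model performance: a base score plus per-feature contributions."""
    return 0.5 + sum(CONTRIBUTION[f] for f in feature_subset)


features = ["income", "age", "noise_1", "noise_2"]
ranking = ["noise_1", "noise_2", "age", "income"]  # least important first

kept = recursive_elimination(features, toy_score, ranking, threshold=0.01)
print(kept)  # ['income', 'age']
```

Recursive addition mirrors this loop: it starts from the most important feature and keeps each further feature only if adding it improves the score by more than the threshold.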

Selection methods commonly used in finance

Transformer	Categorical variables	Allows NA	Description
`DropHighPSIFeatures()`	×	√	Drops features with high Population Stability Index
`SelectByInformationValue()`	√	×	Drops features with low information value
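
Since the table names the Population Stability Index without defining it, the sketch below shows the standard formula on pre-binned counts. It is a plain-Python illustration under stated assumptions, not Feature-engine's implementation: the function name, the count-based inputs, and the `eps` floor for empty bins are choices made for this example.

```python
import math


def population_stability_index(basis_counts, test_counts, eps=1e-4):
    """PSI between two binned distributions of the same feature.

    basis_counts, test_counts : observations per bin, same bins in the same order
    eps : floor applied to bin proportions so empty bins do not produce log(0)
    """
    if len(basis_counts) != len(test_counts):
        raise ValueError("both samples must use the same bins")
    basis_total = sum(basis_counts)
    test_total = sum(test_counts)
    psi = 0.0
    for b, t in zip(basis_counts, test_counts):
        p_basis = max(b / basis_total, eps)
        p_test = max(t / test_total, eps)
        # Each bin contributes (difference in share) * log(ratio of shares).
        psi += (p_test - p_basis) * math.log(p_test / p_basis)
    return psi


stable = population_stability_index([50, 50], [50, 50])
shifted = population_stability_index([50, 50], [25, 75])
print(round(stable, 4))   # 0.0
print(round(shifted, 4))  # 0.2747
```

A feature whose distribution barely moves between the two samples scores near zero, while a large value signals a shift. Information value has the same functional form, applied to the per-category distributions of the two target classes instead of two samples of the feature.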

Alternative feature selection methods

Transformer	Categorical variables	Allows NA	Description
`SelectByShuffling()`	×	×	Selects features if shuffling their values causes a drop in model performance
`SelectByTargetMeanPerformance()`	√	×	Selects high-performing features, using the target mean as a performance proxy
`ProbeFeatureSelection()`	×	×	Selects features whose importance is greater than that of random variables
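
The shuffling criterion can be illustrated in a few lines of plain Python. This is a minimal sketch of the idea, not Feature-engine's implementation: the helper names, the list-of-dicts data layout, the `n_repeats` averaging, and the hard-coded `predict` function standing in for a fitted model are all assumptions.

```python
import random


def accuracy(predict, rows, targets):
    """Fraction of rows for which the prediction matches the target."""
    hits = sum(predict(row) == y for row, y in zip(rows, targets))
    return hits / len(rows)


def shuffling_importance(predict, rows, targets, feature, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling one feature's values across rows."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, targets)
    drops = []
    for _ in range(n_repeats):
        values = [row[feature] for row in rows]
        rng.shuffle(values)
        # Rebuild the rows with only the chosen feature permuted.
        shuffled = [{**row, feature: v} for row, v in zip(rows, values)]
        drops.append(baseline - accuracy(predict, shuffled, targets))
    return sum(drops) / n_repeats


# Toy data: the target depends only on "signal"; "noise" is unrelated.
rows = [{"signal": i % 2, "noise": (i * 7) % 5} for i in range(40)]
targets = [row["signal"] for row in rows]


def predict(row):
    """Stand-in for a fitted model that has learned the true rule."""
    return row["signal"]


signal_drop = shuffling_importance(predict, rows, targets, "signal")
noise_drop = shuffling_importance(predict, rows, targets, "noise")
print(signal_drop > 0, noise_drop == 0)
```

Shuffling a feature the model relies on breaks its link to the target and lowers performance, whereas shuffling an unused feature leaves the score unchanged, so a selector built on this idea would keep `signal` and discard `noise`.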

Other Feature Selection Libraries

For additional feature selection algorithms visit the following open-source libraries:

Scikit-learn hosts multiple filter and embedded methods that select features based on statistical tests or on importance derived from machine learning models. MLXtend hosts greedy (wrapper) feature selection methods.
