Feature Selection

Feature-engine’s feature selection transformers drop subsets of variables or, put another way, select the subset of variables to keep. Feature-engine hosts selection algorithms that are, in general, not available in other libraries. These algorithms have been gathered from data science competitions or are used in industry.

Feature-engine’s transformers select features based on two strategies: they either look at the features’ intrinsic characteristics, such as their distributions or their relationships with other features, or they evaluate the features’ impact on the performance of a machine learning model.

The following tables list the algorithms that belong to each category.

Selection based on feature characteristics

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| DropFeatures() | √ | √ | Drops arbitrary features determined by user |
| DropConstantFeatures() | √ | √ | Drops constant and quasi-constant features |
| DropDuplicateFeatures() | √ | √ | Drops features that are duplicated |
| DropCorrelatedFeatures() | × | √ | Drops features that are correlated |
| SmartCorrelatedSelection() | × | √ | From a correlated feature group drops the less useful features |
| DropHighPSIFeatures() | × | √ | Drops features with high Population Stability Index |

Selection based on model performance

| Transformer | Categorical variables | Allows NA | Description |
|---|---|---|---|
| SelectByShuffling() | × | × | Selects features if shuffling their values causes a drop in model performance |
| SelectBySingleFeaturePerformance() | × | × | Selects features based on the performance of a model trained on each feature individually |
| SelectByTargetMeanPerformance() | √ | × | Using the target mean as performance proxy, selects high performing features |
| RecursiveFeatureElimination() | × | × | Removes features recursively by evaluating model performance |
| RecursiveFeatureAddition() | × | × | Adds features recursively by evaluating model performance |

Other Feature Selection Libraries

For additional feature selection algorithms, visit the following open-source libraries:

Scikit-learn hosts multiple filter and embedded methods that select features based on statistical tests or on importance derived from machine learning models.

MLXtend hosts greedy (wrapper) feature selection methods.
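For comparison, the following sketch shows what a filter method and an embedded method look like in scikit-learn. It assumes NumPy and scikit-learn are installed. The synthetic data, the choice of k=2, and the random forest settings are illustrative assumptions, not recommendations from either library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif

# Synthetic data: only the first two columns determine the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# Filter method: keep the 2 features with the highest ANOVA F-statistic.
filter_selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
filter_mask = filter_selector.get_support()

# Embedded method: keep features whose random forest importance
# is above the mean importance across all features.
embedded_selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
).fit(X, y)
embedded_mask = embedded_selector.get_support()

print(filter_mask)
print(embedded_mask)
```

Both selectors expose get_support(), which returns a boolean mask over the input columns.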