Categorical Encoding

Feature-engine’s categorical encoders replace the string values of categorical variables with estimated or arbitrary numbers. The following table summarizes the main functionality of each encoder.

Summary of the characteristics of Feature-engine’s encoders

Transformer               Regression  Classification  Multi-class  Description
OneHotEncoder()           √           √               √            Adds dummy variables to represent each category
OrdinalEncoder()          √           √               x            Replaces categories with an integer
CountFrequencyEncoder()   √           √               √            Replaces categories with their count or frequency
MeanEncoder()             √           √               x            Replaces categories with the target mean value
WoEEncoder()              x           √               x            Replaces categories with the weight of the evidence
DecisionTreeEncoder()     √           √               √            Replaces categories with the predictions of a decision tree
RareLabelEncoder()        √           √               √            Groups infrequent categories into a single one

Feature-engine’s categorical encoders work only with categorical variables by default. From version 1.1.0, you can set the parameter ignore_format to True to make the transformers also accept numerical variables as input.

Monotonicity

Most of Feature-engine’s encoders will return, or attempt to return, monotonic relationships between the encoded variable and the target. A monotonic relationship is one in which the value of one variable consistently increases, or consistently decreases, as the value of the other variable increases. The following illustration shows some examples:

[Figure: examples of monotonic relationships (../../_images/monotonic.png)]

Monotonic relationships tend to improve the performance of linear models and allow decision trees to be shallower.
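One way to obtain a monotonic relationship is to number the categories by their mean target value. The sketch below does this in plain Python on made-up data. It illustrates the idea only and is not Feature-engine’s implementation. The variable names and values are assumptions for the example.

```python
from collections import defaultdict

# Toy data: one categorical variable and a numeric target (illustrative values).
colours = ["red", "blue", "green", "blue", "red", "green", "blue"]
target = [10.0, 30.0, 20.0, 34.0, 14.0, 22.0, 32.0]

# Collect the target values observed for each category.
values_per_category = defaultdict(list)
for colour, y in zip(colours, target):
    values_per_category[colour].append(y)

# Mean target per category.
category_means = {c: sum(v) / len(v) for c, v in values_per_category.items()}

# Rank the categories by mean target: the lowest mean gets 0, the next 1, and so on.
ordered = sorted(category_means, key=category_means.get)
ordinal_mapping = {c: rank for rank, c in enumerate(ordered)}

# Replace each category with its rank.
encoded = [ordinal_mapping[c] for c in colours]

print(category_means)
print(ordinal_mapping)
print(encoded)
```

Because the integers follow the order of the category means, the mean target can only increase as the encoded value increases.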

Regression vs Classification

Most of Feature-engine’s encoders are suitable for both regression and classification, with the exception of the WoEEncoder() and the PRatioEncoder(), which are designed solely for binary classification.
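The restriction to binary targets follows from how the weight of evidence is defined. The plain-Python sketch below assumes the common definition, the logarithm of the ratio between a category’s share of the positive class and its share of the negative class. The data and names are made up, and the code is an illustration rather than Feature-engine’s implementation.

```python
import math
from collections import Counter

# Toy data: a categorical variable and a binary target (illustrative values).
category = ["a", "a", "a", "b", "b", "b"]
y = [1, 1, 0, 1, 0, 0]

# Count, per category, the observations in the positive and negative class.
pos_counts = Counter(c for c, t in zip(category, y) if t == 1)
neg_counts = Counter(c for c, t in zip(category, y) if t == 0)
total_pos = sum(pos_counts.values())
total_neg = sum(neg_counts.values())

woe = {}
for c in sorted(set(category)):
    share_pos = pos_counts[c] / total_pos  # share of all positives in c
    share_neg = neg_counts[c] / total_neg  # share of all negatives in c
    woe[c] = math.log(share_pos / share_neg)

print(woe)
```

The ratio needs exactly two classes, and it is undefined for a category that contains observations from only one of them.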

Multi-class classification

Finally, some of Feature-engine’s encoders can handle multi-class targets off-the-shelf, for example the OneHotEncoder(), the CountFrequencyEncoder() and the DecisionTreeEncoder().

Note that the MeanEncoder() and the OrdinalEncoder() will operate with multi-class targets, but the mean of the classes may not be meaningful, which defeats the purpose of these encoding techniques.
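A small plain-Python sketch shows why the mean of a multi-class target can be misleading. The data are made up, and the classes are assumed to be coded as the integers 0, 1 and 2.

```python
from collections import defaultdict

# Toy data: a categorical variable and a multi-class target coded 0, 1, 2.
city = ["A", "A", "B", "B"]
label = [0, 2, 1, 1]

# Collect the class labels observed for each category.
labels_per_city = defaultdict(list)
for c, t in zip(city, label):
    labels_per_city[c].append(t)

# Mean of the class codes per category.
means = {c: sum(v) / len(v) for c, v in labels_per_city.items()}

print(means)
```

Both cities receive the same encoded value, although city A contains only classes 0 and 2 and city B contains only class 1. The mean of the class codes hides that difference.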

Encoders