The WoERatioCategoricalEncoder() replaces the labels by the weight of evidence or the ratio of probabilities. It only works for binary classification.

The weight of evidence is given by: np.log( p(1) / p(0) )

The target probability ratio is given by: p(1) / p(0)

The WoERatioCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or the encoder will automatically select all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
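def load_titanic():
    # download the Titanic dataset from OpenML
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin
    data['cabin'] = data['cabin'].astype(str).str[0]
    # cast pclass to object so it is treated as categorical
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back instead of filling in place on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data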

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up a rare label encoder
rare_encoder = ce.RareLabelCategoricalEncoder(tol=0.03, n_categories=5,
                                variables=['cabin', 'pclass', 'embarked'])

# fit and transform data
train_t = rare_encoder.fit_transform(X_train)
test_t = rare_encoder.transform(X_test)

# set up a weight of evidence encoder
encoder = ce.WoERatioCategoricalEncoder(
    encoding_method='woe', variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(train_t, y_train)

# transform
train_t = encoder.transform(train_t)
test_t = encoder.transform(test_t)

# inspect the mappings learned during fit
encoder.encoder_dict_

{'cabin': {'B': 1.1631508098056806,
  'C': 0.2548922496287902,
  'D': 0.9382696385929302,
  'E': 0.9382696385929302,
  'Rare': 0.2719337154836416,
  'n': -0.8243393908312957},
 'pclass': {1: 0.4784902431230542,
  2: -0.25671984684781396,
  3: -1.0509842396788551},
 'embarked': {'C': 0.23309388216737797,
  'Q': -0.5172565140962812,
  'S': -0.6679453885859952}}
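
To read the values in the mapping above, it can help to convert a weight of evidence back into a probability. The sketch below assumes that p(0) = 1 - p(1) within each category, which agrees with the formulas on this page but is an assumption about how the encoder computes them. Under that assumption the weight of evidence is the log-odds of the target, so the logistic function inverts it. The helper woe_to_probability is hypothetical and not part of feature_engine.

```python
import numpy as np

def woe_to_probability(woe):
    """Invert woe = log(p1 / (1 - p1)) to recover p1 (hypothetical helper)."""
    return 1.0 / (1.0 + np.exp(-woe))

# WoE values copied from the 'pclass' entry of the mapping above
pclass_woe = {
    1: 0.4784902431230542,
    2: -0.25671984684781396,
    3: -1.0509842396788551,
}

# A positive WoE means p(1) > 0.5, a negative WoE means p(1) < 0.5
pclass_prob = {label: woe_to_probability(w) for label, w in pclass_woe.items()}

for label, p in pclass_prob.items():
    print(label, round(p, 3))
```

Read this way, a positive value marks a category in which the target is more often 1 than 0, and a negative value marks the opposite.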

API Reference

class feature_engine.categorical_encoders.WoERatioCategoricalEncoder(encoding_method='woe', variables=None)

The WoERatioCategoricalEncoder() replaces categories by the weight of evidence or by the ratio between the probability of the target = 1 and the probability of the target = 0.

The weight of evidence is given by: np.log( p(1) / p(0) )

The target probability ratio is given by: p(1) / p(0)

Note: This categorical encoder is exclusive for binary classification.

For example, in the variable colour, if the mean of the target for the category blue is 0.8, then p(1) = 0.8 and p(0) = 0.2. Blue will be replaced by np.log(0.8 / 0.2) = 1.386 if woe is selected, or by 0.8 / 0.2 = 4 if ratio is selected.
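
The colour example can be reproduced with plain pandas. This is a minimal sketch, not the encoder's own code: the toy dataframe and the column names colour and target are made up for illustration, and p(0) is assumed to be 1 - p(1) within each category.

```python
import numpy as np
import pandas as pd

# Toy data: 4 of the 5 'blue' rows have target = 1, so p(1) = 0.8 for blue
df = pd.DataFrame({
    'colour': ['blue'] * 5 + ['red'] * 5,
    'target': [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
})

p1 = df.groupby('colour')['target'].mean()  # p(1) per category
p0 = 1 - p1                                  # p(0) per category (assumed complement)

ratio = p1 / p0       # value used when encoding_method='ratio'
woe = np.log(ratio)   # value used when encoding_method='woe'

print(ratio.round(3).to_dict())
print(woe.round(3).to_dict())
```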

Note: division by 0 and log(0) are not defined. Thus, if p(0) = 0 for the ratio encoder, or if either p(0) = 0 or p(1) = 0 for woe, for any category in any of the variables, the encoder will return an error.
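
Before fitting, it may be worth checking whether any category would hit this condition. The sketch below is one way to do that with pandas, assuming p(1) is the mean of the target within each category; the dataframe and column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'colour': ['blue', 'blue', 'red', 'red', 'green', 'green'],
    'target': [1, 0, 0, 1, 1, 1],   # every 'green' row has target = 1
})

p1 = df.groupby('colour')['target'].mean()  # p(1) per category

# p(1) == 1 means p(0) == 0, which breaks both 'ratio' and 'woe';
# p(1) == 0 breaks 'woe' because log(0) is not defined
problem_categories = p1[(p1 == 0) | (p1 == 1)].index.tolist()

print(problem_categories)
```

Categories flagged this way are often infrequent ones, so grouping rare labels first, as the example at the top of this page does with the RareLabelCategoricalEncoder(), is one possible remedy.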

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed, the encoder will find and encode all categorical variables (object type) and ignore the rest.

The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
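
The two steps can be sketched as a small class that learns a dictionary in fit and applies it in transform. SimpleWoEEncoder and its mapping_ attribute are hypothetical names used only for illustration; this is not the feature_engine implementation, and it assumes p(0) = 1 - p(1) within each category.

```python
import numpy as np
import pandas as pd

class SimpleWoEEncoder:
    """Illustrative stand-in for the fit / transform pattern."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y):
        # learn one {category: woe} dictionary per variable
        self.mapping_ = {}
        for var in self.variables:
            p1 = y.groupby(X[var]).mean()
            self.mapping_[var] = np.log(p1 / (1 - p1)).to_dict()
        return self

    def transform(self, X):
        # replace categories by the learned numbers, leaving the input untouched
        X = X.copy()
        for var in self.variables:
            X[var] = X[var].map(self.mapping_[var])
        return X

X = pd.DataFrame({'colour': ['blue'] * 5 + ['red'] * 5})
y = pd.Series([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

enc = SimpleWoEEncoder(variables=['colour']).fit(X, y)
X_t = enc.transform(X)
print(enc.mapping_)
```

Keeping the learned numbers in a dictionary means the same mapping can be applied to the train set and the test set, so the test set never influences the encoding.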

  • encoding_method (str, default=‘woe’) – Desired method of encoding. ‘woe’: weight of evidence; ‘ratio’: probability ratio.
  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.
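
The behaviour described for variables=None can be approximated with pandas' select_dtypes. The helper find_categorical below is hypothetical and only sketches the idea; how the encoder selects variables internally may differ.

```python
import pandas as pd

# The string columns are created with dtype=object explicitly, so the example
# does not depend on how a given pandas version infers string dtypes
df = pd.DataFrame({
    'cabin': pd.Series(['B', 'C', 'n'], dtype=object),
    'embarked': pd.Series(['S', 'C', 'Q'], dtype=object),
    'age': pd.Series([22.0, 38.0, 26.0]),
})

def find_categorical(X, variables=None):
    """Return the given variables, or all object-type columns if None (hypothetical helper)."""
    if variables is None:
        return X.select_dtypes(include='object').columns.tolist()
    return list(variables)

print(find_categorical(df))
print(find_categorical(df, variables=['cabin']))
```

This is also why the example at the top of this page casts pclass to object: a numeric column would otherwise be skipped by automatic selection.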

encoder_dict_ – The dictionary containing the {category: woe} pairs or the {category: prob ratio} pairs used to replace the categories in each variable.

fit(self, X, y)

Learns the numbers that should be used to replace the categories in each variable.

  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables.
  • y – Target, must be binary [0,1].