WoEEncoder

API Reference

class feature_engine.encoding.WoEEncoder(variables=None)

The WoEEncoder() replaces categories by the weight of evidence (WoE). The WoE has been used primarily in the financial sector to create credit risk scorecards.

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed the encoder will find and encode all categorical variables (object type).

The encoder first maps the categories to the weight of evidence for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).

Note

This categorical encoding can only be used for binary classification.

The weight of evidence is given by:

\[\mathrm{WoE}(x_j) = \log\left( \frac{p(X = x_j \mid Y = 1)}{p(X = x_j \mid Y = 0)} \right)\]

The WoE is determined as follows:

We calculate the percentage of positive cases in each category out of the total of all positive cases. For example, 20 positive cases in category A out of 100 total positive cases equals 20%. Next, we calculate the percentage of negative cases in each category with respect to the total number of negative cases. For example, 5 negative cases in category A out of a total of 50 negative cases equals 10%. Then we calculate the WoE by dividing the category's percentage of positive cases by its percentage of negative cases and taking the logarithm. For category A in our example, WoE = log(20/10) = log(2), which is roughly 0.69 with the natural logarithm.
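The calculation described above can be reproduced by hand with pandas. This is a minimal sketch on toy data, not the encoder's implementation; the column names "category" and "target" are assumptions made for illustration, and the counts for category A match the example in the text.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): category A holds 20 of the 100 positive cases
# and 5 of the 50 negative cases; category B holds the rest.
df = pd.DataFrame({
    "category": ["A"] * 25 + ["B"] * 125,
    "target": [1] * 20 + [0] * 5 + [1] * 80 + [0] * 45,
})

# Count negative (0) and positive (1) cases per category.
counts = pd.crosstab(df["category"], df["target"])

# Share of all positives, and of all negatives, that fall in each category.
p_pos = counts[1] / counts[1].sum()
p_neg = counts[0] / counts[0].sum()

# Weight of evidence per category (natural logarithm).
woe = np.log(p_pos / p_neg)
print(woe)
```

Category A gets log(0.20 / 0.10) = log(2), and category B gets a negative value because it holds a larger share of the negatives than of the positives.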

Note

  • If the WoE is negative, the category holds a larger share of the negative cases than of the positive cases.

  • If the WoE is positive, the category holds a larger share of the positive cases than of the negative cases.

  • If the WoE is 0, the category holds equal shares of the positive and negative cases.

Encoding into WoE:

  • Creates a monotonic relationship between the encoded variable and the target

  • Returns variables in a similar scale

Note

Neither log(0) nor division by 0 is defined. Thus, if any of the terms in the WoE equation is 0 for a given category, the encoder will return an error. If this happens, try grouping less frequent categories.
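To see when this situation arises, the sketch below looks for categories that contain no positive or no negative cases, and then groups infrequent categories under a single label. It uses plain pandas on toy data; the column names, the 10% frequency threshold and the "Rare" label are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical): C contains only a positive case, D only a negative one.
df = pd.DataFrame({
    "category": ["A"] * 6 + ["B"] * 5 + ["C", "D"],
    "target": [1, 1, 0, 0, 1, 0] + [1, 0, 0, 1, 0] + [1, 0],
})

# A zero count in either class makes a term of the WoE equation 0.
counts = pd.crosstab(df["category"], df["target"])
undefined = counts.index[(counts == 0).any(axis=1)].tolist()

# Group categories seen in fewer than 10% of the rows under one label.
freq = df["category"].value_counts(normalize=True)
rare = freq[freq < 0.1].index
grouped = df["category"].where(~df["category"].isin(rare), "Rare")

# After grouping, every category has both positive and negative cases.
counts_grouped = pd.crosstab(grouped, df["target"])
still_undefined = counts_grouped.index[(counts_grouped == 0).any(axis=1)].tolist()

print(undefined, still_undefined)
```

Grouping only helps when the merged category ends up with cases of both classes, so it is worth repeating the check after grouping.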

Parameters
variables : list, default=None

The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.

Attributes

encoder_dict_ :

Dictionary with the WoE per variable.

See also

feature_engine.encoding.RareLabelEncoder
feature_engine.discretisation

Notes

For details on the calculation of the weight of evidence visit: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

In credit scoring, continuous variables are also transformed using the WoE. To do this, the variables are first sorted into a discrete number of bins, and then these bins are encoded with the WoE as explained here for categorical variables. You can do this by combining the equal width, equal frequency or arbitrary discretisers with the WoEEncoder().
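The two steps, binning and then encoding, can be sketched with plain pandas. Here pd.qcut stands in for an equal frequency discretiser and the WoE is computed by hand, so this illustrates the idea rather than the library's own pipeline. The synthetic data and the names "x", "x_bin" and "target" are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic data (hypothetical): larger x makes the positive class more likely.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = (x + rng.normal(size=1000) > 0).astype(int)
df = pd.DataFrame({"x": x, "target": y})

# Step 1: sort the continuous variable into 4 equal frequency bins.
df["x_bin"] = pd.qcut(df["x"], q=4, labels=["q1", "q2", "q3", "q4"])

# Step 2: compute the WoE of each bin, as for a categorical variable.
counts = pd.crosstab(df["x_bin"], df["target"])
p_pos = counts[1] / counts[1].sum()
p_neg = counts[0] / counts[0].sum()
woe_per_bin = np.log(p_pos / p_neg)

# Replace each bin label by its WoE.
df["x_woe"] = df["x_bin"].map(woe_per_bin).astype(float)
print(woe_per_bin)
```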

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
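The effect is easy to reproduce with a plain dictionary lookup. The sketch below assumes a learned mapping with illustrative, rounded values and applies it to data containing a category, "X", that is not in the mapping. It mimics the behaviour described above with pandas and does not call the encoder itself.

```python
import pandas as pd

# Hypothetical mapping learned on the training set (illustrative values).
woe_map = {"C": 0.70, "Q": -0.05, "S": -0.20}

# New data containing a category that was absent during training.
new_data = pd.Series(["S", "C", "X"])

# Categories missing from the mapping become NaN.
encoded = new_data.map(woe_map)
print(encoded)

# A simple check to run after encoding.
n_missing = int(encoded.isna().sum())
```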

Methods

fit:

Learn the WoE per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Convert the encoded numbers back to the original categories.

fit(X, y)

Learn the WoE.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the categorical variables.

y : pandas series.

Target, must be binary [0,1].

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If any user provided variables are not categorical.

ValueError
  • If there are no categorical variables in the dataframe or the dataframe is empty.

  • If variable(s) contain null values.

  • If y is not binary with values 0 and 1.

  • If p(0) = 0 or p(1) = 0 for any category, that is, if a category contains no negative cases or no positive cases.

inverse_transform(X)

Convert the encoded variable back to the original values.

Parameters
X : pandas dataframe of shape = [n_samples, n_features].

The transformed dataframe.

Returns
X : pandas dataframe of shape = [n_samples, n_features].

The un-transformed dataframe, with the categorical variables containing the original values.
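Conceptually, reverting the encoding amounts to looking up each number in the inverted mapping. The following is a minimal sketch of that idea with a hypothetical mapping, not the library's implementation. Note that inverting a dictionary is only unambiguous when no two categories share the same value.

```python
import pandas as pd

# Hypothetical mapping learned during fit (illustrative values).
woe_map = {"C": 0.70, "Q": -0.05, "S": -0.20}

# Invert the mapping: number -> category.
inverse_map = {value: category for category, value in woe_map.items()}

original = pd.Series(["C", "S", "Q"])
encoded = original.map(woe_map)      # categories -> numbers
decoded = encoded.map(inverse_map)   # numbers -> categories

# The round trip is lossless only if all values in the mapping are distinct.
is_invertible = len(inverse_map) == len(woe_map)
print(decoded.tolist(), is_invertible)
```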

rtype

DataFrame

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe is not of the same size as the one used in fit()

transform(X)

Replace categories with the learned parameters.

Parameters
X : pandas dataframe of shape = [n_samples, n_features].

The dataset to transform.

Returns
X : pandas dataframe of shape = [n_samples, n_features].

The dataframe containing the categories replaced by numbers.

rtype

DataFrame

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe is not of the same size as the one used in fit()

Warning

If NaN values were introduced after encoding.

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.encoding import WoEEncoder, RareLabelEncoder

# Load dataset
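def load_titanic():
    # Download the Titanic dataset from OpenML
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # Missing values are marked with '?' in this file
    data = data.replace('?', np.nan)
    # Keep only the first letter of the cabin (missing values become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # Treat passenger class as a categorical variable
    data['pclass'] = data['pclass'].astype('O')
    # Fill missing ports of embarkation; assign back instead of using inplace
    data['embarked'] = data['embarked'].fillna('C')
    return data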

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up a rare label encoder
rare_encoder = RareLabelEncoder(tol=0.03, n_categories=2, variables=['cabin', 'pclass', 'embarked'])

# fit and transform data
train_t = rare_encoder.fit_transform(X_train)
test_t = rare_encoder.transform(X_test)

# set up a weight of evidence encoder
woe_encoder = WoEEncoder(variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
woe_encoder.fit(train_t, y_train)

# transform
train_t = woe_encoder.transform(train_t)
test_t = woe_encoder.transform(test_t)

woe_encoder.encoder_dict_
{'cabin': {'B': 1.6299623810120747,
'C': 0.7217038208351837,
'D': 1.405081209799324,
'E': 1.405081209799324,
'Rare': 0.7387452866900354,
'n': -0.35752781962490193},
'pclass': {1: 0.9453018143294478,
2: 0.21009172435857942,
3: -0.5841726684724614},
'embarked': {'C': 0.6999054533737715,
'Q': -0.05044494288988759,
'S': -0.20113381737960143}}