WoERatioCategoricalEncoder

The WoERatioCategoricalEncoder() replaces categories with the weight of evidence or with the ratio of target probabilities. It only works for binary classification.

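The weight of evidence is given by: np.log( P(X=xj|Y=1) / P(X=xj|Y=0) )

The target probability ratio is given by: p(1) / p(0), where p(1) and p(0) are the probabilities of target = 1 and target = 0 within each category.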
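
As a rough illustration of the two formulas, the sketch below computes both encodings by hand with pandas on a made-up colour variable. The toy data and column names are assumptions, and this is not the library's internal code. The weight of evidence follows the conditional-probability form given in the API reference: the share of each category among the positives over its share among the negatives.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one categorical variable and a binary target
df = pd.DataFrame({
    "colour": ["blue"] * 5 + ["red"] * 3,
    "target": [1, 1, 1, 1, 0, 1, 0, 0],
})

# Number of observations with target = 0 and target = 1 per category
counts = pd.crosstab(df["colour"], df["target"])

# Probability ratio: p(1) / p(0) within each category
ratio = counts[1] / counts[0]

# Weight of evidence: share of the category among the positives divided
# by its share among the negatives, followed by the natural logarithm
woe = np.log((counts[1] / counts[1].sum()) / (counts[0] / counts[0].sum()))

print(ratio.to_dict())
print(woe.to_dict())
```

Because the two classes are not balanced in this toy data, the weight of evidence differs from the plain log of the probability ratio.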

The WoERatioCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or the encoder will automatically select all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing values become 'n'
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # fill missing ports of embarkation with 'C'
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up a rare label encoder
rare_encoder = ce.RareLabelCategoricalEncoder(
    tol=0.03, n_categories=2, variables=['cabin', 'pclass', 'embarked'])

# fit and transform data
train_t = rare_encoder.fit_transform(X_train)
test_t = rare_encoder.transform(X_test)

# set up a weight of evidence encoder
woe_encoder = ce.WoERatioCategoricalEncoder(
    encoding_method='woe', variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
woe_encoder.fit(train_t, y_train)

# transform
train_t = woe_encoder.transform(train_t)
test_t = woe_encoder.transform(test_t)

# inspect the learned mappings
woe_encoder.encoder_dict_

{'cabin': {'B': 1.6299623810120747,
           'C': 0.7217038208351837,
           'D': 1.405081209799324,
           'E': 1.405081209799324,
           'Rare': 0.7387452866900354,
           'n': -0.35752781962490193},
 'pclass': {1: 0.9453018143294478,
            2: 0.21009172435857942,
            3: -0.5841726684724614},
 'embarked': {'C': 0.6999054533737715,
              'Q': -0.05044494288988759,
              'S': -0.20113381737960143}}

API Reference

class feature_engine.categorical_encoders.WoERatioCategoricalEncoder(encoding_method='woe', variables=None)

The WoERatioCategoricalEncoder() replaces categories by the weight of evidence or by the ratio between the probability of the target = 1 and the probability of the target = 0.

The weight of evidence is given by: np.log(P(X=xj|Y = 1)/P(X=xj|Y=0))

The target probability ratio is given by: p(1) / p(0), where p(1) and p(0) are the probabilities of target = 1 and target = 0 within the category.

The log of the target probability ratio is given by: np.log( p(1) / p(0) )

Note: This categorical encoding is intended exclusively for binary classification.

For example, in the variable colour, if the mean of the target for the category blue is 0.8, so that the probability of target = 1 is 0.8 and the probability of target = 0 is 0.2, blue will be replaced by np.log(0.8 / 0.2) = 1.386 if log_ratio is selected. Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.

For details on the calculation of the weight of evidence visit: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

Note: division by 0 and log(0) are not defined. Thus, if p(0) = 0 for the ratio encoder, or if either p(0) = 0 or p(1) = 0 for woe or log_ratio, for any category of any of the variables, the encoder will return an error.
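
One way to spot this problem before fitting is to look for categories in which one of the two classes never occurs. The following is a minimal sketch with pandas on made-up data. It is a pre-check a user might run, not part of the encoder, and the variable names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): "green" never appears with target = 0
df = pd.DataFrame({
    "colour": ["blue", "blue", "red", "red", "green", "green"],
    "target": [1, 0, 0, 1, 1, 1],
})

# Mean of the target per category, i.e. p(1); p(0) is its complement
p1 = df.groupby("colour")["target"].mean()
p0 = 1 - p1

# Categories that would lead to a division by zero or to log(0)
zero_p0 = p0[p0 == 0].index.tolist()  # affects 'ratio', 'woe' and 'log_ratio'
zero_p1 = p1[p1 == 0].index.tolist()  # affects 'woe' and 'log_ratio'

print(zero_p0, zero_p1)
```

Grouping infrequent labels first, as the example above does with the RareLabelCategoricalEncoder, may reduce the chance of ending up with such categories, although it does not guarantee it.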

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode all categorical variables (object type).

The encoder first maps the categories to the numbers for each variable (fit).

The encoder then transforms the categories into the mapped numbers (transform).
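
Conceptually, the two steps amount to building one dictionary per variable and then applying it. The sketch below mimics this with plain pandas for the 'ratio' method. It is an illustration under simplifying assumptions (toy data, and the hypothetical helper names `learn_ratio_mapping` and `apply_mapping`), not the encoder's actual implementation.

```python
import pandas as pd

def learn_ratio_mapping(X, y, variables):
    """'fit' step: learn {variable: {category: p(1) / p(0)}}."""
    mapping = {}
    for var in variables:
        p1 = y.groupby(X[var]).mean()  # p(1) per category
        mapping[var] = (p1 / (1 - p1)).to_dict()
    return mapping

def apply_mapping(X, mapping):
    """'transform' step: replace each category by its learned number."""
    X = X.copy()  # leave the input dataframe untouched
    for var, var_map in mapping.items():
        X[var] = X[var].map(var_map)
    return X

# Toy data (hypothetical)
X = pd.DataFrame({"colour": ["blue", "blue", "blue", "red", "red", "red"]})
y = pd.Series([1, 1, 0, 1, 0, 0])

mapping = learn_ratio_mapping(X, y, ["colour"])
X_t = apply_mapping(X, mapping)

print(mapping)
print(X_t)
```

In this sketch, a category that was not seen during the fit step would be mapped to NaN by `Series.map`; that is a property of the sketch, not a statement about how the encoder treats unseen categories.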

Parameters
  • encoding_method (str, default=woe) –

    Desired method of encoding.

    ‘woe’: weight of evidence

    ‘ratio’: probability ratio

    ‘log_ratio’: log of the probability ratio

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.

fit(X, y)

Learns the numbers that should be used to replace the categories in each variable, that is, the WoE or the probability ratio.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.

  • y (pandas series) – Target, must be binary, with values 0 and 1.

encoder_dict_

The dictionary containing the {category: WoE / ratio} pairs per variable.

Type

dictionary

inverse_transform(X)

Convert the data back to the original representation.

Parameters

X_transformed (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Returns

X – The un-transformed dataframe, that is, containing the original values of the categorical variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
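
Inverting the encoding amounts to reversing each {category: number} dictionary. The sketch below shows the idea in plain Python, using the cabin values from the example output above. It is only an illustration: the reverse lookup is ambiguous whenever two categories share the same encoded value, as 'D' and 'E' do in that output, and how the encoder itself handles such ties is not covered here.

```python
# Learned mapping for 'cabin', copied from the example output
cabin_map = {
    "B": 1.6299623810120747,
    "C": 0.7217038208351837,
    "D": 1.405081209799324,
    "E": 1.405081209799324,
    "Rare": 0.7387452866900354,
    "n": -0.35752781962490193,
}

# Reverse lookup: encoded value -> list of categories that produce it
inverse = {}
for category, value in cabin_map.items():
    inverse.setdefault(value, []).append(category)

# Values that map back to more than one category cannot be inverted uniquely
ambiguous = {v: cats for v, cats in inverse.items() if len(cats) > 1}

print(ambiguous)
```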

transform(X)

Replaces categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe containing categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]