PRatioEncoder

API Reference

class feature_engine.encoding.PRatioEncoder(encoding_method='ratio', variables=None)[source]

The PRatioEncoder() replaces categories by the ratio of the probability of the target = 1 and the probability of the target = 0.

The target probability ratio is given by:

\[p(1) / p(0)\]

The log of the target probability ratio is:

\[\log( p(1) / p(0) )\]

Note

This categorical encoding is exclusively for binary classification.

For example, in the variable colour, if the mean of the target for the category blue is 0.8, then p(1) = 0.8 and p(0) = 0.2. Blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected, or by log(0.8 / 0.2) = 1.386 if log_ratio is selected.

Note: division by 0 and log(0) are not defined. Thus, if p(0) = 0 for any category when using ratio, or if either p(0) = 0 or p(1) = 0 for any category when using log_ratio, in any of the variables, the encoder will return an error.
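The arithmetic and the zero-probability rule above can be sketched in a few lines. This is a minimal illustration, not the encoder's implementation: the helper name encode_value is hypothetical, and it assumes p(0) = 1 - p(1) for a binary target.

```python
import math

def encode_value(p1, method="ratio"):
    """Return p(1) / p(0), or its natural log, for one category.

    p1 is the mean of the binary target within the category,
    so p(0) = 1 - p1. Illustrative helper, not part of feature_engine.
    """
    p0 = 1.0 - p1
    if method == "ratio":
        # only the denominator matters for the plain ratio
        if p0 == 0:
            raise ValueError("p(0) = 0: the ratio p(1) / p(0) is undefined")
        return p1 / p0
    if method == "log_ratio":
        # the log needs a strictly positive, finite ratio
        if p0 == 0 or p1 == 0:
            raise ValueError("p(0) or p(1) is 0: the log ratio is undefined")
        return math.log(p1 / p0)
    raise ValueError("method must be 'ratio' or 'log_ratio'")

# The colour example: blue has a target mean of 0.8
print(encode_value(0.8, "ratio"))                 # approximately 4
print(round(encode_value(0.8, "log_ratio"), 3))   # approximately 1.386
```

Note that a category with p(1) = 0 is still valid for ratio (it encodes to 0) but not for log_ratio.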

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed the encoder will find and encode all categorical variables (object type).

The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
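The two-step fit and transform flow can be pictured with plain Python lists. This is a sketch under stated assumptions: fit_ratio and transform are hypothetical helpers written for this illustration, and the real encoder works on pandas dataframes and stores one mapping per variable.

```python
from collections import defaultdict

def fit_ratio(categories, target):
    """Learn p(1) / p(0) per category from two parallel lists."""
    counts = defaultdict(lambda: [0, 0])  # category -> [count of y=0, count of y=1]
    for cat, y in zip(categories, target):
        counts[cat][y] += 1
    mapping = {}
    for cat, (n0, n1) in counts.items():
        if n0 == 0:
            raise ValueError(f"p(0) = 0 for category {cat!r}")
        # (n1 / n) / (n0 / n) simplifies to n1 / n0
        mapping[cat] = n1 / n0
    return mapping

def transform(categories, mapping):
    """Replace each category with the number learned in fit_ratio."""
    return [mapping[cat] for cat in categories]

colour = ["blue", "blue", "blue", "blue", "blue", "red", "red", "red", "red"]
y = [1, 1, 1, 1, 0, 1, 0, 0, 0]

mapping = fit_ratio(colour, y)        # step 1: learn the numbers
encoded = transform(colour, mapping)  # step 2: replace the categories
print(mapping)
```

Here blue has four positives and one negative, so it maps to 4; red has one positive and three negatives, so it maps to 1/3.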

Parameters
encoding_method : str, default='ratio'

Desired method of encoding.

‘ratio’ : probability ratio

‘log_ratio’ : log probability ratio

variables : list, default=None

The list of categorical variables to encode. If None, the encoder will find and select all object type variables.

Attributes

encoder_dict_ :

Dictionary with the probability ratio per category per variable.

Notes

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
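To see why unseen categories end up as missing values, consider a lookup against the learned mapping. The snippet below is only a sketch: the mapping values are made up, and the dictionary lookup stands in for whatever replacement mechanism the encoder uses internally.

```python
import math

# hypothetical numbers learned during fit for one variable
mapping = {"S": 0.51, "C": 1.26, "Q": 0.60}

# "X" never appeared in the training data
test_values = ["S", "C", "X"]

# a category with no learned number has nothing to map to, so it becomes NaN
encoded = [mapping.get(value, float("nan")) for value in test_values]
n_missing = sum(math.isnan(value) for value in encoded)

print(encoded)
print(n_missing)
```

Counting the NaN values after transforming a test set is a quick way to check whether rare categories need grouping first.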

Methods

fit:

Learn probability ratio per category, per variable.

transform:

Encode categories into numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Convert the numbers back into the original categories.

fit(X, y)[source]

Learn the numbers that should be used to replace the categories in each variable, that is, the probability ratio or its log, depending on encoding_method.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the categorical variables.

y : pandas series

Target, must be binary [0,1].

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If any user provided variables are not categorical.

ValueError
  • If there are no categorical variables in df or df is empty

  • If variable(s) contain null values.

  • If y is not binary with values 0 and 1.

  • If p(0) = 0 for any category (ratio), or if either p(0) or p(1) is 0 for any category (log_ratio).

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The transformed dataframe.

Returns
X : pandas dataframe of shape = [n_samples, n_features]

The un-transformed dataframe, with the categorical variables containing the original values.

rtype

DataFrame

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe is not of same size as that used in fit()
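One caveat when reversing an encoding is that the mapping is not always one-to-one. In the example output at the end of this page, the cabin categories D and E share the same ratio. The sketch below naively inverts such a dictionary to show the collision; it is an illustration only, and this page does not state how inverse_transform resolves such ties.

```python
# ratios copied from the cabin entry of the example encoder_dict_ on this page
cabin_mapping = {
    "B": 3.1999999999999993,
    "D": 2.5555555555555554,
    "E": 2.5555555555555554,
}

# naive inversion: number -> category
inverse = {value: category for category, value in cabin_mapping.items()}

# three categories went in, but only two distinct numbers come out
print(len(cabin_mapping), len(inverse))
print(inverse[2.5555555555555554])
```

With a plain dictionary, the last category inserted wins, so rows that were originally D would come back as E in this sketch.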

transform(X)[source]

Replace categories with the learned parameters.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The dataset to transform.

Returns
X : pandas dataframe of shape = [n_samples, n_features]

The dataframe containing the categories replaced by numbers.

rtype

DataFrame

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If dataframe is not of same size as that used in fit()

Warning

Issued if NaN values were introduced after encoding.

Example

The PRatioEncoder() replaces the labels by the ratio of probabilities. It only works for binary classification.

The target probability ratio is given by: p(1) / p(0)

The log of the target probability ratio is: np.log( p(1) / p(0) )

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import PRatioEncoder, RareLabelEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are coded as '?' in this file
    data = data.replace('?', np.nan)
    # keep only the first character of cabin; missing values become 'n'
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up a rare label encoder
rare_encoder = RareLabelEncoder(tol=0.03, n_categories=2, variables=['cabin', 'pclass', 'embarked'])

# fit on the train set, then transform both train and test sets
train_t = rare_encoder.fit_transform(X_train)
test_t = rare_encoder.transform(X_test)

# set up a probability ratio encoder
pratio_encoder = PRatioEncoder(encoding_method='ratio', variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
pratio_encoder.fit(train_t, y_train)

# transform
train_t = pratio_encoder.transform(train_t)
test_t = pratio_encoder.transform(test_t)

pratio_encoder.encoder_dict_
{'cabin': {'B': 3.1999999999999993,
  'C': 1.2903225806451615,
  'D': 2.5555555555555554,
  'E': 2.5555555555555554,
  'Rare': 1.3124999999999998,
  'n': 0.4385245901639344},
 'pclass': {1: 1.6136363636363635,
  2: 0.7735849056603774,
  3: 0.34959349593495936},
 'embarked': {'C': 1.2625000000000002,
  'Q': 0.5961538461538461,
  'S': 0.5127610208816704}}
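To read these numbers more intuitively, a ratio r = p(1) / p(0) can be converted back to a probability, since p(0) = 1 - p(1) gives p(1) = r / (1 + r). The helper below is a hypothetical convenience for interpreting the output, not something the encoder provides.

```python
def ratio_to_probability(ratio):
    """Recover p(1) from r = p(1) / p(0), using p(0) = 1 - p(1)."""
    return ratio / (1.0 + ratio)

# pclass ratios copied from the encoder_dict_ shown above
pclass_ratios = {
    1: 1.6136363636363635,
    2: 0.7735849056603774,
    3: 0.34959349593495936,
}

pclass_probs = {
    pclass: round(ratio_to_probability(ratio), 3)
    for pclass, ratio in pclass_ratios.items()
}
print(pclass_probs)
```

A ratio above 1 means the positive class is more likely than not within that category, which here is the case only for first class.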