OrdinalEncoder

API Reference

class feature_engine.encoding.OrdinalEncoder(encoding_method='ordered', variables=None)[source]

The OrdinalEncoder() replaces categories with ordinal numbers (0, 1, 2, 3, etc.). The numbers can be ordered based on the mean of the target per category, or assigned arbitrarily.

Ordered ordinal encoding: for the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 1, red by 2 and grey by 0.
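The colour example above can be reproduced with a short plain-Python sketch. This is only an illustration of the ordering rule, not the library's implementation; the helper name ordered_mapping and the toy data are hypothetical, chosen so that the target means come out as 0.5, 0.8 and 0.1.

```python
from collections import defaultdict

def ordered_mapping(categories, target):
    """Map each category to an integer, ranked by ascending target mean."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, target):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    # Lowest target mean gets 0, highest gets k-1
    ranked = sorted(means, key=means.get)
    return {cat: i for i, cat in enumerate(ranked)}

# Hypothetical toy data: target means are blue=0.5, red=0.8, grey=0.1
colour = ["blue"] * 2 + ["red"] * 5 + ["grey"] * 10
target = [1, 0] + [1, 1, 1, 1, 0] + [1] + [0] * 9

colour_mapping = ordered_mapping(colour, target)
print(colour_mapping)
```

With this data, grey (lowest mean) maps to 0, blue to 1 and red (highest mean) to 2, matching the description above.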

Arbitrary ordinal encoding: the numbers are assigned to the categories arbitrarily, on a first-seen, first-served basis.
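A minimal sketch of the "first seen, first served" idea, again in plain Python. The helper name arbitrary_mapping is hypothetical, and the sketch assumes the numbers simply follow order of first appearance; the library's internals may differ.

```python
def arbitrary_mapping(categories):
    """Number categories in the order in which they first appear."""
    mapping = {}
    for cat in categories:
        if cat not in mapping:
            # The next free integer goes to each newly seen category
            mapping[cat] = len(mapping)
    return mapping

seen_order = arbitrary_mapping(["red", "blue", "red", "grey", "blue"])
print(seen_order)
```

Note that the resulting numbers carry no information about the target; they depend only on the row order of the data.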

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed, the encoder will find and encode all categorical variables (type ‘object’).

The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories to the mapped numbers (transform).

Parameters
encoding_method : str, default='ordered'

Desired method of encoding.

‘ordered’: the categories are numbered in ascending order according to the target mean value per category.

‘arbitrary’ : categories are numbered arbitrarily.

variables : list, default=None

The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.

Attributes

encoder_dict_ :

Dictionary with the ordinal number per category, per variable.

Notes

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
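Why unseen categories turn into NaN can be illustrated with a small sketch. It assumes the transform step behaves like a dictionary lookup with NaN as the fallback; the mapping and the helper name encode are hypothetical.

```python
import math

# Hypothetical mapping learned from the training data
learned = {"grey": 0, "blue": 1, "red": 2}

def encode(values, mapping):
    """Replace each category by its number; unseen categories become NaN."""
    return [mapping.get(v, float("nan")) for v in values]

# "green" never appeared in the training data
encoded = encode(["blue", "green", "red"], learned)
n_missing = sum(1 for v in encoded if isinstance(v, float) and math.isnan(v))
print(encoded, n_missing)
```

Counting the NaN values after transforming new data is a quick way to spot categories that were absent during fit.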

References

Encoding into integers ordered following target mean was discussed in the following talk at PyData London 2017:

[1] Galli S. “Machine Learning in Financial Risk Assessment”. https://www.youtube.com/watch?v=KHGGlozsRtA

Methods

fit:

Find the integer to replace each category in each variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Convert the numbers back into the original categories.

fit(X, y=None)[source]

Learn the numbers to be used to replace the categories in each variable.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the variables to be encoded.

y : pandas series, default=None

The target. Can be None if encoding_method='arbitrary'. Otherwise, y must be passed when fitting the transformer.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If any user provided variable is not categorical

ValueError
  • If there are no categorical variables in the df or the df is empty

  • If the variable(s) contain null values

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The transformed dataframe.

Returns
X : pandas dataframe of shape = [n_samples, n_features]

The un-transformed dataframe, with the categorical variables containing the original values.

Return type

DataFrame

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe is not of same size as that used in fit()
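Conceptually, reversing the encoding amounts to inverting the category-to-number dictionary for each variable. The sketch below illustrates that idea in plain Python; it is not the library's code, and the mapping shown is a hypothetical stand-in for what encoder_dict_ might hold.

```python
# Hypothetical learned mapping, in the same shape as encoder_dict_
encoder_dict = {"embarked": {"S": 0, "Q": 1, "C": 2}}

# Invert each per-variable mapping: number -> original category
inverse_dict = {
    var: {num: cat for cat, num in mapping.items()}
    for var, mapping in encoder_dict.items()
}

encoded_column = [2, 0, 1, 0]
decoded_column = [inverse_dict["embarked"][n] for n in encoded_column]
print(decoded_column)
```

The inversion is well defined because each category receives a distinct number within a variable.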

transform(X)[source]

Replace categories with the learned parameters.

Parameters
X : pandas dataframe of shape = [n_samples, n_features]

The dataset to transform.

Returns
X : pandas dataframe of shape = [n_samples, n_features]

The dataframe containing the categories replaced by numbers.

Return type

DataFrame

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If dataframe is not of same size as that used in fit()

Warning

If NaN values were introduced after encoding.

Example

The OrdinalEncoder() replaces the categories with integers from 0 to k-1, where k is the number of different categories. If we select “arbitrary”, the encoder assigns numbers as the labels appear in the variable (first come, first served). If we select “ordered”, the encoder assigns numbers in ascending order of the mean of the target for each label. So labels for which the mean of the target is smallest get the number 0, and those for which the mean of the target is highest get the number k-1. This way, we create a monotonic relationship between the encoded variable and the target.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.encoding import OrdinalEncoder

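# Load the Titanic dataset and apply minimal cleaning
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # '?' marks missing values in this file
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin (missing cabins become 'n', from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data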

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = OrdinalEncoder(encoding_method='ordered', variables=['pclass', 'cabin', 'embarked'])

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

# inspect the learned mappings (the output is shown below)
encoder.encoder_dict_
{'pclass': {3: 0, 2: 1, 1: 2},
 'cabin': {'T': 0,
  'n': 1,
  'G': 2,
  'A': 3,
  'C': 4,
  'F': 5,
  'D': 6,
  'E': 7,
  'B': 8},
 'embarked': {'S': 0, 'Q': 1, 'C': 2}}