OrdinalCategoricalEncoder

The OrdinalCategoricalEncoder() replaces categories with integers from 0 to k-1, where k is the number of distinct categories. With encoding_method='arbitrary', the encoder assigns numbers in the order in which the labels appear in the variable (first come, first served). With encoding_method='ordered', the encoder assigns numbers in ascending order of the mean of the target for each label: the label with the smallest target mean gets 0, and the label with the highest target mean gets k-1.

The OrdinalCategoricalEncoder() works only with categorical variables. A list of variables can be passed; otherwise, the encoder automatically selects all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

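# Load the Titanic dataset and prepare the categorical variables
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are stored as '?' in this file
    data = data.replace('?', np.nan)
    # keep the first letter of the cabin; missing cabins show up as 'n' (from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data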

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = ce.OrdinalCategoricalEncoder(
    encoding_method='ordered',
    variables=['pclass', 'cabin', 'embarked'])

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

encoder.encoder_dict_
{'pclass': {3: 0, 2: 1, 1: 2},
 'cabin': {'T': 0,
  'n': 1,
  'G': 2,
  'A': 3,
  'C': 4,
  'F': 5,
  'D': 6,
  'E': 7,
  'B': 8},
 'embarked': {'S': 0, 'Q': 1, 'C': 2}}

API Reference

class feature_engine.categorical_encoders.OrdinalCategoricalEncoder(encoding_method='ordered', variables=None)

The OrdinalCategoricalEncoder() replaces categories by ordinal numbers (0, 1, 2, 3, etc). The numbers can be ordered based on the mean of the target per category, or assigned arbitrarily.

Ordered ordinal encoding: for the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 1, red by 2 and grey by 0.
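
The colour example can be reproduced in a few lines of plain Python. This is only a sketch of the ranking logic, not the library's implementation; `target_means` is a hypothetical dictionary holding the per-category means quoted in the example.

```python
# Hypothetical per-category target means, taken from the colour example.
target_means = {'blue': 0.5, 'red': 0.8, 'grey': 0.1}

# Sort the categories by ascending target mean, then number them 0..k-1.
ordered_categories = sorted(target_means, key=target_means.get)
ordinal_mapping = {category: rank for rank, category in enumerate(ordered_categories)}

print(ordinal_mapping)  # {'grey': 0, 'blue': 1, 'red': 2}
```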

Arbitrary ordinal encoding: the numbers will be assigned arbitrarily to the categories, on a first seen first served basis.
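
A minimal sketch of first-seen numbering in plain Python, assuming a hypothetical list `observed` that stands in for one column of the training set. It illustrates the idea only and is not the library's code.

```python
# Hypothetical column values, in the order they appear in the training set.
observed = ['S', 'C', 'S', 'Q', 'C', 'S']

# dict.fromkeys keeps the first occurrence of each value, in order of appearance.
first_seen = list(dict.fromkeys(observed))

# Number the categories 0..k-1 in that order.
arbitrary_mapping = {category: number for number, category in enumerate(first_seen)}

print(arbitrary_mapping)  # {'S': 0, 'C': 1, 'Q': 2}
```

The numbers carry no meaning here: a different row order in the training set would produce a different mapping.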

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed, the encoder will find and encode all categorical variables (type ‘object’).
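
The automatic selection can be approximated with a dtype check in pandas. The sketch below assumes a hypothetical dataframe `df`; the columns are created with an explicit object dtype so the check behaves the same across pandas versions. It shows one plausible way to find object-type variables and is not necessarily how the encoder does it internally.

```python
import pandas as pd

# Hypothetical dataframe mixing object-type and numerical columns.
df = pd.DataFrame({
    'cabin': pd.Series(['B', 'C', 'n'], dtype=object),
    'embarked': pd.Series(['S', 'C', 'Q'], dtype=object),
    'age': [29.0, 41.5, 30.0],
    'fare': [211.3, 151.6, 26.0],
})

# Keep only the columns whose dtype is object.
categorical_variables = [col for col in df.columns if df[col].dtype == object]

print(categorical_variables)  # ['cabin', 'embarked']
```

Categorical variables stored as numbers, such as pclass in the Titanic data, are not picked up by this check unless they are first cast to object.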

The encoder first maps the categories to the numbers for each variable (fit).

The encoder then transforms the categories to the mapped numbers (transform).

Parameters
  • encoding_method (str, default='ordered') –

    Desired method of encoding.

    'ordered': the categories are numbered in ascending order according to the target mean value per category.

    'arbitrary': categories are numbered arbitrarily.

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.

encoder_dict_

The dictionary containing the {category: ordinal number} pairs for every variable.

Type

dictionary

fit(X, y=None)

Learns the numbers to be used to replace the categories in each variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be encoded.

  • y (pandas series, default=None) – The target. Can be None if encoding_method='arbitrary'; otherwise, y must be passed when fitting the transformer.

inverse_transform(X)

Convert the data back to the original representation.

Parameters

X_transformed (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Returns

X – The un-transformed dataframe, that is, containing the original values of the categorical variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
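
Conceptually, reversing the encoding means swapping the keys and values of each {category: ordinal number} mapping. The plain-Python sketch below assumes a hypothetical mapping `embarked_mapping` in the same format as an entry of encoder_dict_; it is an illustration, not the library's implementation.

```python
# Hypothetical mapping in the {category: ordinal number} format.
embarked_mapping = {'S': 0, 'Q': 1, 'C': 2}

# Swap keys and values to obtain {ordinal number: category}.
inverse_mapping = {number: category for category, number in embarked_mapping.items()}

# Hypothetical encoded values, decoded back to the original categories.
encoded = [0, 2, 1, 0]
decoded = [inverse_mapping[number] for number in encoded]

print(decoded)  # ['S', 'C', 'Q', 'S']
```

The swap is lossless because each category receives a distinct number.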

transform(X)

Replaces categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe containing categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]
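
The replacement step can be sketched with pandas `Series.map`. The names `mappings` and `X` below are hypothetical stand-ins for the learned dictionary and the input dataframe; this is a sketch of the idea under those assumptions, not the encoder's actual code.

```python
import pandas as pd

# Hypothetical learned mappings, in the same format as encoder_dict_.
mappings = {'embarked': {'S': 0, 'Q': 1, 'C': 2}}

# Hypothetical input samples; 'age' is not in the mappings and stays untouched.
X = pd.DataFrame({
    'embarked': ['S', 'C', 'Q', 'S'],
    'age': [22.0, 38.0, 26.0, 35.0],
})

# Work on a copy so the original dataframe is left unchanged.
X_transformed = X.copy()
for variable, mapping in mappings.items():
    X_transformed[variable] = X_transformed[variable].map(mapping)

print(X_transformed['embarked'].tolist())  # [0, 2, 1, 0]
```

With `Series.map`, a category that is missing from the mapping becomes NaN, so in a sketch like this it is worth checking the output for missing values when the data may contain categories not seen during fit.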