OneHotCategoricalEncoder

The OneHotCategoricalEncoder() replaces categorical variables with a set of binary variables, one per unique category. The encoder has the option to create k or k-1 binary variables, where k is the number of unique categories.

The encoder can also create binary variables for only the n most popular categories, where n is set by the user. For example, if we encode the 6 most popular categories, binary variables are created only for those 6 categories, and the remaining categories are dropped.

The OneHotCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or the encoder will automatically select all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
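def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing values become 'n' (from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of calling fillna(inplace=True) on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data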

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = ce.OneHotCategoricalEncoder(
    top_categories=2,
    variables=['pclass', 'cabin', 'embarked'],
    drop_last=False)

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

# inspect the categories learned for each variable
encoder.encoder_dict_
# {'pclass': [3, 1], 'cabin': ['n', 'C'], 'embarked': ['S', 'C']}

API Reference

class feature_engine.categorical_encoders.OneHotCategoricalEncoder(top_categories=None, variables=None, drop_last=False)[source]

One hot encoding consists of replacing a categorical variable with a set of binary variables that take the value 0 or 1 to indicate whether a given category is present in an observation.

Each of these binary variables is also known as a dummy variable. For example, from the categorical variable “Gender” with categories ‘female’ and ‘male’, we can generate the binary variable “female”, which takes the value 1 if the person is female and 0 otherwise. We can also generate the variable “male”, which takes the value 1 if the person is male and 0 otherwise.
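The “Gender” example above can be sketched in plain pandas. This is only an illustration of the idea, not the encoder's own code; the toy dataframe is made up for the example.

```python
import pandas as pd

# Toy data; the values are illustrative only.
df = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# One dummy variable per category: 1 if the category is present, 0 otherwise.
df["female"] = (df["Gender"] == "female").astype(int)
df["male"] = (df["Gender"] == "male").astype(int)

print(df)
```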

The encoder has the option to generate one dummy variable per category, or to create dummy variables only for the top n most popular categories, that is, the n categories that appear in the largest number of observations.
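As a rough sketch of the "top n" idea, the most frequent categories can be found with value_counts() and dummies built only for those. The series, the value of n and the `colour_<category>` column names are assumptions made for this illustration.

```python
import pandas as pd

# Toy variable: 'red' appears 3 times, 'blue' twice, the others once.
colours = pd.Series(
    ["red", "blue", "red", "green", "blue", "red", "yellow"], name="colour"
)

n = 2  # number of most frequent categories to encode

# value_counts() sorts categories by frequency, most frequent first.
top = colours.value_counts().head(n).index.tolist()

# Dummies are created only for the top categories; 'green' and 'yellow'
# get no column, so their rows contain only zeros.
dummies = pd.DataFrame({f"colour_{c}": (colours == c).astype(int) for c in top})

print(top)
print(dummies)
```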

If dummy variables are created for all the categories of a variable, you have the option to drop one of them to avoid redundant information. That is, the variable is encoded into k-1 dummy variables, where k is the number of unique categories.
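The redundancy can be checked on a small example: with k dummies, the last one is fully determined by the other k-1. The sketch below uses plain pandas and made-up data, and it orders categories by first appearance, which is an assumption of the example.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a", "b"], name="var")

# k dummies, one per category, in order of first appearance.
categories = list(s.unique())
full = pd.DataFrame({f"var_{c}": (s == c).astype(int) for c in categories})

# k-1 dummies: drop the dummy of the last category.
reduced = full.iloc[:, :-1]

# The dropped dummy is 1 exactly when all the remaining dummies are 0,
# so it carries no additional information.
recovered = 1 - reduced.sum(axis=1)

print(full.shape, reduced.shape)
print((recovered == full["var_c"]).all())
```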

The encoder encodes only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no list is passed, the encoder finds and encodes all variables of type object.
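Selecting object-type variables automatically can be approximated with pandas' select_dtypes(). This is a sketch of the behaviour described above, not necessarily how the encoder implements it; the dataframe is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 38, 26],
    "fare": [7.25, 71.28, 7.92],
    # dtype is set explicitly so the columns are of type object
    "embarked": pd.Series(["S", "C", "S"], dtype="object"),
    "cabin": pd.Series(["n", "C", "n"], dtype="object"),
})

# Keep only the columns whose dtype is object; numeric columns are skipped.
categorical = df.select_dtypes(include="object").columns.tolist()

print(categorical)
```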

The encoder first finds the categories to be encoded for each variable (fit).

The encoder then creates one dummy variable per category for each variable (transform).

Note: new categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no binary variable will be created for them).
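The effect of the note above can be illustrated with a two-step sketch in plain pandas: the categories are learned from the training data, and dummies are then built for the new data using only those learned categories. The data and the `embarked_<category>` column names are assumptions for the example.

```python
import pandas as pd

train = pd.Series(["S", "C", "Q", "S"], name="embarked")
test = pd.Series(["S", "X", "C"], name="embarked")  # 'X' never appears in train

# "fit": learn the categories from the training data only.
learned = list(train.unique())

# "transform": one dummy per learned category.
encoded = pd.DataFrame(
    {f"embarked_{c}": (test == c).astype(int) for c in learned}
)

# No column is created for 'X'; its row contains only zeros.
print(encoded)
```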

Parameters
  • top_categories (int, default=None) – If None, a dummy variable will be created for each category of the variable. Alternatively, top_categories indicates the number of most frequent categories to encode. Dummy variables will be created only for those popular categories and the rest will be ignored. Note that this is equivalent to grouping all the remaining categories in one group.

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.

  • drop_last (boolean, default=False) – Only used if top_categories = None. If False, dummy variables are created for all the categories (k dummies). If True, the dummy variable for the last category in the list is not created (k-1 dummies).

fit(X, y=None)[source]

Learns the unique categories per variable. If top_categories is indicated, it will learn the most popular categories. Alternatively, it learns all unique categories per variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the selected variables.

  • y (pandas series, default=None) – Target. It is not needed in this encoder. You can pass y or None.

encoder_dict_

The dictionary containing the categories for which dummy variables will be created.

Type

dictionary
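To make the shape of this dictionary concrete, the fit step can be sketched as a small function that maps each variable to the categories to encode. The helper `learn_categories` is hypothetical and written only for this illustration; it mirrors the two cases described above (all categories, or the most frequent ones) but is not the library's implementation.

```python
import pandas as pd

def learn_categories(X, variables, top_categories=None):
    """Hypothetical helper: map each variable to the categories to encode."""
    mapping = {}
    for var in variables:
        if top_categories is None:
            # all unique categories, in order of first appearance
            mapping[var] = list(X[var].unique())
        else:
            # only the most frequent categories
            mapping[var] = (
                X[var].value_counts().head(top_categories).index.tolist()
            )
    return mapping

X = pd.DataFrame({
    "embarked": ["S", "S", "S", "C", "C", "Q"],
    "cabin": ["n", "n", "n", "n", "C", "B"],
})

all_cats = learn_categories(X, ["embarked", "cabin"])
top_1 = learn_categories(X, ["embarked", "cabin"], top_categories=1)

print(all_cats)
print(top_1)
```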

transform(X)[source]

Creates the dummy / binary variables.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.

Returns

X_transformed – The shape of the dataframe will be different from the original as it includes the dummy variables.

Return type

pandas dataframe.
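The change in shape can be seen in a minimal sketch of the transform step. It assumes that each encoded variable is replaced by its dummy variables and that the new columns are named `<variable>_<category>`; both the naming and the hard-coded dictionary are assumptions made for the example.

```python
import pandas as pd

X = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "embarked": ["S", "C", "S", "Q"],
})

# Hypothetical result of a previous fit step.
encoder_dict = {"embarked": ["S", "C", "Q"]}

X_t = X.copy()
for var, cats in encoder_dict.items():
    # add one dummy column per learned category
    for cat in cats:
        X_t[f"{var}_{cat}"] = (X_t[var] == cat).astype(int)
    # the original categorical column is replaced by its dummies
    X_t = X_t.drop(columns=var)

print(X.shape, X_t.shape)
print(X_t.columns.tolist())
```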