OneHotCategoricalEncoder

The OneHotCategoricalEncoder() replaces each original categorical variable with a set of binary variables, one per unique category. The encoder can create either k or k-1 binary variables, where k is the number of unique categories. It can also create binary variables only for the n most popular categories, where n is set by the user.

The OneHotCategoricalEncoder() works only with categorical variables. A list of variables can be indicated; otherwise, the encoder will automatically select all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
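def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are stored as '?' in this file
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin (missing cabins become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of filling in place on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data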

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = ce.OneHotCategoricalEncoder(
    top_categories=2,
    variables=['pclass', 'cabin', 'embarked'],
    drop_last=False)

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

encoder.encoder_dict_
{'pclass': [3, 1], 'cabin': ['n', 'C'], 'embarked': ['S', 'C']}

API Reference

class feature_engine.categorical_encoders.OneHotCategoricalEncoder(top_categories=None, variables=None, drop_last=False)

One hot encoding consists of replacing the categorical variable with a combination of boolean variables that take the value 0 or 1 to indicate whether a certain category is present for an observation.

These boolean variables are also known as dummy variables or binary variables. For example, from the categorical variable “Gender” with categories ‘female’ and ‘male’, we can generate the boolean variable “female”, which takes 1 if the person is female and 0 otherwise. We can also generate the variable “male”, which takes 1 if the person is male and 0 otherwise.
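The “Gender” example can be sketched with plain pandas. This is only an illustration of the idea, not the encoder's own code; the toy dataframe is made up for the example.

```python
import pandas as pd

# Hypothetical toy data; the column name follows the example in the text.
df = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# One dummy variable per category: 1 where the category is present, 0 otherwise.
df["female"] = (df["Gender"] == "female").astype(int)
df["male"] = (df["Gender"] == "male").astype(int)

print(df)
```

Each row has exactly one of the two dummy variables set to 1.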

The encoder has the option to generate one dummy variable per category present in a variable, or to create dummy variables only for the top n most popular categories, that is, the n categories shared by the largest number of observations.
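A minimal pandas sketch of encoding only the most popular categories follows. The toy data and the `variable_category` column naming are assumptions made for illustration; they are not taken from the encoder's source.

```python
import pandas as pd

# Hypothetical toy variable: 'S' appears 4 times, 'C' twice, 'Q' once.
s = pd.Series(["S", "C", "S", "Q", "S", "C", "S"], name="embarked")

top_n = 2
# value_counts() sorts by frequency, so head(top_n) keeps the most popular categories.
top = s.value_counts().head(top_n).index.tolist()

# Create dummy variables only for the popular categories (assumed naming scheme).
dummies = pd.DataFrame({f"{s.name}_{cat}": (s == cat).astype(int) for cat in top})

print(top)
print(dummies)
```

The observation with the rare category 'Q' gets 0 in every dummy variable, so all remaining categories end up sharing one implicit group.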

If dummy variables are created for all the categories of a variable, you have the option to drop one category to avoid redundant information (encoding into k-1 variables, where k is the number of unique categories).
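To see why the k-th dummy variable is redundant, the sketch below (plain pandas on made-up data, not the encoder's implementation) rebuilds the dropped column from the remaining k-1 columns.

```python
import pandas as pd

# Hypothetical toy variable with k = 3 unique categories.
s = pd.Series(["a", "b", "c", "a", "c"], name="var")
categories = list(s.unique())  # categories in order of appearance

# Full encoding: k dummy variables.
full = pd.DataFrame({cat: (s == cat).astype(int) for cat in categories})

# k-1 encoding: omit the dummy variable of the last category.
reduced = full.iloc[:, :-1]

# Every row of the full encoding sums to 1, so the dropped column
# equals 1 minus the sum of the columns that were kept.
recovered = 1 - reduced.sum(axis=1)

print(recovered.tolist() == full[categories[-1]].tolist())
```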

The encoder will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed, the encoder will find and encode all categorical variables (object type) and ignore the rest.

The encoder first finds the categories to be encoded for each variable (fit). The encoder then creates one dummy variable per category for each variable (transform).
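The two-step fit and transform flow can be sketched with a small stand-in class. `TinyOneHotEncoder`, the toy dataframes, the `variable_category` column names and the choice to drop the original column are all assumptions for illustration, not the feature_engine implementation.

```python
import pandas as pd


class TinyOneHotEncoder:
    """Illustrative stand-in, not the feature_engine encoder."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X):
        # fit: find the categories to encode for each variable
        self.encoder_dict_ = {var: list(X[var].unique()) for var in self.variables}
        return self

    def transform(self, X):
        # transform: create one dummy variable per learned category
        X = X.copy()
        for var, categories in self.encoder_dict_.items():
            for cat in categories:
                X[f"{var}_{cat}"] = (X[var] == cat).astype(int)
        # this sketch drops the original categorical column
        return X.drop(columns=self.variables)


train = pd.DataFrame({"colour": ["red", "blue", "red"], "size": [1, 2, 3]})
test = pd.DataFrame({"colour": ["blue", "green"], "size": [4, 5]})

enc = TinyOneHotEncoder(variables=["colour"]).fit(train)
test_t = enc.transform(test)
print(test_t)
```

Because the categories are learned in fit, the train and test sets get the same columns. In this sketch a category not seen during fit ('green') simply receives 0 in every dummy variable.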

Parameters:
  • top_categories (int, default=None) – If None, a dummy variable will be created for each category of each variable. If set to an integer, top_categories indicates the number of most frequent categories to encode: the encoder will find the most frequent categories and create dummy variables only for those, and the remaining categories will be ignored. Note that this is equivalent to grouping all the remaining categories in one group.
  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and select all object type variables.
  • drop_last (boolean, default=False) – Only used if top_categories = None. If False, dummy variables are created for all the available categories. If True, the dummy variable for the last category in the list is omitted, giving k-1 dummy variables.
encoder_dict_

The dictionary containing the frequent categories (that will be kept) for each variable.

Type: dictionary
fit(self, X, y=None)

Learns the unique categories per variable. If top_categories is indicated, it learns the most popular categories. Otherwise, it learns all unique categories per variable.

Parameters:
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables.
  • y (Target, default=None) – Optional target. The categories are learned from X.
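The learning step described for fit can be approximated as below. `learn_categories` is a hypothetical helper written for this illustration; the order in which categories are listed, and therefore which one counts as "last", is an assumption of the sketch.

```python
import pandas as pd


def learn_categories(X, variables, top_categories=None, drop_last=False):
    """Hypothetical helper mirroring the description of fit()."""
    learned = {}
    for var in variables:
        if top_categories is not None:
            # learn only the most popular categories
            learned[var] = X[var].value_counts().head(top_categories).index.tolist()
        else:
            # learn all unique categories, optionally omitting the last one
            categories = list(X[var].unique())
            learned[var] = categories[:-1] if drop_last else categories
    return learned


# Hypothetical toy data: 'n' appears 3 times, 'C' twice, 'B' once.
X = pd.DataFrame({"cabin": ["B", "n", "C", "n", "n", "C"]})

all_cats = learn_categories(X, ["cabin"])
top_two = learn_categories(X, ["cabin"], top_categories=2)
k_minus_1 = learn_categories(X, ["cabin"], drop_last=True)

print(all_cats)
print(top_two)
print(k_minus_1)
```

Note that drop_last has no effect in this sketch once top_categories is given, matching the parameter description.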
transform(self, X)

Creates the dummy / boolean variables.

Parameters:
  • X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
Returns:
  • X_transformed (pandas dataframe) – The shape of the dataframe will be different from the original as it includes the dummy variables.