CountFrequencyCategoricalEncoder

The CountFrequencyCategoricalEncoder() replaces each category with either the number of observations or the fraction of observations that show that category. For example, if 10 observations show the category blue for the variable color, blue will be replaced by 10. With frequency encoding, if 20% of observations show the category red, red will be replaced by 0.20.
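To make the two encodings concrete, the following sketch computes them by hand for a made-up colour column, using only the standard library. The data and variable names are hypothetical, chosen to reproduce the numbers in the paragraph above; this is an illustration of the idea, not the encoder's implementation.

```python
from collections import Counter

# Hypothetical toy data: 50 observations of a 'colour' variable.
colours = ["blue"] * 10 + ["red"] * 10 + ["green"] * 30

# 'count' encoding: number of observations per category.
counts = dict(Counter(colours))

# 'frequency' encoding: fraction of observations per category.
frequencies = {category: n / len(colours) for category, n in counts.items()}

print(counts["blue"])      # 10
print(frequencies["red"])  # 0.2
```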

The CountFrequencyCategoricalEncoder() works only with categorical variables. You can pass a list of variables to encode; otherwise, the encoder automatically selects all categorical variables in the train set.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

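# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing values become 'n'
    data['cabin'] = data['cabin'].astype(str).str[0]
    # cast to object so the encoder treats pclass as categorical
    data['pclass'] = data['pclass'].astype('O')
    # fill missing ports of embarkation (assign back rather than using inplace)
    data['embarked'] = data['embarked'].fillna('C')
    return data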

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = ce.CountFrequencyCategoricalEncoder(encoding_method='frequency',
                         variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

encoder.encoder_dict_
{'cabin': {'n': 0.7663755458515283,
  'C': 0.07751091703056769,
  'B': 0.04585152838427948,
  'E': 0.034934497816593885,
  'D': 0.034934497816593885,
  'A': 0.018558951965065504,
  'F': 0.016375545851528384,
  'G': 0.004366812227074236,
  'T': 0.001091703056768559},
 'pclass': {3: 0.5436681222707423,
  1: 0.25109170305676853,
  2: 0.2052401746724891},
 'embarked': {'S': 0.7117903930131004,
  'C': 0.19759825327510916,
  'Q': 0.0906113537117904}}
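
As a quick sanity check on the output above, the frequencies learned for each variable should add up to 1, since every training observation falls into exactly one category. The sketch below copies the printed dictionary into a literal (named `learned` here purely for illustration) rather than calling the encoder, so it runs without downloading the dataset.

```python
import math

# Values copied from the encoder_dict_ output shown above.
learned = {
    "cabin": {
        "n": 0.7663755458515283,
        "C": 0.07751091703056769,
        "B": 0.04585152838427948,
        "E": 0.034934497816593885,
        "D": 0.034934497816593885,
        "A": 0.018558951965065504,
        "F": 0.016375545851528384,
        "G": 0.004366812227074236,
        "T": 0.001091703056768559,
    },
    "pclass": {
        3: 0.5436681222707423,
        1: 0.25109170305676853,
        2: 0.2052401746724891,
    },
    "embarked": {
        "S": 0.7117903930131004,
        "C": 0.19759825327510916,
        "Q": 0.0906113537117904,
    },
}

# Sum the frequencies of each variable and compare with 1.
totals = {variable: sum(mapping.values()) for variable, mapping in learned.items()}
for variable, total in totals.items():
    print(variable, math.isclose(total, 1.0))
```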

API Reference

class feature_engine.categorical_encoders.CountFrequencyCategoricalEncoder(encoding_method='count', variables=None)

The CountFrequencyCategoricalEncoder() replaces categories with the count of observations per category or with the fraction of observations per category.

For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.

The CountFrequencyCategoricalEncoder() will encode only categorical variables (type ‘object’). A list of variables to be encoded can be passed as an argument. If no list is passed, the encoder will find and encode all variables of type object.

The encoder first maps the categories to the numbers (counts or frequencies) for each variable (fit).

The encoder then transforms the categories to those mapped numbers (transform).
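
The two steps can be sketched with plain pandas. This illustrates the idea only and is not the library's source code: the helper names `learn_mappings` and `apply_mappings` and the toy dataframe are made up for this example.

```python
import pandas as pd


def learn_mappings(X, variables, encoding_method="count"):
    """Fit step: build a {variable: {category: number}} dictionary."""
    if encoding_method not in ("count", "frequency"):
        raise ValueError("encoding_method must be 'count' or 'frequency'")
    normalize = encoding_method == "frequency"
    return {
        var: X[var].value_counts(normalize=normalize).to_dict()
        for var in variables
    }


def apply_mappings(X, mappings):
    """Transform step: replace each category with its learned number."""
    X = X.copy()  # leave the caller's dataframe untouched
    for var, mapping in mappings.items():
        X[var] = X[var].map(mapping)
    return X


# Hypothetical toy data
train = pd.DataFrame({"colour": ["blue", "blue", "red", "green", "blue"]})

mappings = learn_mappings(train, ["colour"], encoding_method="count")
encoded = apply_mappings(train, mappings)
print(mappings)
print(encoded["colour"].tolist())  # [3, 3, 1, 1, 3]
```

Keeping the learned dictionary separate from the step that applies it is what allows the same mapping, learned on the train set, to be reused on the test set.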

Parameters
  • encoding_method (str, default='count') –

    Desired method of encoding.

    'count': number of observations per category

    'frequency': fraction of observations per category (a value between 0 and 1)

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all object type variables.
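
Because automatic selection is based on the object type, a numeric column such as pclass is only picked up once it has been cast to object, as the Titanic example does with `astype('O')`. The sketch below shows one way to list object-type columns with pandas; it uses a hypothetical toy dataframe and is not necessarily how the encoder performs the selection internally.

```python
import pandas as pd

# Hypothetical toy data; 'pclass' is numeric until cast to object.
X = pd.DataFrame({
    "pclass": [1, 3, 3],
    "embarked": ["S", "C", "S"],
    "age": [22.0, 38.0, 26.0],
})
# Make the string column explicitly object dtype, since newer pandas
# versions may infer a dedicated string dtype instead.
X["embarked"] = X["embarked"].astype("object")

before = X.select_dtypes(include="object").columns.tolist()

# Cast the numeric column so it counts as categorical.
X["pclass"] = X["pclass"].astype("object")
after = X.select_dtypes(include="object").columns.tolist()

print(before)  # ['embarked']
print(after)   # ['pclass', 'embarked']
```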

fit(X, y=None)

Learns the counts or frequencies which will be used to replace the categories.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. The user can pass the entire dataframe.

  • y (None) – y is not needed in this encoder. You can pass y or None.

encoder_dict_

Dictionary containing the {category: count / frequency} pairs for each variable.

Type

dictionary

inverse_transform(X)

Convert the data back to the original representation.

Parameters

X_transformed (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Returns

X – The un-transformed dataframe, that is, containing the original values of the categorical variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
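
Reversing the encoding amounts to inverting each learned dictionary, which only works cleanly when no two categories share the same number. In the example output earlier on this page, the cabin categories E and D have the same frequency, so the encoded value alone cannot tell them apart. The sketch below, built around the hypothetical helper `invert_mapping`, makes that limitation explicit; how the library itself handles such ties is not described here.

```python
def invert_mapping(mapping):
    """Return (inverse, tied) for one variable's learned mapping.

    `inverse` maps each unambiguous number back to its category.
    `tied` lists the numbers shared by more than one category; those
    cannot be mapped back to a single original category.
    """
    groups = {}
    for category, number in mapping.items():
        groups.setdefault(number, []).append(category)
    inverse = {n: cats[0] for n, cats in groups.items() if len(cats) == 1}
    tied = {n: cats for n, cats in groups.items() if len(cats) > 1}
    return inverse, tied


# Subset of the 'cabin' frequencies from the example output
cabin = {
    "C": 0.07751091703056769,
    "E": 0.034934497816593885,
    "D": 0.034934497816593885,
}
inverse, tied = invert_mapping(cabin)
print(inverse)  # {0.07751091703056769: 'C'}
print(tied)     # {0.034934497816593885: ['E', 'D']}
```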

transform(X)

Replaces categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe containing categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]
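
One practical point when transforming new data: a category that was not present during fit has no learned number. With a plain pandas `map`, as in the sketch below, such values become NaN. This page does not state how the encoder itself treats unseen categories, so the snippet only shows a check you could run beforehand; the mapping and data are hypothetical.

```python
import pandas as pd

# Hypothetical mapping learned on a train set
mapping = {"S": 0.7, "C": 0.2, "Q": 0.1}

# New data containing a category ('X') that is not in the mapping
new_values = pd.Series(["S", "Q", "X"])

# Categories in the new data that have no learned number
unseen = sorted(set(new_values.dropna()) - set(mapping))

# Plain dictionary mapping: unseen categories turn into NaN
encoded = new_values.map(mapping)

print(unseen)                # ['X']
print(encoded.isna().sum())  # 1
```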