CountFrequencyCategoricalEncoder

The CountFrequencyCategoricalEncoder() replaces categories with the number of observations or the fraction of observations per category. For example, if 10 observations show the category blue for the variable color, blue will be replaced by 10. Alternatively, with frequency encoding, if 20% of observations show the category red, red will be replaced by 0.20. The CountFrequencyCategoricalEncoder() works only with categorical variables. A list of variables can be indicated, or the encoder will automatically select all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import categorical_encoders as ce

# Load dataset
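def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin code
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back; in-place fillna on a selected column
    # is not reliable in recent pandas versions
    data['embarked'] = data['embarked'].fillna('C')
    return data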

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = ce.CountFrequencyCategoricalEncoder(
    encoding_method='frequency',
    variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

encoder.encoder_dict_
{'cabin': {'n': 0.7663755458515283,
  'C': 0.07751091703056769,
  'B': 0.04585152838427948,
  'E': 0.034934497816593885,
  'D': 0.034934497816593885,
  'A': 0.018558951965065504,
  'F': 0.016375545851528384,
  'G': 0.004366812227074236,
  'T': 0.001091703056768559},
 'pclass': {3: 0.5436681222707423,
  1: 0.25109170305676853,
  2: 0.2052401746724891},
 'embarked': {'S': 0.7117903930131004,
  'C': 0.19759825327510916,
  'Q': 0.0906113537117904}}
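
As a quick sanity check on the output above, the following sketch copies the 'embarked' frequencies from encoder_dict_ and applies them by hand. It uses plain pandas only and is an illustration of the mapping, not the encoder's internal code; the four-row sample Series is a hypothetical input made up for the example.

```python
import math

import pandas as pd

# Frequencies copied from the 'embarked' entry of encoder_dict_ shown above.
embarked_freq = {
    "S": 0.7117903930131004,
    "C": 0.19759825327510916,
    "Q": 0.0906113537117904,
}

# The frequencies of all categories of one variable should add up to 1.
total = sum(embarked_freq.values())
print(math.isclose(total, 1.0))

# Hypothetical handful of rows, encoded by hand with Series.map.
sample = pd.Series(["S", "C", "S", "Q"], name="embarked")
encoded_sample = sample.map(embarked_freq)
print(encoded_sample.tolist())
```

Each category is simply looked up in the dictionary, so every 'S' in the sample receives the same value, 0.7117903930131004.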

API Reference

class feature_engine.categorical_encoders.CountFrequencyCategoricalEncoder(encoding_method='count', variables=None)

The CountFrequencyCategoricalEncoder() replaces categories by the count of observations per category or by the percentage of observations per category.

For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
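
The colour example can be reproduced with plain pandas. This is a minimal sketch on made-up data (100 observations, 10 of them blue); the Series and its other categories are assumptions for illustration, and value_counts stands in for whatever the encoder does internally.

```python
import pandas as pd

# Toy data: 100 observations of a hypothetical 'colour' variable, 10 of them blue.
colour = pd.Series(["blue"] * 10 + ["red"] * 20 + ["green"] * 70, name="colour")

# 'count' encoding: number of observations per category.
counts = colour.value_counts().to_dict()

# 'frequency' encoding: fraction of observations per category.
frequencies = colour.value_counts(normalize=True).to_dict()

print(counts["blue"])
print(frequencies["blue"])
```

With these numbers, blue maps to 10 under 'count' and to 0.1 under 'frequency'.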

The CountFrequencyCategoricalEncoder() will encode only categorical variables (type ‘object’). A list of variables can be passed as an argument. If no variables are passed, the encoder will find and encode all variables of type object and ignore the rest.

During fit, the encoder learns, for each variable, the mapping from each category to its count or frequency. During transform, the encoder replaces the categories with those mapped numbers.
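
The fit and transform steps can be sketched in a few lines of pandas. The class below, SketchCountFrequencyEncoder, is a hypothetical and simplified stand-in written for this illustration; it is not feature_engine's implementation and omits input validation and the handling of missing or unseen categories. The variables_ attribute is likewise an assumption of the sketch.

```python
import pandas as pd


class SketchCountFrequencyEncoder:
    """Hypothetical, simplified stand-in; not feature_engine's implementation."""

    def __init__(self, encoding_method="count", variables=None):
        if encoding_method not in ("count", "frequency"):
            raise ValueError("encoding_method must be 'count' or 'frequency'")
        self.encoding_method = encoding_method
        self.variables = variables

    def fit(self, X, y=None):
        # With no explicit list, pick every object-dtype column.
        if self.variables is None:
            self.variables_ = list(X.select_dtypes(include="object").columns)
        else:
            self.variables_ = list(self.variables)
        # Learn one {category: number} dictionary per variable.
        normalize = self.encoding_method == "frequency"
        self.encoder_dict_ = {
            var: X[var].value_counts(normalize=normalize).to_dict()
            for var in self.variables_
        }
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched.
        X = X.copy()
        for var in self.variables_:
            X[var] = X[var].map(self.encoder_dict_[var])
        return X


# Made-up training data: one object column and one numeric column.
train = pd.DataFrame({
    "colour": pd.Series(["blue", "red", "red", "green"], dtype="object"),
    "size": [1, 2, 3, 4],
})
sketch = SketchCountFrequencyEncoder(encoding_method="count").fit(train)
encoded = sketch.transform(train)
print(sketch.encoder_dict_)
print(encoded)
```

Only the object column is encoded; the numeric column passes through unchanged, which mirrors the behaviour described for variables=None.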

Parameters:
  • encoding_method (str, default='count') – Desired method of encoding. ‘count’: number of observations per category. ‘frequency’: fraction of observations per category.
  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all object type variables.
encoder_dict_

The dictionary containing, for every variable, the {category: count / frequency} pairs used to replace the categories.

Type: dictionary
fit(self, X, y=None)

Learns the numbers that should be used to replace the categories in each variable.

Parameters:
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables.
  • y (None) – y is not needed in this encoder, yet the sklearn pipeline API requires this parameter for checking. You can either leave it as None or pass y.