CountFrequencyEncoder

API Reference

class feature_engine.encoding.CountFrequencyEncoder(encoding_method='count', variables=None, ignore_format=False)[source]

The CountFrequencyEncoder() replaces categories with either the count or the fraction of observations per category.

For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
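As a rough illustration of the two encodings, the sketch below computes both numbers for a toy colour column using plain pandas. The data and variable names are made up for the example, and the encoder itself is not used.

```python
import pandas as pd

# Toy data (hypothetical): 2 blue, 1 red and 1 green observation.
colour = pd.Series(["blue", "red", "blue", "green"], name="colour")

# 'count': number of observations per category.
counts = colour.value_counts().to_dict()

# 'frequency': fraction of observations per category (values sum to 1).
frequencies = colour.value_counts(normalize=True).to_dict()

print(counts["blue"])       # 2
print(frequencies["blue"])  # 0.5
```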

By default, the CountFrequencyEncoder() encodes only categorical variables (type ‘object’ or ‘categorical’). You can pass a list of variables to encode; otherwise, the encoder will find and encode all categorical variables.

With ignore_format=True, you can encode numerical variables as well. The procedure is identical: you can either pass the list of variables to encode, or let the transformer automatically select all variables.

The encoder first maps the categories to the counts or frequencies for each variable (fit). The encoder then replaces the categories with those numbers (transform).
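The two steps can be sketched in plain pandas as follows. This is a simplified illustration of the idea, not the encoder's actual implementation; the helper names fit_counts and apply_mapping and the toy dataframe are assumptions made for the example.

```python
import pandas as pd


def fit_counts(X, variables):
    """Learn a {variable: {category: count}} mapping from the training data."""
    return {var: X[var].value_counts().to_dict() for var in variables}


def apply_mapping(X, mapping):
    """Return a copy of X with each category replaced by its learned number."""
    X = X.copy()
    for var, category_map in mapping.items():
        X[var] = X[var].map(category_map)
    return X


# Toy training data (hypothetical).
train = pd.DataFrame({"colour": ["blue", "blue", "red"], "size": [1, 2, 3]})

mapping = fit_counts(train, ["colour"])  # the "fit" step
encoded = apply_mapping(train, mapping)  # the "transform" step

print(mapping)                     # {'colour': {'blue': 2, 'red': 1}}
print(encoded["colour"].tolist())  # [2, 2, 1]
```

Keeping the learned mapping separate from the data means the same numbers, learned on the train set, can be applied to any later dataset.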

Parameters
encoding_method: str, default=’count’

Desired method of encoding.

‘count’: number of observations per category

‘frequency’: fraction of observations per category (a value between 0 and 1)

variables: list, default=None

The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

ignore_format: bool, default=False

Whether the format in which the categorical variables are cast should be ignored. If False, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
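The selection rules described for variables and ignore_format can be sketched roughly as below. select_variables is a hypothetical helper written for this illustration; it is not part of feature_engine, and the real checks may differ in detail.

```python
import pandas as pd


def select_variables(X, variables=None, ignore_format=False):
    """Hypothetical helper mirroring the selection rules described above."""
    categorical = list(X.select_dtypes(include=["object", "category"]).columns)
    if variables is None:
        # No list given: every column if ignore_format, else categorical only.
        return list(X.columns) if ignore_format else categorical
    if not ignore_format:
        # A list was given: check that every entry is categorical.
        non_categorical = [v for v in variables if v not in categorical]
        if non_categorical:
            raise TypeError(f"Not categorical: {non_categorical}")
    return list(variables)


# Toy dataframe (hypothetical) with object, categorical and numeric columns.
df = pd.DataFrame({
    "colour": pd.Series(["blue", "red"], dtype="object"),
    "grade": pd.Categorical(["a", "b"]),
    "rooms": [1, 2],
})

print(select_variables(df))                               # ['colour', 'grade']
print(select_variables(df, ignore_format=True))           # ['colour', 'grade', 'rooms']
print(select_variables(df, ["rooms"], ignore_format=True))  # ['rooms']
```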

Attributes

encoder_dict_:

Dictionary with the count or frequency per category, per variable.

variables_:

The group of variables that will be transformed.

n_features_in_:

The number of features in the train set used in fit.

Notes

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
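The effect can be reproduced with a plain pandas sketch: a category with no entry in the learned mapping has nothing to be replaced with. The series below are toy data, and the code imitates the behaviour described in the note rather than calling the encoder.

```python
import pandas as pd

train = pd.Series(["S", "S", "C", "Q"], name="embarked")
test = pd.Series(["S", "X"], name="embarked")  # "X" never appears in train

counts = train.value_counts().to_dict()  # mapping learned from train only
encoded_test = test.map(counts)          # "X" has no entry, so it becomes NaN

# List the categories that could not be encoded.
unseen = sorted(set(test[encoded_test.isna()]))
print(unseen)  # ['X']
```

Checking for unseen categories in this way before modelling makes it easier to decide whether to group rare labels or handle the missing values explicitly.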

Methods

fit:

Learn the count or frequency per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Convert the encoded numbers back to the original categories.

fit(X, y=None)[source]

Learn the counts or frequencies which will be used to replace the categories.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset. Can be the entire dataframe, not just the variables to be transformed.

y: pandas Series, default = None

y is not needed in this encoder. You can pass y or None.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If the user enters non-categorical variables (unless ignore_format is True)

ValueError
  • If there are no categorical variables in the df or the df is empty

  • If the variable(s) contain null values

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters
X: pandas dataframe of shape = [n_samples, n_features].

The transformed dataframe.

Returns
X: pandas dataframe of shape = [n_samples, n_features].

The un-transformed dataframe, with the categorical variables containing the original values.

rtype

DataFrame

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the df has different number of features than the df used in fit()
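Conceptually, reversing the encoding means inverting the learned category-to-number mapping. The sketch below is an assumption about the idea, not the library's code, and invert_mapping is a hypothetical helper. It also shows a caveat: when two categories share the same count or frequency, as E and D do for cabin in the Example, a plain inversion cannot tell them apart.

```python
def invert_mapping(category_map):
    """Build {number: category}; raise if two categories share a number."""
    inverse = {}
    for category, number in category_map.items():
        if number in inverse:
            raise ValueError(
                f"{inverse[number]!r} and {category!r} share the value {number}"
            )
        inverse[number] = category
    return inverse


# Frequencies taken from the Example section of this page.
embarked = {
    "S": 0.7117903930131004,
    "C": 0.19759825327510916,
    "Q": 0.0906113537117904,
}
inverse = invert_mapping(embarked)
print(inverse[0.19759825327510916])  # C

# E and D have identical frequencies, so the inversion is ambiguous.
cabin = {"E": 0.034934497816593885, "D": 0.034934497816593885}
try:
    invert_mapping(cabin)
except ValueError as err:
    print(err)
```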

transform(X)[source]

Replace categories with the learned parameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features].

The dataset to transform.

Returns
X: pandas dataframe of shape = [n_samples, n_features].

The dataframe containing the categories replaced by numbers.

rtype

DataFrame

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the df has different number of features than the df used in fit()

Warning

If NaN values were introduced after encoding.

Example

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import CountFrequencyEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # Missing values are marked with '?' in the raw file
    data = data.replace('?', np.nan)
    # Keep only the first letter of the cabin (missing cabins become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # Cast pclass as object so it is treated as categorical
    data['pclass'] = data['pclass'].astype('O')
    # Fill missing ports of embarkation; assign back instead of using inplace
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = CountFrequencyEncoder(encoding_method='frequency',
                                variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

encoder.encoder_dict_
{'cabin': {'n': 0.7663755458515283,
  'C': 0.07751091703056769,
  'B': 0.04585152838427948,
  'E': 0.034934497816593885,
  'D': 0.034934497816593885,
  'A': 0.018558951965065504,
  'F': 0.016375545851528384,
  'G': 0.004366812227074236,
  'T': 0.001091703056768559},
 'pclass': {3: 0.5436681222707423,
  1: 0.25109170305676853,
  2: 0.2052401746724891},
 'embarked': {'S': 0.7117903930131004,
  'C': 0.19759825327510916,
  'Q': 0.0906113537117904}}