MeanEncoder

API Reference

class feature_engine.encoding.MeanEncoder(variables=None)

The MeanEncoder() replaces categories by the mean value of the target for each category.

For example, in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.

The encoder only encodes categorical variables (type ‘object’). A list of variables to encode can be passed as an argument. If no list is passed, the encoder finds and encodes all variables of type object.

During fit, the encoder learns, for each variable, the mapping from each category to a number. During transform, it replaces the categories with the mapped numbers.
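
The two steps can be sketched with plain pandas. This is only an illustration of the idea, not the library's implementation; the toy `colour` and `target` columns are made up so that the means match the example above (0.5, 0.8 and 0.1).

```python
import pandas as pd

# Toy data (hypothetical): 2 blue, 5 red and 10 grey rows with a binary target.
df = pd.DataFrame({
    "colour": ["blue"] * 2 + ["red"] * 5 + ["grey"] * 10,
    "target": [1, 0] + [1, 1, 1, 1, 0] + [1] + [0] * 9,
})

# "fit": learn one number per category, the mean of the target.
mapping = df.groupby("colour")["target"].mean().to_dict()

# "transform": replace each category with its learned number.
encoded = df["colour"].map(mapping)

print(mapping)
```

The dictionary built here plays the role that `encoder_dict_` plays for a single variable.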

Parameters
variables: list, default=None

The list of categorical variables to encode. If None, the encoder will find and select all object type variables.

Attributes

encoder_dict_ :

Dictionary with the target mean value per category per variable.

Notes

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
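
The following sketch shows how an unseen category ends up as NaN, and how grouping infrequent categories can avoid it. It uses plain pandas and a hand-rolled grouping step rather than RareLabelEncoder(); the data, the `"Rare"` label, the 0.2 frequency threshold and the `group_rare` helper are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical training data: "Bristol" is infrequent, "Leeds" never appears.
train = pd.DataFrame({
    "city": ["London"] * 4 + ["Manchester"] * 4 + ["Bristol"],
    "default": [0, 0, 0, 1, 1, 1, 0, 0, 1],
})
test_city = pd.Series(["London", "Leeds", "Bristol"])

# Mean encoding learned on the training data only.
mapping = train.groupby("city")["default"].mean().to_dict()
encoded = test_city.map(mapping)  # "Leeds" has no entry, so it becomes NaN

# Hand-rolled grouping: keep categories seen in at least 20% of training rows,
# and replace everything else (rare or unseen) with the label "Rare".
freq = train["city"].value_counts(normalize=True)
frequent = freq[freq >= 0.2].index

def group_rare(s):
    return s.where(s.isin(frequent), "Rare")

train_grouped = train.assign(city=group_rare(train["city"]))
mapping_grouped = train_grouped.groupby("city")["default"].mean().to_dict()
encoded_grouped = group_rare(test_city).map(mapping_grouped)

print(encoded.isna().sum(), encoded_grouped.isna().sum())
```

After grouping, every test category has an entry in the mapping, so no NaN values appear.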

References

[1] Micci-Barreca D. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems”. ACM SIGKDD Explorations Newsletter, 2001. https://dl.acm.org/citation.cfm?id=507538

Methods

fit:

Learn the target mean value per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Convert the encoded numbers back into the original categories.

fit(X, y)

Learn the mean value of the target for each category of the variable.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the variables to be encoded.

y: pandas series

The target.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If any user-provided variable is not categorical

ValueError
  • If there are no categorical variables in the dataframe or the dataframe is empty

  • If the variable(s) contain null values

inverse_transform(X)

Convert the encoded variable back to the original values.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The transformed dataframe.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

The un-transformed dataframe, with the categorical variables containing the original values.
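
One caveat when reversing a mean encoding: two categories can share the same target mean, in which case the original category cannot be recovered from the number alone. The sketch below does not reproduce how the library implements inverse_transform; it only illustrates the ambiguity, using three values copied from the `encoder_dict_` output in the Example section of this page.

```python
# Values copied from the 'cabin' entry of the example's encoder_dict_ output.
cabin_mapping = {"D": 0.71875, "E": 0.71875, "G": 0.5}

# Naive inversion by swapping keys and values: later keys overwrite earlier
# ones, so "D" and "E" collapse into a single entry.
inverse = {value: category for category, value in cabin_mapping.items()}

print(len(cabin_mapping), len(inverse))
```

Here the inverted dictionary has fewer entries than the original mapping, so the round trip is lossy for cabins D and E.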

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe does not have the same number of columns as the dataframe used in fit()

transform(X)

Replace categories with the learned parameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The dataset to transform.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe containing the categories replaced by numbers.

Raises
TypeError
  • If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If the dataframe does not have the same number of columns as the dataframe used in fit()

Warning

If NaN values were introduced after encoding.

Example

The MeanEncoder() replaces categories with the mean of the target per category. For example, if we are trying to predict the default rate, and our data has the variable city with the categories London, Manchester and Bristol, and the default rate per city is 0.1, 0.5 and 0.3, respectively, the encoder will replace London by 0.1, Manchester by 0.5 and Bristol by 0.3.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import MeanEncoder

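# Load the Titanic dataset and prepare the categorical variables
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are stored as '?' in this file
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing cabins become 'n'
    data['cabin'] = data['cabin'].astype(str).str[0]
    # treat passenger class as a categorical (object) variable
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of calling fillna in place on a column
    data['embarked'] = data['embarked'].fillna('C')
    return data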

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = MeanEncoder(variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

# inspect the learned mappings (target mean per category, per variable)
encoder.encoder_dict_
{'cabin': {'A': 0.5294117647058824,
  'B': 0.7619047619047619,
  'C': 0.5633802816901409,
  'D': 0.71875,
  'E': 0.71875,
  'F': 0.6666666666666666,
  'G': 0.5,
  'T': 0.0,
  'n': 0.30484330484330485},
 'pclass': {1: 0.6173913043478261,
  2: 0.43617021276595747,
  3: 0.25903614457831325},
 'embarked': {'C': 0.5580110497237569,
  'Q': 0.37349397590361444,
  'S': 0.3389570552147239}}