MeanEncoder#

Mean encoding is the process of replacing the categories in categorical features with the mean value of the target variable for each category. For example, if we are trying to predict the default rate (that’s the target variable), and our dataset has the categorical variable City, with the categories of London, Manchester, and Bristol, and the default rate per city is 0.1, 0.5, and 0.3, respectively, then with mean encoding we would replace London with 0.1, Manchester with 0.5, and Bristol with 0.3.
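To make the computation concrete, below is a minimal pandas sketch of this idea (the toy data is made up to reproduce the default rates from the example):

import pandas as pd

# Made-up data reproducing the default rates from the example above:
# London 0.1, Manchester 0.5, Bristol 0.3 (10 observations per city).
df = pd.DataFrame({
    "City": ["London"] * 10 + ["Manchester"] * 10 + ["Bristol"] * 10,
    "default": [1] + [0] * 9 + [1] * 5 + [0] * 5 + [1] * 3 + [0] * 7,
})

# The mean of the target per category is the encoding mapping.
mapping = df.groupby("City")["default"].mean()

# Replace each city with its mean target value.
df["City"] = df["City"].map(mapping)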

Mean encoding, together with one hot encoding and ordinal encoding, is among the most commonly used categorical encoding techniques in data science.

Mean encoding can easily cause overfitting. That’s partly because the encoding captures some information about the target in the predictive features. More importantly, overfitting arises when infrequent categories are encoded with unreliable mean target values. In short, the mean target values seen for those categories in the training set do not hold for test data or new observations.

Overfitting#

When the categories in the categorical features are well represented, or, in other words, when our dataset contains enough observations of each category that we want to encode, then the simple average of the target variable per category is a good approximation. We can trust that a new data point, say from the test set, that shows a given category will have a target value similar to the mean target value we calculated for that category during training.

However, if only a few observations show some of the categories, then the mean target value for those categories will be unreliable. In other words, we are less certain that a new observation showing one of these categories will have a target value close to the one we estimated.

To account for the uncertainty of the encoding values for rare categories, what we normally do is “blend” the mean target value per category with the general mean of the target, calculated over the entire training dataset. The blending is proportional to the variability of the target within the category and to the category frequency.

Smoothing#

To avoid overfitting, we can determine the mean target value estimates as a mixture of two values: the mean target value per category (known as the posterior) and the mean target value in the entire dataset (known as the prior).

The following formula shows the estimation of the mean target value with smoothing:

\[mapping = w_i \cdot posterior + (1 - w_i) \cdot prior\]

The prior and posterior values are “blended” using a weighting factor (w_i). This weighting factor is a function of the category group size (n_i), the variance of the target in the entire data (t), and the variance of the target within the category (s):

\[w_i = \frac{n_i t}{s + n_i t}\]

When the category group is large, the weighting factor is close to 1, and therefore more weight is given to the posterior (the mean of the target per category). When the category group size is small, the weight gets closer to 0, and more weight is given to the prior (the mean of the target in the entire dataset).

In addition, if the variability of the target within that category is large, we also give more weight to the prior, whereas if it is small, then we give more weight to the posterior.
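To illustrate, here is a minimal pandas sketch of this smoothing formula (the toy data and variable names are ours; this is not Feature-engine’s internal code):

import pandas as pd

# Toy binary target with a frequent and an infrequent category.
df = pd.DataFrame({
    "city": ["London"] * 50 + ["Bristol"] * 3,
    "target": [1] * 25 + [0] * 25 + [1, 1, 0],
})

prior = df["target"].mean()  # mean target over the entire dataset
t = df["target"].var()       # variance of the target in the data
grp = df.groupby("city")["target"].agg(["mean", "var", "count"])

# w_i = n_i * t / (s + n_i * t), where s is the target variance per category
w = grp["count"] * t / (grp["var"] + grp["count"] * t)

# mapping = w_i * posterior + (1 - w_i) * prior
mapping = w * grp["mean"] + (1 - w) * prior
print(mapping)  # the infrequent category shrinks towards the prior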

In short, adding smoothing can help prevent overfitting in those cases where categorical data have many infrequent categories or show high cardinality.

High cardinality#

High cardinality refers to a high number of unique categories in the categorical features. Mean encoding was specifically designed to tackle highly cardinal variables by taking advantage of this smoothing function, which will essentially blend infrequent categories together by replacing them with values very close to the overall target mean calculated over the training data.

Another encoding method that tackles cardinality out of the box is count encoding. See for example CountFrequencyEncoder.

To account for highly cardinal variables in alternative encoding methods, you can group rare categories together by using the RareLabelEncoder.

Alternative Python implementations of mean encoding#

In Feature-engine, we blend the prior and posterior estimates based on the target variability and the category frequency. The original paper describes alternative formulations of the blending. If you want to try those out, use the transformers from the Python library Category Encoders, such as TargetEncoder or MEstimateEncoder.
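For example, here is a minimal sketch using Category Encoders’ MEstimateEncoder, which blends the posterior with the prior through an additive smoothing parameter m (this assumes the category_encoders package is installed; check its documentation for the exact formulation):

import pandas as pd
from category_encoders import MEstimateEncoder

X = pd.DataFrame({"city": ["London", "London", "Bristol"]})
y = pd.Series([1, 0, 1])

# Larger values of m pull the encodings more strongly towards the prior.
encoder = MEstimateEncoder(cols=["city"], m=10.0)
X_t = encoder.fit_transform(X, y)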

Mean encoder#

Feature-engine’s MeanEncoder() replaces categories with the mean of the target per category. By default, it does not apply smoothing; it replaces each category with the mean target value determined over the training data set (just the posterior).

To apply smoothing with the formulation that we described earlier, set the parameter smoothing to "auto". This is our recommended option. Alternatively, you can set smoothing to any value of your choice, in which case the weighting factor w_i is calculated as follows:

\[w_i = \frac{n_i}{s + n_i}\]

where s is the value you pass to smoothing.
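For example, to use a fixed smoothing value of 10 (an arbitrary value chosen for illustration):

# With smoothing=10, the weighting factor is w_i = n_i / (10 + n_i).
encoder = MeanEncoder(
    variables=['cabin', 'sex', 'embarked'],
    smoothing=10,
)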

Unseen categories#

Unseen categories are labels that were not seen during training, that is, categories that were not present in the training data.

With the MeanEncoder(), we can take care of unseen categories in one of three ways:

  • We can set the mean encoder to ignore unseen categories, in which case those categories will be replaced with NaN.

  • We can set the mean encoder to raise an error when it encounters unseen categories. This is useful when we don’t expect new categories for those categorical variables.

  • We can instruct the mean encoder to replace unseen or new categories with the mean of the target shown in the training data, that is, the prior.
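This behavior is controlled through the unseen parameter. A minimal sketch follows; the parameter name and option values below are those of recent Feature-engine releases, so check the API documentation of your version:

# unseen="ignore" -> unseen categories are replaced with NaN (the default)
# unseen="raise"  -> an error is raised when unseen categories are found
# unseen="encode" -> unseen categories are replaced with the prior
encoder = MeanEncoder(unseen="encode")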

Mean encoding and machine learning#

Feature-engine’s MeanEncoder() can perform mean encoding for regression and binary classification datasets. At the moment, we do not support multi-class targets.

Python examples#

In the following sections, we’ll show the functionality of MeanEncoder() using the Titanic Dataset.

First, let’s load the libraries, functions and classes:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import MeanEncoder

To avoid data leakage, it is important to separate the data into training and test sets. The mean target values, with or without smoothing, will be determined using the training data only.

Let’s load and split the data:

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
    predictors_only=True,
    cabin="letter_only",
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())

We see the resulting dataframe containing 3 categorical columns: sex, cabin and embarked:

      pclass     sex        age  sibsp  parch     fare cabin embarked
501        2  female  13.000000      0      1  19.5000     M        S
588        2  female   4.000000      1      1  23.0000     M        S
402        2  female  30.000000      1      0  13.8583     M        C
1193       3    male  29.881135      0      0   7.7250     M        Q
686        3  female  22.000000      0      0   7.7250     M        Q

Simple mean encoding#

Let’s set up the MeanEncoder() to replace the categories in the categorical features with the target mean, without smoothing:

encoder = MeanEncoder(
    variables=['cabin', 'sex', 'embarked'],
)

encoder.fit(X_train, y_train)

With fit() the encoder learns the target mean value for each category and stores those values in the encoder_dict_ attribute:

encoder.encoder_dict_

The encoder_dict_ contains the mean value of the target per category, per variable. We can use this dictionary to map the numbers in the encoded features to the original categorical values.

{'cabin': {'A': 0.5294117647058824,
  'B': 0.7619047619047619,
  'C': 0.5633802816901409,
  'D': 0.71875,
  'E': 0.71875,
  'F': 0.6666666666666666,
  'G': 0.5,
  'M': 0.30484330484330485,
  'T': 0.0},
 'sex': {'female': 0.7283582089552239, 'male': 0.18760757314974183},
 'embarked': {'C': 0.553072625698324,
  'Missing': 1.0,
  'Q': 0.37349397590361444,
  'S': 0.3389570552147239}}

We can now go ahead and replace the categorical values with the numerical values:

train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

print(train_t.head())

Below we see the resulting dataframe, where the categorical values are now replaced with the target mean values:

      pclass       sex        age  sibsp  parch     fare     cabin  embarked
501        2  0.728358  13.000000      0      1  19.5000  0.304843  0.338957
588        2  0.728358   4.000000      1      1  23.0000  0.304843  0.338957
402        2  0.728358  30.000000      1      0  13.8583  0.304843  0.553073
1193       3  0.187608  29.881135      0      0   7.7250  0.304843  0.373494
686        3  0.728358  22.000000      0      0   7.7250  0.304843  0.373494
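Because encoder_dict_ stores the category-to-number mapping, we can also invert it to map the encoded values back to the original categories. Here is a minimal sketch for the sex variable (the inverse_map name is ours, not a Feature-engine function):

# Reverse the learned mapping (number -> category) for one variable.
# Note: if two categories share the same mean (like 'D' and 'E' in cabin),
# the inversion is ambiguous; sex has unique values, so it is safe here.
inverse_map = {v: k for k, v in encoder.encoder_dict_["sex"].items()}
print(train_t["sex"].map(inverse_map).head())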

Mean encoding with smoothing#

By default, MeanEncoder() determines the mean target values without blending. If we want to apply smoothing to handle highly cardinal variables and reduce overfitting, we set up the transformer as follows:

encoder = MeanEncoder(
    variables=None,
    smoothing="auto"
)

encoder.fit(X_train, y_train)

In this example, we did not indicate which variables to encode. MeanEncoder() can automatically find the categorical variables, which are stored in one of its attributes:

encoder.variables_

Below we see the categorical features found by MeanEncoder():

['sex', 'cabin', 'embarked']

We can find the categorical mappings calculated by the mean encoder:

encoder.encoder_dict_

Note that these values are different from those determined without smoothing:

{'sex': {'female': 0.7275051072923914, 'male': 0.18782635616273297},
 'cabin': {'A': 0.5210189753697639,
  'B': 0.755161569137655,
  'C': 0.5608140829162441,
  'D': 0.7100896537503179,
  'E': 0.7100896537503179,
  'F': 0.6501082490288561,
  'G': 0.47606795923242295,
  'M': 0.3049458046855866,
  'T': 0.0},
 'embarked': {'C': 0.552100581239763,
  'Missing': 1.0,
  'Q': 0.3736336816011083,
  'S': 0.3390242994568531}}

We can now go ahead and replace the categorical values with the numerical values:

train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

print(train_t.head())

Below we see the resulting dataframe with the encoded features:

      pclass       sex        age  sibsp  parch     fare     cabin  embarked
501        2  0.727505  13.000000      0      1  19.5000  0.304946  0.339024
588        2  0.727505   4.000000      1      1  23.0000  0.304946  0.339024
402        2  0.727505  30.000000      1      0  13.8583  0.304946  0.552101
1193       3  0.187826  29.881135      0      0   7.7250  0.304946  0.373634
686        3  0.727505  22.000000      0      0   7.7250  0.304946  0.373634

We can now use these dataframes to train machine learning models for regression or classification.
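For example, here is a minimal sketch fitting a scikit-learn classifier on the encoded datasets:

from sklearn.linear_model import LogisticRegression

# Train on the encoded training set and evaluate on the encoded test set.
model = LogisticRegression(max_iter=1000)
model.fit(train_t, y_train)
print(model.score(test_t, y_test))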

Mean encoding variables with numerical values#

MeanEncoder(), like all Feature-engine encoders, is designed to work with variables of type object or categorical by default. If you want to encode numeric variables, you need to instruct the transformer to ignore the data type:

encoder = MeanEncoder(
    variables=['cabin', 'pclass'],
    ignore_format=True,
)

t_train = encoder.fit_transform(X_train, y_train)
t_test = encoder.transform(X_test)

After encoding the features, we can use the datasets to train machine learning algorithms.

One last thing to note before closing: mean encoding does not increase the dimensionality of the resulting dataframes; from 1 categorical feature, we obtain 1 encoded variable. Hence, this encoding method is suitable for models that are sensitive to the size of the feature space.

Additional resources#

In the accompanying Jupyter notebook, you can find more details about the MeanEncoder() functionality, together with example plots of the encoded variables.

For tutorials about this and other feature engineering methods, check out these resources:

  • Feature Engineering for Machine Learning

  • Feature Engineering for Time Series Forecasting

Or read our book:

  • Python Feature Engineering Cookbook

Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.