CountFrequencyEncoder
The CountFrequencyEncoder() replaces categories with either the count or the fraction of observations per category. For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
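The two encoding options can be sketched with plain pandas (a minimal illustration with a toy variable, not the library's implementation):

```python
import pandas as pd

# Toy "colour" variable: 10 blue, 5 red, 5 green observations
colour = pd.Series(["blue"] * 10 + ["red"] * 5 + ["green"] * 5)

counts = colour.value_counts()               # count encoding: blue -> 10
freqs = colour.value_counts(normalize=True)  # frequency encoding: blue -> 0.5

print(counts["blue"])  # 10
print(freqs["blue"])   # 0.5
```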
Let’s look at an example using the Titanic Dataset.
First, let’s load the data and separate it into train and test:
from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.encoding import CountFrequencyEncoder
X, y = load_titanic(
return_X_y_frame=True,
handle_missing=True,
predictors_only=True,
cabin="letter_only",
)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0,
)
print(X_train.head())
We see the resulting data below:
pclass sex age sibsp parch fare cabin embarked
501 2 female 13.000000 0 1 19.5000 M S
588 2 female 4.000000 1 1 23.0000 M S
402 2 female 30.000000 1 0 13.8583 M C
1193 3 male 29.881135 0 0 7.7250 M Q
686 3 female 22.000000 0 0 7.7250 M Q
Now, we set up the CountFrequencyEncoder() to replace the categories with their frequencies in the 3 indicated variables only. Note that pclass is numeric, so we set ignore_format=True to allow the encoder to treat it as categorical:
encoder = CountFrequencyEncoder(
encoding_method='frequency',
variables=['cabin', 'pclass', 'embarked'],
ignore_format=True,
)
encoder.fit(X_train)
With fit() the encoder learns the frequency of each category, which is stored in its encoder_dict_ attribute:
encoder.encoder_dict_
The encoder_dict_ contains the frequency of each category, for every variable we want to encode. It is, in effect, the mapping from original value to encoded value.
{'cabin': {'M': 0.7663755458515283,
'C': 0.07751091703056769,
'B': 0.04585152838427948,
'E': 0.034934497816593885,
'D': 0.034934497816593885,
'A': 0.018558951965065504,
'F': 0.016375545851528384,
'G': 0.004366812227074236,
'T': 0.001091703056768559},
'pclass': {3: 0.5436681222707423,
1: 0.25109170305676853,
2: 0.2052401746724891},
'embarked': {'S': 0.7117903930131004,
'C': 0.19541484716157206,
'Q': 0.0906113537117904,
'Missing': 0.002183406113537118}}
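Conceptually, the transform step applies these learned mappings to each variable. A simplified sketch with pandas, using a few of the cabin frequencies shown above as the mapping (this is an illustration, not the library's internal code):

```python
import pandas as pd

# Subset of the learned frequencies from encoder_dict_['cabin'] (rounded)
cabin_freqs = {"M": 0.766376, "C": 0.077511, "B": 0.045852}

cabin = pd.Series(["M", "C", "M", "B"])
encoded = cabin.map(cabin_freqs)  # replace each category by its frequency
print(encoded.tolist())  # [0.766376, 0.077511, 0.766376, 0.045852]
```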
We can now go ahead and replace the original categories with their frequencies:
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
print(train_t.head())
Below we see that the original variables were replaced with the frequencies:
pclass sex age sibsp parch fare cabin embarked
501 0.205240 female 13.000000 0 1 19.5000 0.766376 0.711790
588 0.205240 female 4.000000 1 1 23.0000 0.766376 0.711790
402 0.205240 female 30.000000 1 0 13.8583 0.766376 0.195415
1193 0.543668 male 29.881135 0 0 7.7250 0.766376 0.090611
686 0.543668 female 22.000000 0 0 7.7250 0.766376 0.090611
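One practical caveat: a category that appears in the test set but not in the training set has no learned frequency. With a plain dictionary mapping (a simplified sketch of the general pitfall, not necessarily the library's exact behaviour, which depends on its configuration for unseen categories) such categories become NaN:

```python
import pandas as pd

# Frequencies learned on the train set (rounded values from embarked above)
learned = {"S": 0.7118, "C": 0.1954, "Q": 0.0906}

# "X" never appeared during training, so the mapping yields NaN for it
test_embarked = pd.Series(["S", "X", "Q"])
encoded = test_embarked.map(learned)
print(encoded.isna().tolist())  # [False, True, False]
```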
More details
In the following notebook, you can find more details on the CountFrequencyEncoder() functionality and example plots of the encoded variables:
For more details about this and other feature engineering methods check out these resources:
Feature engineering for machine learning, online course.