CountFrequencyEncoder
API Reference
class feature_engine.encoding.CountFrequencyEncoder(encoding_method='count', variables=None)
The CountFrequencyEncoder() replaces categories with either the count or the fraction (relative frequency) of observations per category.
For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
The CountFrequencyEncoder() will encode only categorical variables (type ‘object’). A list of variables to encode can be passed as an argument. Alternatively, the encoder will find and encode all categorical variables (object type).
The encoder first maps the categories to the counts or frequencies for each variable (fit). The encoder then replaces the categories by those mapped numbers (transform).
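The two steps can be illustrated with plain pandas. This is a minimal sketch of the idea under stated assumptions, not the library's implementation; the `colour` column and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with a single categorical variable
df = pd.DataFrame({"colour": ["blue", "blue", "red", "green", "blue"]})

# "fit": learn one category -> number mapping for the variable
count_map = df["colour"].value_counts().to_dict()
freq_map = df["colour"].value_counts(normalize=True).to_dict()

# "transform": replace each category with its mapped number
counts = df["colour"].map(count_map)
freqs = df["colour"].map(freq_map)
```

Here `count_map` plays the role of one entry of `encoder_dict_` when `encoding_method='count'`, and `freq_map` the role of the same entry when `encoding_method='frequency'`.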
- Parameters
- encoding_method : str, default='count'
Desired method of encoding.
'count': number of observations per category
'frequency': fraction of observations per category
- variables : list, default=None
The list of categorical variables that will be encoded. If None, the encoder will find and transform all object type variables.
Attributes
encoder_dict_:
Dictionary with the count or frequency per category, per variable.
Notes
NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
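The effect of an unseen category can be reproduced with a dictionary-based mapping in plain pandas. The snippet is only a sketch, assuming the replacement behaves like `Series.map` with a dictionary; the data and the unseen category `"X"` are hypothetical.

```python
import pandas as pd

# Hypothetical train and test values for one variable
train = pd.Series(["S", "C", "S", "Q"], name="embarked")
test = pd.Series(["S", "Q", "X"], name="embarked")  # "X" never appears in train

# Mapping learned from the training data only
count_map = train.value_counts().to_dict()

# Categories missing from the mapping become NaN
encoded = test.map(count_map)
n_missing = int(encoded.isna().sum())
```

Counting the missing values after encoding, as in `n_missing`, is a simple way to detect that the test set contains categories the mapping does not cover.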
Methods
fit:
Learn the count or frequency per category, per variable.
transform:
Encode the categories to numbers.
fit_transform:
Fit to the data, then transform it.
inverse_transform:
Convert the encoded numbers back to the original categories.
fit(X, y=None)
Learn the counts or frequencies which will be used to replace the categories.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The training dataset. Can be the entire dataframe, not just the variables to be transformed.
- y : pandas Series, default = None
y is not needed in this encoder. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame.
If any user provided variable is not categorical
- ValueError
If there are no categorical variables in the dataframe or the dataframe is empty
If the variable(s) contain null values
inverse_transform(X)
Convert the encoded variable back to the original values.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The transformed dataframe.
- Returns
- X : pandas dataframe of shape = [n_samples, n_features]
The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the variable(s) contain null values
If the dataframe does not have the same number of columns as the dataframe used in fit()
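Conceptually, reversing the encoding means inverting the learned dictionary. The sketch below uses plain pandas and a hypothetical mapping, and assumes every category has a distinct encoded value; if two categories share the same count or frequency, the mapping cannot be inverted unambiguously.

```python
import pandas as pd

# Hypothetical learned mapping for one variable (category -> count)
mapping = {"S": 5, "C": 3, "Q": 1}

# Invert it (number -> category); only safe when all values are distinct
inverse = {value: category for category, value in mapping.items()}

encoded = pd.Series([5, 1, 3, 5])
decoded = encoded.map(inverse)
```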
transform(X)
Replace categories with the learned parameters.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The dataset to transform.
- Returns
- X : pandas dataframe of shape = [n_samples, n_features]
The dataframe containing the categories replaced by numbers.
- Return type
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the variable(s) contain null values
If the dataframe does not have the same number of columns as the dataframe used in fit()
- Warning
If NaN values were introduced after encoding.
Example
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing cabins become 'n' (from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the encoder
encoder = CountFrequencyEncoder(encoding_method='frequency',
                                variables=['cabin', 'pclass', 'embarked'])

# fit the encoder
encoder.fit(X_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

# inspect the learned mappings
encoder.encoder_dict_
{'cabin': {'n': 0.7663755458515283,
  'C': 0.07751091703056769,
  'B': 0.04585152838427948,
  'E': 0.034934497816593885,
  'D': 0.034934497816593885,
  'A': 0.018558951965065504,
  'F': 0.016375545851528384,
  'G': 0.004366812227074236,
  'T': 0.001091703056768559},
 'pclass': {3: 0.5436681222707423,
  1: 0.25109170305676853,
  2: 0.2052401746724891},
 'embarked': {'S': 0.7117903930131004,
  'C': 0.19759825327510916,
  'Q': 0.0906113537117904}}