CategoricalImputer

API Reference

class feature_engine.imputation.CategoricalImputer(imputation_method='missing', fill_value='Missing', variables=None, return_object=False)

The CategoricalImputer() replaces missing data in categorical variables with a string, either the default ‘Missing’ or any other string provided by the user. Alternatively, it replaces missing data with the most frequent category.

The CategoricalImputer() works only with categorical variables.

The user can pass a list with the variables to be imputed. Alternatively, the CategoricalImputer() will automatically find and select all variables of type object.

Note

If you want to impute numerical variables with this transformer, you first need to cast them as object. After the imputation, pandas may re-cast them as numeric. Thus, if you plan to apply feature-engine's categorical encoders to these variables after the imputation, make sure the variables are returned as object by setting return_object=True.
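The casting steps in this note can be sketched in plain pandas. This is an illustration of the idea only, not feature-engine's internal code; the series name and values are hypothetical. The final explicit cast plays the role that return_object=True plays in the transformer.

```python
import numpy as np
import pandas as pd

# Hypothetical discrete numeric variable with one missing value.
rooms = pd.Series([2, 3, np.nan, 3], name="rooms")

# Cast to object so the variable is treated as categorical.
rooms_obj = rooms.astype("object")

# Impute with the most frequent value.
most_frequent = rooms_obj.mode()[0]
imputed = rooms_obj.fillna(most_frequent)

# Depending on the pandas version, `imputed` may be numeric again at this point.
# An explicit cast guarantees object dtype for downstream categorical encoders.
imputed = imputed.astype("object")

print(imputed.dtype)
```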

Parameters
imputation_method: str, default='missing'

Desired method of imputation. Can be ‘frequent’ or ‘missing’.

fill_value: str, default='Missing'

Only used when imputation_method='missing'. Can be used to set a user-defined value to replace the missing data.

variables: list, default=None

The list of variables to be imputed. If None, the imputer will find and select all object type variables.

return_object: bool, default=False

Only relevant when imputing numerical variables cast as object. If True, the variables are re-cast as object after the imputation; if False, they are returned with whatever type pandas assigns to them. Note that pandas may automatically re-cast them as numeric after imputation with the mode.

Attributes

imputer_dict_:

Dictionary with most frequent category or string per variable.

Methods

fit:

Learn the most frequent category, or assign the fill string to each variable.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)

Learn the most frequent category if the imputation method is set to frequent.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset.

y: pandas Series, default=None

y is not needed in this imputation. You can pass None or y.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If any user provided variable is not categorical

ValueError

If there are no categorical variables in the dataframe or the dataframe is empty

transform(X)

Replace missing data with the learned parameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The data to be transformed.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe without missing values in the selected variables.

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError

If the dataframe does not have the same number of features as the one used in fit()

Example

The CategoricalImputer() replaces missing data in categorical variables with the string ‘Missing’ or with the most frequent category.

It works only with categorical variables. You can pass a list of variables to impute; otherwise, the imputer automatically selects all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.imputation import CategoricalImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the imputer
imputer = CategoricalImputer(variables=['Alley', 'MasVnrType'])

# fit the imputer
imputer.fit(X_train)

# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# plot the category counts after imputation
test_t['MasVnrType'].value_counts().plot.bar()
[Figure: bar plot of the MasVnrType category counts in the transformed test set (missingcategoryimputer.png)]