CategoricalVariableImputer

The CategoricalVariableImputer() replaces missing data in categorical variables with the string 'Missing' or with the most frequent category.

It works only with categorical variables. You can pass a list of variables to impute; otherwise, the imputer automatically selects all categorical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the imputer
imputer = mdi.CategoricalVariableImputer(variables=['Alley', 'MasVnrType'])

# fit the imputer
imputer.fit(X_train)

# transform the data
train_t = imputer.transform(X_train)
test_t = imputer.transform(X_test)

# plot the category counts of an imputed variable
test_t['MasVnrType'].value_counts().plot.bar()
[Figure: bar plot of the MasVnrType category counts in the transformed test set (../_images/missingcategoryimputer.png)]

API Reference

class feature_engine.missing_data_imputers.CategoricalVariableImputer(imputation_method='missing', fill_value='Missing', variables=None, return_object=False)

The CategoricalVariableImputer() replaces missing data in categorical variables with the string 'Missing' or with the most frequent category.

The CategoricalVariableImputer() works only with categorical variables.

The user can pass a list with the variables to be imputed. Alternatively, the CategoricalVariableImputer() will automatically find and select all variables of type object.
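The selection rule described above can be sketched with plain pandas. This is an illustration of the documented behaviour, not the library's own code: find_object_variables and X_demo are hypothetical names, and the use of select_dtypes is an assumption about one reasonable way to find object-type columns.

```python
import pandas as pd


def find_object_variables(X, variables=None):
    """Hypothetical helper: use the given list, or fall back to all object-type columns."""
    if variables is not None:
        return list(variables)
    # No list given: pick every column whose dtype is object
    return list(X.select_dtypes(include="object").columns)


# Small made-up dataframe; the categorical columns are stored explicitly as object
X_demo = pd.DataFrame({
    "Alley": pd.Series(["Grvl", None, "Pave"], dtype=object),
    "MasVnrType": pd.Series(["Stone", "BrkFace", None], dtype=object),
    "LotArea": [8450, 9600, 11250],
})

selected = find_object_variables(X_demo)                        # automatic selection
explicit = find_object_variables(X_demo, variables=["Alley"])   # user-defined list
print(selected, explicit)
```

The numerical column LotArea is left out of the automatic selection because its dtype is not object.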

Parameters
  • imputation_method (str, default='missing') – Desired method of imputation. Can be 'frequent' or 'missing'.

  • fill_value (str, default='Missing') – Only used when imputation_method='missing'. Sets a user-defined value to replace the missing data.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all object type variables.

  • return_object (bool, default=False) –

    Applies to numerical variables cast as object. Determines whether the transformed variables are returned as numeric (False) or re-cast as object (True). Note that pandas automatically re-casts them as numeric after imputation with the mode.

    Tip: return the variables as object if planning to do categorical encoding with feature-engine.

fit(X, y=None)

Learns the most frequent category of each variable when imputation_method='frequent'.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the selected variables.

  • y (None) – y is not needed in this imputation. You can pass None or y.

imputer_dict_

The dictionary mapping each variable to the most frequent category or to the fill value ('Missing' by default), depending on the imputation_method. The most frequent category is calculated when fitting the transformer.

Type

dictionary
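To make the shape of this dictionary concrete, here is a minimal pandas sketch of how such a mapping could be built for each imputation method. learn_imputer_dict and X_train_demo are hypothetical names used only for illustration, and taking the first value returned by Series.mode() is an assumption about how ties would be resolved; the library's internals may differ.

```python
import pandas as pd


def learn_imputer_dict(X, variables, imputation_method="missing", fill_value="Missing"):
    """Hypothetical helper: map each variable to the value that will replace NaN."""
    if imputation_method == "missing":
        # Nothing is learned from the data: every variable gets the fill value
        return {var: fill_value for var in variables}
    if imputation_method == "frequent":
        # mode() ignores missing values; [0] takes the first mode if there are ties
        return {var: X[var].mode()[0] for var in variables}
    raise ValueError("imputation_method must be 'missing' or 'frequent'")


# Small made-up train set
X_train_demo = pd.DataFrame({
    "Alley": pd.Series(["Grvl", "Grvl", None, "Pave"], dtype=object),
    "MasVnrType": pd.Series(["BrkFace", "Stone", "BrkFace", None], dtype=object),
})
variables = ["Alley", "MasVnrType"]

missing_dict = learn_imputer_dict(X_train_demo, variables)
frequent_dict = learn_imputer_dict(X_train_demo, variables, imputation_method="frequent")
print(missing_dict)
print(frequent_dict)
```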

transform(X)

Replaces missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
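The transform step can be pictured as filling each selected variable with the value stored for it at fit time. The sketch below uses plain pandas and hypothetical names (impute, learned, X_new); it assumes the learned values are held in a dictionary like the one described for imputer_dict_ and is not the library's implementation.

```python
import pandas as pd


def impute(X, imputer_dict):
    """Hypothetical helper: fill NaN in each mapped variable with its stored value."""
    X = X.copy()  # leave the caller's dataframe untouched
    for var, value in imputer_dict.items():
        X[var] = X[var].fillna(value)
    return X


# Values assumed to have been learned on a train set beforehand
learned = {"Alley": "Missing", "MasVnrType": "BrkFace"}

# Small made-up dataframe to transform
X_new = pd.DataFrame({
    "Alley": pd.Series([None, "Pave"], dtype=object),
    "MasVnrType": pd.Series(["Stone", None], dtype=object),
    "LotArea": [8450, 9600],
})

X_out = impute(X_new, learned)
print(X_out)
```

Because the replacement values come from the fitted dictionary, new data is imputed with what was learned on the train set rather than with its own most frequent category. The output keeps the same shape as the input, and columns outside the dictionary are returned unchanged.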