.. _categorical_imputer:

.. currentmodule:: feature_engine.imputation

CategoricalImputer
==================

Categorical data are common in most data science projects and can also show missing
values. There are **2 main imputation methods** used to replace missing data in
categorical variables. One method consists of replacing the missing values with the
most frequent category. The second method consists of replacing missing values with a
dedicated string, for example, "Missing."

Scikit-learn's machine learning algorithms can handle neither missing data nor
categorical variables out of the box. Hence, during data preprocessing, we need to use
imputation techniques to replace the nan values with a permitted value, and then
proceed with categorical encoding, before training classification or regression models.

Handling missing values
-----------------------

Feature-engine's :class:`CategoricalImputer()` can replace missing data in categorical
variables with an arbitrary value, like the string 'Missing', or with the most frequent
category.

You can impute a subset of the categorical variables by passing their names to
:class:`CategoricalImputer()` in a list. Alternatively, the categorical imputer
automatically finds and imputes all variables of type object and categorical found in
the training dataframe.

Originally, we designed this imputer to work only with categorical variables. In
version 1.1.0, we introduced the parameter `ignore_format` to allow the imputer to also
impute numerical variables. This is because, in some cases, variables that are
categorical by nature take numerical values.

Python implementation
---------------------

We'll show the :class:`CategoricalImputer()`'s data imputation functionality using the
Ames house prices dataset. We'll start by loading the necessary libraries, functions
and classes, loading the dataset, and separating it into a training and a test set.

.. code:: python

    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split

    from feature_engine.imputation import CategoricalImputer

    data = fetch_openml(name='house_prices', as_frame=True)
    data = data.frame

    X = data.drop(['SalePrice', 'Id'], axis=1)
    y = data['SalePrice']

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    print(X_train.head())

In the following output we see the predictor variables of the house prices dataset:

.. code:: python

          MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
    254           20       RL         70.0     8400   Pave   NaN      Reg
    1066          60       RL         59.0     7837   Pave   NaN      IR1
    638           30       RL         67.0     8777   Pave   NaN      Reg
    799           50       RL         60.0     7200   Pave   NaN      Reg
    380           50       RL         50.0     5000   Pave  Pave      Reg

         LandContour Utilities LotConfig  ... ScreenPorch PoolArea PoolQC  Fence  \
    254          Lvl    AllPub    Inside  ...           0        0    NaN    NaN
    1066         Lvl    AllPub    Inside  ...           0        0    NaN    NaN
    638          Lvl    AllPub    Inside  ...           0        0    NaN  MnPrv
    799          Lvl    AllPub    Corner  ...           0        0    NaN  MnPrv
    380          Lvl    AllPub    Inside  ...           0        0    NaN    NaN

         MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
    254          NaN       0      6   2010       WD        Normal
    1066         NaN       0      5   2009       WD        Normal
    638          NaN       0      5   2008       WD        Normal
    799          NaN       0      6   2007       WD        Normal
    380          NaN       0      5   2010       WD        Normal

    [5 rows x 79 columns]

The variables `Alley` and `MasVnrType` show null values; let's check that out:

.. code:: python

    X_train[['Alley', 'MasVnrType']].isnull().sum()

We see the null values in the following output:

.. code:: python

    Alley         1094
    MasVnrType       6
    dtype: int64
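As a quick complementary check, we can survey the missing data in every categorical
column with plain pandas. This is a minimal sketch, not part of the imputer's API:

.. code:: python

    # Count missing values in every column of type object or categorical.
    categorical_vars = X_train.select_dtypes(include=['object', 'category']).columns

    print(X_train[categorical_vars].isnull().sum().sort_values(ascending=False).head())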
Imputation with an arbitrary string
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's set up the categorical imputer to impute these 2 variables with the arbitrary
string 'missing':

.. code:: python

    imputer = CategoricalImputer(
        variables=['Alley', 'MasVnrType'],
        fill_value="missing",
    )

    imputer.fit(X_train)

During fit, the transformer checks that the 2 variables are of type object or
categorical, and creates a dictionary pairing each variable with its replacement value.
We can check the value that will be used to replace the missing data as follows:

.. code:: python

    imputer.fill_value

We can check the dictionary with the replacement values per variable like this:

.. code:: python

    imputer.imputer_dict_

The dictionary contains the names of the variables as its keys and the imputation
values as its values. In this case, the result is not super exciting because we are
replacing the nan values in all variables with the same value:

.. code:: python

    {'Alley': 'missing', 'MasVnrType': 'missing'}

We can now go ahead and impute the missing data, and then plot the categories in the
resulting variable after the imputation:

.. code:: python

    train_t = imputer.transform(X_train)
    test_t = imputer.transform(X_test)

    test_t['MasVnrType'].value_counts().plot.bar()
    plt.ylabel("Number of observations")
    plt.show()

In the following plot, we see the presence of the category "missing", corresponding to
the imputed values:

.. image:: ../../images/missingcategoryimputer.png

|

Imputation with the most frequent category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's now impute the variables with the most frequent category instead:

.. code:: python

    imputer = CategoricalImputer(
        variables=['Alley', 'MasVnrType'],
        imputation_method="frequent",
    )

    imputer.fit(X_train)

We can find the most frequent category per variable in the imputer dictionary:

.. code:: python

    imputer.imputer_dict_

In the following output, we see that the most frequent category for `Alley` is `'Grvl'`
and the most frequent value for `MasVnrType` is `'None'`:

.. code:: python

    {'Alley': 'Grvl', 'MasVnrType': 'None'}

We can now go ahead and impute the missing data to obtain a complete dataset, at least
for these 2 variables, and then plot the distribution of values after the imputation:

.. code:: python

    train_t = imputer.transform(X_train)
    test_t = imputer.transform(X_test)

    test_t['MasVnrType'].value_counts().plot.bar()
    plt.ylabel("Number of observations")
    plt.show()

In the following image we see the resulting variable distribution:

.. image:: ../../images/frequentcategoryimputer.png

|

Automatically impute all categorical variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

:class:`CategoricalImputer()` can automatically find and impute all categorical
features in the training dataset when we set the parameter `variables` to None:

.. code:: python

    imputer = CategoricalImputer(
        variables=None,
    )

    train_t = imputer.fit_transform(X_train)
    test_t = imputer.transform(X_test)

We can find the categorical variables in the `variables_` attribute:

.. code:: python

    imputer.variables_

Below, we see the list of categorical variables that were found in the training
dataframe:

.. code:: python

    ['MSZoning',
     'Street',
     'Alley',
     'LotShape',
     'LandContour',
     ...
     'SaleType',
     'SaleCondition']
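Imputing variables with numerical values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As we mentioned earlier, some variables are categorical by nature but take numerical
values, and the parameter `ignore_format` allows :class:`CategoricalImputer()` to
impute them as well. The following is a minimal sketch on a toy dataframe; the column
`quality_code` is made up for illustration:

.. code:: python

    import pandas as pd

    from feature_engine.imputation import CategoricalImputer

    # A categorical variable encoded with numbers, with one missing value.
    df = pd.DataFrame({"quality_code": [1, 2, 2, None, 3, 2]})

    imputer = CategoricalImputer(
        variables=["quality_code"],
        imputation_method="frequent",
        ignore_format=True,  # allow the imputation of a numerical variable
    )

    print(imputer.fit_transform(df))

The missing value is replaced with the mode of the variable, in this case 2.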
Categorical features with multiple modes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is possible that a variable has more than one mode. In that case, the transformer
will raise an error. For example, let's set up the transformer to impute the variable
`PoolQC` with the most frequent category:

.. code:: python

    imputer = CategoricalImputer(
        variables=['PoolQC'],
        imputation_method="frequent",
    )

    imputer.fit(X_train)

`PoolQC` has more than 1 mode, so the transformer raises the following error:

.. code:: python

    ValueError: The variable PoolQC contains multiple frequent categories.

We can check that the variable has various modes like this:

.. code:: python

    X_train['PoolQC'].mode()

We see that this variable has 3 categories with the same maximum number of
observations:

.. code:: python

    0    Ex
    1    Fa
    2    Gd
    Name: PoolQC, dtype: object

Considerations
--------------

Replacing missing values in categorical features with a bespoke category is standard
practice, and perhaps the most natural thing to do.

We'll probably want to impute with the most frequent category when the percentage of
missing values is small and the cardinality of the variable is low, so as not to
introduce unnecessary noise.

Combining imputation with data analysis is useful to decide on the most convenient
imputation method, as well as to assess the impact of the imputation on the variable
distribution. Note that the variable distribution and its cardinality will affect the
performance and workings of machine learning models.

Imputation with the most frequent category will blend the missing values with the most
common values of the variable. Hence, it is common practice to add dummy variables to
indicate that the values were originally missing. See :class:`AddMissingIndicator`, and
the sketch at the end of this page.

Additional resources
--------------------

For more details about this and other feature engineering methods check out these
resources:

.. figure:: ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure:: ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
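Finally, here is a minimal sketch of the pattern mentioned in the considerations:
adding missing indicators before imputing with the most frequent category. It assumes
the `X_train` and `X_test` splits from the examples above, and that
:class:`AddMissingIndicator` appends indicator columns with its default `_na` suffix:

.. code:: python

    from sklearn.pipeline import Pipeline

    from feature_engine.imputation import AddMissingIndicator, CategoricalImputer

    pipe = Pipeline([
        # Add the binary indicators first, while the nan values are still present.
        ("indicators", AddMissingIndicator(variables=['Alley', 'MasVnrType'])),
        # Then replace the nan values with the most frequent category.
        ("imputer", CategoricalImputer(
            variables=['Alley', 'MasVnrType'],
            imputation_method="frequent",
        )),
    ])

    train_t = pipe.fit_transform(X_train)
    test_t = pipe.transform(X_test)

    # The indicators are added as new columns with the '_na' suffix.
    print(train_t[['Alley', 'Alley_na', 'MasVnrType', 'MasVnrType_na']].head())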