EndTailImputer

The EndTailImputer() replaces missing data with a value at the end of the distribution. The value can be determined as the mean plus or minus a multiple of the standard deviation, or using the inter-quartile range proximity rule. The value can also be determined as a multiple of the maximum value. See the API Reference below for more details.

The user decides whether the missing data should be placed at the right or left tail of the variable distribution.

The imputer works only with numerical variables. The user can pass a list of variables to impute; otherwise the imputer automatically selects all numerical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

# set up the imputer
tail_imputer = mdi.EndTailImputer(distribution='gaussian',
                          tail='right',
                          fold=3,
                          variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
tail_imputer.fit(X_train)

# transform the data
train_t = tail_imputer.transform(X_train)
test_t = tail_imputer.transform(X_test)

# compare the distribution of LotFrontage before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
[Figure: kernel density estimates of LotFrontage before and after imputation (endtailimputer.png)]

API Reference

class feature_engine.missing_data_imputers.EndTailImputer(distribution='gaussian', tail='right', fold=3, variables=None)[source]

The EndTailImputer() transforms features by replacing missing data with a value at either tail of the distribution.

The EndTailImputer() works only with numerical variables.

The user can indicate the variables to be imputed in a list. Alternatively, the EndTailImputer() will automatically find and select all numerical variables.

The imputer first calculates the values at the end of the distribution for each variable (fit). These values are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:

Gaussian limits:

right tail: mean + 3*std

left tail: mean - 3*std

IQR limits:

right tail: 75th quantile + 3*IQR

left tail: 25th quantile - 3*IQR

where IQR is the inter-quartile range = 75th quantile - 25th quantile

Maximum value:

right tail: max * 3

left tail: not applicable

You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’.

The imputer then replaces the missing data with the estimated values (transform).
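As a worked illustration of the limits listed above, the following sketch computes the replacement value for a small list of numbers using only the Python standard library. The helper name end_tail_value is hypothetical and is not part of feature_engine. It assumes the sample standard deviation and linearly interpolated quartiles, which may differ from what the library computes internally.

```python
import statistics


def end_tail_value(values, distribution="gaussian", tail="right", fold=3):
    """Hypothetical helper: return the end-of-distribution value for a list of numbers."""
    if distribution == "max":
        # For 'max' the tail argument is ignored.
        return max(values) * fold

    if distribution == "gaussian":
        # Assumption: sample standard deviation (n - 1 in the denominator).
        low = high = statistics.mean(values)
        spread = statistics.stdev(values)
    elif distribution == "skewed":
        # Assumption: quartiles by linear interpolation between data points.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        low, high = q1, q3
        spread = q3 - q1  # inter-quartile range
    else:
        raise ValueError("distribution must be 'gaussian', 'skewed' or 'max'")

    if tail == "right":
        return high + fold * spread
    if tail == "left":
        return low - fold * spread
    raise ValueError("tail must be 'left' or 'right'")


data = [1, 2, 3, 4, 5]
print(end_tail_value(data, "gaussian", "right"))  # mean + 3 * std, about 7.74
print(end_tail_value(data, "skewed", "right"))    # 4.0 + 3 * 2.0 = 10.0
print(end_tail_value(data, "skewed", "left"))     # 2.0 - 3 * 2.0 = -4.0
print(end_tail_value(data, "max"))                # 5 * 3 = 15
```

Passing a different fold, for example fold=1.5 with the skewed rule, moves the replacement value closer to the bulk of the data.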

Parameters
  • distribution (str, default=gaussian) –

    Method to be used to find the replacement values. Can take ‘gaussian’, ‘skewed’ or ‘max’.

    gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.

    skewed: the imputer will use the IQR limits to find the values to replace missing data.

    max: the imputer will use the maximum value multiplied by ‘fold’ to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.

  • tail (str, default=right) – Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.

  • fold (int, default=3) – Factor by which the std, the IQR or the maximum value is multiplied. Recommended values are 2 or 3 for Gaussian, or 1.5 or 3 for skewed.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.

fit(X, y=None)[source]

Learns the values at the end of the variable distribution.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. The user can pass the entire dataframe, not just the variables that need imputation.

  • y (None) – y is not needed in this imputation. You can pass None or y.

imputer_dict_

The dictionary containing the values at the end of the distribution per variable. These values will be used by the imputer to replace missing data.

Type

dictionary
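To give a feel for what such a dictionary might contain, here is a minimal pandas sketch of the fit step for the right Gaussian tail. The function learn_end_tail_values and the toy column names are hypothetical, and the sketch is not the library's implementation. It relies on pandas skipping NaN values and using the sample standard deviation by default.

```python
import numpy as np
import pandas as pd


def learn_end_tail_values(X, variables=None, fold=3):
    """Hypothetical helper: map each variable to mean + fold * std."""
    if variables is None:
        # No list given: select every numerical column.
        variables = list(X.select_dtypes(include="number").columns)
    # mean() and std() ignore NaN; std() uses ddof=1 by default.
    return (X[variables].mean() + fold * X[variables].std()).to_dict()


X = pd.DataFrame({
    "area": [10.0, 20.0, np.nan, 30.0],
    "rooms": [2, 3, 4, 3],
    "street": ["a", "b", "a", "b"],  # non-numerical, so not selected
})

learned_values = learn_end_tail_values(X)
print(learned_values)  # one entry per numerical column: 'area' and 'rooms'
```

In this toy frame the observed values of area are 10, 20 and 30, so the learned value is 20 + 3 * 10 = 50.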

transform(X)[source]

Replaces missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
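The transform step can be pictured as a per-column fill using the learned dictionary. The sketch below uses plain pandas with a hand-written dictionary. The function name impute_with_learned_values and the example values are hypothetical and only illustrate the idea, not the library's code.

```python
import numpy as np
import pandas as pd


def impute_with_learned_values(X, learned_values):
    """Hypothetical helper: fill NaN in each listed column with its learned value."""
    # fillna with a dict fills only the columns named in the dict
    # and returns a new dataframe, leaving X unchanged.
    return X.fillna(value=learned_values)


# Assumed values, as if learned from a train set.
learned_values = {"area": 50.0, "rooms": 5.4}

X_new = pd.DataFrame({
    "area": [15.0, np.nan],
    "rooms": [np.nan, 2.0],
    "street": ["a", None],  # not in the dictionary, so left untouched
})

X_imputed = impute_with_learned_values(X_new, learned_values)
print(X_imputed)
```

Columns that are absent from the dictionary keep their missing values, which matches the note that only the selected variables end up without missing data.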