EndTailImputer

API Reference

class feature_engine.imputation.EndTailImputer(imputation_method='gaussian', tail='right', fold=3, variables=None)[source]

The EndTailImputer() transforms features by replacing missing data with a value at either tail of the distribution. It works only with numerical variables.

The user can indicate the variables to be imputed in a list. Alternatively, the EndTailImputer() will automatically find and select all variables of type numeric.

The imputer first calculates the values at the end of the distribution for each variable (fit). The values at the end of the distribution are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:

Gaussian limits
  • right tail: mean + 3*std

  • left tail: mean - 3*std

IQR limits:
  • right tail: 75th quantile + 3*IQR

  • left tail: 25th quantile - 3*IQR

where IQR is the inter-quartile range = 75th quantile - 25th quantile

Maximum value:
  • right tail: max * 3

  • left tail: not applicable

You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’.

The imputer then replaces the missing data with the estimated values (transform).
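As a sketch of these rules (plain pandas on a hypothetical variable, not feature_engine's internal code), the candidate replacement values for a single variable with `fold=3` can be computed as follows:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric variable with a missing value.
s = pd.Series([2.0, 3.0, 4.0, 5.0, 6.0, np.nan])

fold = 3

# Gaussian limits: mean +/- fold * std (pandas skips NaN by default).
gaussian_right = s.mean() + fold * s.std()
gaussian_left = s.mean() - fold * s.std()

# IQR proximity rule limits: quartiles +/- fold * IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_right = q3 + fold * iqr
iqr_left = q1 - fold * iqr

# Maximum value: max * fold (right tail only).
max_value = s.max() * fold
```

Any one of these values can then be used to fill the missing entries, depending on `imputation_method` and `tail`.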

Parameters
imputation_method: str, default='gaussian'

Method to be used to find the replacement values. Can take ‘gaussian’, ‘iqr’ or ‘max’.

gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.

iqr: the imputer will use the IQR limits to find the values to replace missing data.

max: the imputer will use the maximum values to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.

tail: str, default='right'

Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.

fold: int, default=3

Factor to multiply the std, the IQR or the Max values. Recommended values are 2 or 3 for Gaussian, or 1.5 or 3 for IQR.

variables: list, default=None

The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.

Attributes

imputer_dict_:

Dictionary with the values at the end of the distribution per variable.

Methods

fit:

Learn values to replace missing data.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.
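The fit/transform cycle can be sketched in plain pandas (a simplified illustration using Gaussian right-tail limits on toy data, not feature_engine's implementation):

```python
import numpy as np
import pandas as pd

# Hypothetical training data with missing values.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, np.nan],
    "b": [10.0, 20.0, 30.0, 40.0],
})

fold = 3

# "fit": learn one replacement value per numeric variable,
# analogous to the imputer_dict_ attribute.
imputer_dict = {
    col: X[col].mean() + fold * X[col].std()
    for col in X.select_dtypes(include="number").columns
}

# "transform": replace missing data with the learned values.
X_t = X.fillna(value=imputer_dict)
```

Passing a dict to `DataFrame.fillna` fills each column with its own learned value, which mirrors how the imputer applies one replacement value per variable.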

fit(X, y=None)[source]

Learn the values at the end of the variable distribution.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset.

y: pandas Series, default=None

y is not needed in this imputation. You can pass None or y.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame

  • If any of the user provided variables are not numerical

ValueError

If there are no numerical variables in the dataframe or the dataframe is empty

transform(X)[source]

Replace missing data with the learned parameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The data to be transformed.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe without missing values in the selected variables.

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError

If the dataframe does not contain the same number of columns as the dataframe used in fit()

Example

The EndTailImputer() replaces missing data with a value at the end of the distribution. The value can be determined using the mean plus or minus a number of times the standard deviation, or using the inter-quartile range proximity rule. The value can also be determined as a factor of the maximum value. See the API Reference above for more details.

The user decides whether the missing data should be placed at the right or left tail of the variable distribution.

It works only with numerical variables. A list of variables can be indicated, or the imputer will automatically select all numerical variables in the train set.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.imputation import EndTailImputer

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0,
)

# set up the imputer
tail_imputer = EndTailImputer(
    imputation_method='gaussian',
    tail='right',
    fold=3,
    variables=['LotFrontage', 'MasVnrArea'],
)
# fit the imputer
tail_imputer.fit(X_train)

# transform the data
train_t = tail_imputer.transform(X_train)
test_t = tail_imputer.transform(X_test)

fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
[Figure: density (KDE) plots of LotFrontage in the original train set and in the imputed train set (red)]