EndTailImputer
API Reference
class feature_engine.imputation.EndTailImputer(imputation_method='gaussian', tail='right', fold=3, variables=None)

The EndTailImputer() transforms features by replacing missing data with a value at either tail of the variable distribution. It works only with numerical variables.

The user can indicate the variables to impute in a list. Alternatively, the EndTailImputer() will automatically find and select all variables of type numeric.
The imputer first calculates the values at the end of the distribution for each variable (fit). The values at the end of the distribution are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:

- Gaussian limits:
right tail: mean + 3*std
left tail: mean - 3*std
- IQR limits:
right tail: 75th quantile + 3*IQR
left tail: 25th quantile - 3*IQR
where IQR is the inter-quartile range = 75th quantile - 25th quantile
- Maximum value:
right tail: max * 3
left tail: not applicable
You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’.
The imputer then replaces the missing data with the estimated values (transform).
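For illustration, these limits can be reproduced with plain pandas. The snippet below is a minimal sketch of the arithmetic only, not feature_engine's internal code; the toy series is made up for the example.

import pandas as pd

# toy numerical variable
s = pd.Series([2, 3, 4, 5, 6, 7, 8])
fold = 3

# Gaussian limits
right_gaussian = s.mean() + fold * s.std()
left_gaussian = s.mean() - fold * s.std()

# IQR proximity rule limits
iqr = s.quantile(0.75) - s.quantile(0.25)
right_iqr = s.quantile(0.75) + fold * iqr
left_iqr = s.quantile(0.25) - fold * iqr

# factor of the maximum value ('max' method; the tail is ignored)
max_value = s.max() * fold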
- Parameters
- imputation_method: str, default='gaussian'
Method to be used to find the replacement values. Can take ‘gaussian’, ‘iqr’ or ‘max’.
gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.
iqr: the imputer will use the IQR limits to find the values to replace missing data.
max: the imputer will use the maximum values to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.
- tail: str, default='right'
Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.
- fold: int, default=3
Factor to multiply the std, the IQR or the max value. Recommended values are 2 or 3 for the Gaussian limits, or 1.5 or 3 for the IQR limits.
- variables: list, default=None
The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.
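For example, the three methods can be configured as follows (a minimal sketch; only the constructor parameters documented above are used):

from feature_engine.imputation import EndTailImputer

# Gaussian limits, right tail (the defaults)
imputer = EndTailImputer(imputation_method='gaussian', tail='right', fold=3)

# IQR proximity rule limits, left tail
imputer = EndTailImputer(imputation_method='iqr', tail='left', fold=3)

# factor of the maximum value; 'tail' is ignored when 'max' is used
imputer = EndTailImputer(imputation_method='max', fold=3)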
Attributes
imputer_dict_:
Dictionary with the values at the end of the distribution per variable.
Methods
fit:
Learn values to replace missing data.
transform:
Impute missing data.
fit_transform:
Fit to the data, then transform it.
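A minimal sketch tying the attribute and methods together; the toy dataframe is made up for the example, and the learned value depends on the data:

import numpy as np
import pandas as pd
from feature_engine.imputation import EndTailImputer

df = pd.DataFrame({'x': [2.0, 4.0, 6.0, np.nan, 8.0]})

imputer = EndTailImputer(imputation_method='gaussian', tail='right', fold=3)
imputer.fit(df)

# dictionary with the value at the end of the distribution per variable
print(imputer.imputer_dict_)

# the missing value in 'x' is replaced with the learned value
df_t = imputer.transform(df)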
fit(X, y=None)

Learn the values at the end of the variable distribution.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y: pandas Series, default=None
y is not needed in this imputation. You can pass None or y.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any of the user provided variables are not numerical
- ValueError
If there are no numerical variables in the df or the df is empty
transform(X)

Replace missing data with the learned parameters.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The data to be transformed.
- Returns
- X: pandas dataframe of shape = [n_samples, n_features]
The dataframe without missing values in the selected variables.
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the dataframe is not of the same size as the one used in fit()
Example
The EndTailImputer() replaces missing data with a value at the end of the distribution. The value can be determined using the mean plus or minus a number of times the standard deviation, or using the inter-quartile range proximity rule. The value can also be determined as a factor of the maximum value. See the API Reference below for more details.
The user decides whether the replacement values should come from the right or left tail of the variable distribution.
It works only with numerical variables. A list of variables can be indicated, or the imputer will automatically select all numerical variables in the train set.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.imputation import EndTailImputer
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'], test_size=0.3, random_state=0)
# set up the imputer
tail_imputer = EndTailImputer(imputation_method='gaussian',
tail='right',
fold=3,
variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
tail_imputer.fit(X_train)
# transform the data
train_t = tail_imputer.transform(X_train)
test_t = tail_imputer.transform(X_test)
# plot the distribution of the variable before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
plt.show()
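To check the replacement values learned during fit, you can inspect the imputer_dict_ attribute (the exact numbers depend on the dataset):

# replacement value learned for each imputed variable
print(tail_imputer.imputer_dict_)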
