OutlierTrimmer#

The OutlierTrimmer() removes values beyond an automatically generated minimum and/or maximum values. The minimum and maximum values can be calculated in 1 of 3 ways:

Gaussian limits:

  • right tail: mean + 3* std

  • left tail: mean - 3* std

IQR limits:

  • right tail: 75th quantile + 3* IQR

  • left tail: 25th quantile - 3* IQR

where IQR is the inter-quartile range: 75th quantile - 25th quantile.

MAD limits:

  • right tail: median + 3* MAD

  • left tail: median - 3* MAD

where MAD is the median absoulte deviation from the median.

percentiles or quantiles:

  • right tail: 95th percentile

  • left tail: 5th percentile

Example

Let’s remove some outliers in the Titanic Dataset. First, let’s load the data and separate it into train and test:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.outliers import OutlierTrimmer

X, y = load_titanic(
    return_X_y_frame=True,
    predictors_only=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())

We see the resulting data below:

      pclass     sex        age  sibsp  parch     fare    cabin embarked
501        2  female  13.000000      0      1  19.5000  Missing        S
588        2  female   4.000000      1      1  23.0000  Missing        S
402        2  female  30.000000      1      0  13.8583  Missing        C
1193       3    male  29.881135      0      0   7.7250  Missing        Q
686        3  female  22.000000      0      0   7.7250  Missing        Q

Now, we will set the OutlierTrimmer() to remove outliers at the right side of the distribution only (param tail). We want the maximum values to be determined using the 75th quantile of the variable (param capping_method) plus 1.5 times the IQR (param fold). And we only want to cap outliers in 2 variables, which we indicate in a list.

capper = OutlierTrimmer(capping_method='iqr',
                        tail='right',
                        fold=1.5,
                        variables=['age', 'fare'])

capper.fit(X_train)

With fit(), the OutlierTrimmer() finds the values at which it should cap the variables. These values are stored in its attribute:

capper.right_tail_caps_
{'age': 53.0, 'fare': 66.34379999999999}

We can now go ahead and remove the outliers:

train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

If we evaluate now the maximum of the variables in the transformed datasets, they should be <= the values observed in the attribute right_tail_caps_:

train_t[['fare', 'age']].max()
fare    65.0
age     53.0
dtype: float64

More details#

You can find more details about the OutlierTrimmer() functionality in the following notebook:

For more details about this and other feature engineering methods check out these resources: