# OutlierTrimmer#

The `OutlierTrimmer()` removes values beyond an automatically generated minimum and/or maximum values. The minimum and maximum values can be calculated in 1 of 3 ways:

Gaussian limits:

• right tail: mean + 3* std

• left tail: mean - 3* std

IQR limits:

• right tail: 75th quantile + 3* IQR

• left tail: 25th quantile - 3* IQR

where IQR is the inter-quartile range: 75th quantile - 25th quantile.

percentiles or quantiles:

• right tail: 95th percentile

• left tail: 5th percentile

Example

Let’s remove some outliers in the Titanic Dataset. First, let’s load the data and separate it into train and test:

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.outliers import OutlierTrimmer

data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].astype(str).str[0]
data['pclass'] = data['pclass'].astype('O')
data['embarked'].fillna('C', inplace=True)
data['fare'] = data['fare'].astype('float')
data['fare'].fillna(data['fare'].median(), inplace=True)
data['age'] = data['age'].astype('float')
data['age'].fillna(data['age'].median(), inplace=True)
return data

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['survived', 'name', 'ticket'], axis=1),
data['survived'], test_size=0.3, random_state=0)
```

Now, we will set the `OutlierTrimmer()` to remove outliers at the right side of the distribution only (param `tail`). We want the maximum values to be determined using the 75th quantile of the variable (param `capping_method`) plus 1.5 times the IQR (param `fold`). And we only want to cap outliers in 2 variables, which we indicate in a list.

```# set up the capper
capper = OutlierTrimmer(capping_method='iqr', tail='right', fold=1.5, variables=['age', 'fare'])

# fit the capper
capper.fit(X_train)
```

With `fit()`, the `OutlierTrimmer()` finds the values at which it should cap the variables. These values are stored in its attribute:

```capper.right_tail_caps_
```
```{'age': 53.0, 'fare': 66.34379999999999}
```

We can now go ahead and remove the outliers:

```# transform the data
train_t= capper.transform(X_train)
test_t= capper.transform(X_test)
```

If we evaluate now the maximum of the variables in the transformed datasets, they should be <= the values observed in the attribute `right_tail_caps_`:

```train_t[['fare', 'age']].max()
```
```fare    65.0
age     53.0
dtype: float64
```

## More details#

You can find more details about the `OutlierTrimmer()` functionality in the following notebook:

All notebooks can be found in a dedicated repository.