# Winsorizer#

The `Winsorizer()` caps maximum and/or minimum values of a variable at automatically determined values. The minimum and maximum values can be calculated in 1 of 3 different ways:

Gaussian limits:

• right tail: mean + 3* std

• left tail: mean - 3* std

IQR limits:

• right tail: 75th quantile + 3* IQR

• left tail: 25th quantile - 3* IQR

where IQR is the inter-quartile range: 75th quantile - 25th quantile.

percentiles or quantiles:

• right tail: 95th percentile

• left tail: 5th percentile

Example

Let’s cap some outliers in the Titanic Dataset. First, let’s load the data and separate it into train and test:

```import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.outliers import Winsorizer

data = data.replace('?', np.nan)
data['cabin'] = data['cabin'].astype(str).str
data['pclass'] = data['pclass'].astype('O')
data['embarked'].fillna('C', inplace=True)
data['fare'] = data['fare'].astype('float')
data['fare'].fillna(data['fare'].median(), inplace=True)
data['age'] = data['age'].astype('float')
data['age'].fillna(data['age'].median(), inplace=True)
return data

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['survived', 'name', 'ticket'], axis=1),
data['survived'], test_size=0.3, random_state=0)
```

Now, we will set the `Winsorizer()` to cap outliers at the right side of the distribution only (param `tail`). We want the maximum values to be determined using the mean value of the variable (param `capping_method`) plus 3 times the standard deviation (param `fold`). And we only want to cap outliers in 2 variables, which we indicate in a list.

```# set up the capper
capper = Winsorizer(capping_method='gaussian', tail='right', fold=3, variables=['age', 'fare'])

# fit the capper
capper.fit(X_train)
```

With `fit()`, the `Winsorizer()` finds the values at which it should cap the variables. These values are stored in its attribute:

```capper.right_tail_caps_
```
```{'age': 67.49048447470315, 'fare': 174.78162171790441}
```

We can now go ahead and censor the outliers:

```# transform the data
train_t= capper.transform(X_train)
test_t= capper.transform(X_test)
```

If we evaluate now the maximum of the variables in the transformed datasets, they should coincide with the values observed in the attribute `right_tail_caps_`:

```train_t[['fare', 'age']].max()
```
```fare    174.781622
age      67.490484
dtype: float64
```

## More details#

You can find more details about the `Winsorizer()` functionality in the following notebook: