ArbitraryOutlierCapper

The ArbitraryOutlierCapper censors variable values at maximum and minimum values defined in advance by the user. For more details, see the API Reference below.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine import outlier_removers as outr

# Load dataset
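def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # The raw file marks missing values with '?'
    data = data.replace('?', np.nan)
    # Keep only the first character of the cabin value
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # Assign the result instead of calling fillna(inplace=True) on a column
    # selection, which may not modify the dataframe in recent pandas versions
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    return data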

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the capper
capper = outr.ArbitraryOutlierCapper(
    max_capping_dict={'age': 50, 'fare': 200}, min_capping_dict=None)

# fit the capper
capper.fit(X_train)

# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

capper.right_tail_caps_
{'age': 50, 'fare': 200}
train_t[['fare', 'age']].max()
fare    200
age      50
dtype: float64

API Reference

class feature_engine.outlier_removers.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')[source]

The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value specified by the user.

The user must provide the maximum or minimum values used to cap each variable in a dictionary of the form {feature: capping value}.
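To make the capping rule concrete, here is a minimal sketch in plain pandas for a single variable. It is not the library's implementation, and the toy values and the caps of 200 and 10 are assumptions chosen for illustration. Values above the maximum cap are replaced by that cap, values below the minimum cap are replaced by that cap, and everything in between is left unchanged.

```python
import pandas as pd

# Toy data (illustrative values, not taken from the Titanic dataset)
fare = pd.Series([7.25, 512.33, 30.0], name="fare")
max_cap = 200  # illustrative right-tail cap
min_cap = 10   # illustrative left-tail cap

# Keep values at or below the maximum cap; replace the rest with the cap
capped = fare.where(fare <= max_cap, max_cap)
# Keep values at or above the minimum cap; replace the rest with the cap
capped = capped.where(capped >= min_cap, min_cap)

print(capped.tolist())
```

No rows are dropped: the capped series has the same length as the input, which is what distinguishes censoring from trimming.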

Parameters
  • max_capping_dict (dictionary, default=None) – user-specified capping values for the right tail of the distribution (maximum values).

  • min_capping_dict (dictionary, default=None) – user-specified capping values for the left tail of the distribution (minimum values).

  • missing_values (string, default='raise') – Indicates whether missing values should be ignored or should raise an error. If missing_values='raise', the transformer will raise an error if the training set or any other dataset contains missing values.
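The effect of missing_values='raise' can be pictured with the following sketch. It is an illustration under assumptions, not the library's code: the helper name `check_no_missing`, the use of `ValueError`, and the error message are all hypothetical, since the reference above does not state the exact exception.

```python
import numpy as np
import pandas as pd


def check_no_missing(df, variables):
    """Raise if any of the given variables contains NaN (hypothetical helper)."""
    has_na = df[variables].isnull().any()
    if has_na.any():
        offending = has_na[has_na].index.tolist()
        # ValueError is an assumption; the library may raise something else
        raise ValueError(f"Missing values found in: {offending}")


toy = pd.DataFrame({"age": [22.0, np.nan], "fare": [7.25, 71.28]})

try:
    check_no_missing(toy, ["age", "fare"])
    outcome = "no missing values"
except ValueError as err:
    outcome = str(err)

print(outcome)
```

In the example at the top of this page, missing values in age and fare are imputed with the median before the capper is fitted, so a check of this kind would pass for the variables being capped.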

fit(X, y=None)[source]
Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.

  • y (None) – y is not needed in this transformer. You can pass y or None.

right_tail_caps_

The dictionary containing the maximum values at which variables will be capped.

Type

dictionary

left_tail_caps_

The dictionary containing the minimum values at which variables will be capped.

Type

dictionary

transform(X)[source]

Caps the variable values, that is, censors outliers.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Returns

X_transformed – The dataframe with the capped variables.

Return type

pandas dataframe of shape = [n_samples, n_features]
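The fit and transform contract described in this reference can be mimicked with a small stand-in class. This is a sketch under assumptions, not the feature_engine source: `SketchCapper` is a hypothetical name, it skips the missing-value handling, and it only aims to show that fit stores the user's dictionaries as right_tail_caps_ and left_tail_caps_ while transform returns a capped dataframe of the same shape.

```python
import pandas as pd


class SketchCapper:
    """Hypothetical stand-in mirroring the documented fit/transform interface."""

    def __init__(self, max_capping_dict=None, min_capping_dict=None):
        self.max_capping_dict = max_capping_dict
        self.min_capping_dict = min_capping_dict

    def fit(self, X, y=None):
        # Nothing is learned from X: the caps are the user's dictionaries
        self.right_tail_caps_ = dict(self.max_capping_dict or {})
        self.left_tail_caps_ = dict(self.min_capping_dict or {})
        return self

    def transform(self, X):
        # Work on a copy so the input dataframe is left untouched
        X_transformed = X.copy()
        for feature, cap in self.right_tail_caps_.items():
            X_transformed[feature] = X_transformed[feature].clip(upper=cap)
        for feature, cap in self.left_tail_caps_.items():
            X_transformed[feature] = X_transformed[feature].clip(lower=cap)
        return X_transformed


# Toy data (illustrative values)
X = pd.DataFrame({"age": [1.0, 40.0, 71.0], "fare": [8.05, 263.0, 13.0]})
sketch = SketchCapper(max_capping_dict={"age": 50, "fare": 200}).fit(X)
X_capped = sketch.transform(X)

print(sketch.right_tail_caps_)
print(X_capped)
```

Because the caps come from the user rather than from the data, fitting on the training set and transforming the test set applies exactly the same thresholds to both.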