ArbitraryOutlierCapper

API Reference

class feature_engine.outliers.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')

The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.

The user must provide the maximum or minimum capping values for each variable in a dictionary of the form {feature: capping value}.
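To make the capping operation concrete, here is a minimal pandas-only sketch of what capping with {feature: capping value} dictionaries amounts to. It is not feature_engine's implementation; the toy dataframe, the cap values and the use of Series.clip are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data; column names and cap values are made up for illustration.
df = pd.DataFrame({"age": [5, 30, 72], "fare": [7.25, 80.0, 512.33]})

max_caps = {"age": 50, "fare": 200}  # right-tail caps, {feature: capping value}
min_caps = {"age": 18}               # left-tail caps, here only for 'age'

capped = df.copy()
for feature, cap in max_caps.items():
    # Values above the cap are replaced by the cap.
    capped[feature] = capped[feature].clip(upper=cap)
for feature, cap in min_caps.items():
    # Values below the cap are replaced by the cap.
    capped[feature] = capped[feature].clip(lower=cap)

print(capped)
```

Only the variables named in a dictionary are touched, and a variable may appear in one dictionary, both, or neither.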

Parameters
max_capping_dict: dictionary, default=None

Dictionary containing the user-specified capping values for the right tail of the distribution of each variable (maximum values).

min_capping_dict: dictionary, default=None

Dictionary containing the user-specified capping values for the left tail of the distribution of each variable (minimum values).

missing_values: string, default='raise'

Indicates whether missing values should be ignored or should raise an error. If missing_values='raise', the transformer raises an error if the training set or the datasets to transform contain missing values.
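The two behaviours of missing_values can be sketched with a small pandas helper. Everything here is hypothetical: cap_right_tail is not part of feature_engine, the option name 'ignore' and the ValueError are assumptions, and the point is only to show that an "ignore" mode leaves missing values in place while a "raise" mode rejects the data.

```python
import numpy as np
import pandas as pd


def cap_right_tail(df, max_caps, missing_values="raise"):
    """Hypothetical helper, not feature_engine code."""
    variables = list(max_caps)
    if missing_values == "raise" and df[variables].isnull().any().any():
        # Assumed exception type for this sketch.
        raise ValueError("Some of the variables to cap contain missing values.")
    out = df.copy()
    for feature, cap in max_caps.items():
        out[feature] = out[feature].clip(upper=cap)  # NaN stays NaN
    return out


df = pd.DataFrame({"age": [22.0, np.nan, 80.0]})

try:
    cap_right_tail(df, {"age": 50})
    raised = False
except ValueError:
    raised = True

# With 'ignore', the missing value passes through and 80.0 is capped at 50.0.
ignored = cap_right_tail(df, {"age": 50}, missing_values="ignore")
print(raised)
print(ignored)
```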

Attributes

right_tail_caps_:

Dictionary with the maximum values at which variables will be capped.

left_tail_caps_:

Dictionary with the minimum values at which variables will be capped.

Methods

fit:

This transformer does not learn any parameter.

transform:

Cap the variables.

fit_transform:

Fit to the data. Then transform it.

fit(X, y=None)

This transformer does not learn any parameter.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training input samples.

y: pandas Series, default=None

y is not needed in this transformer. You can pass y or None.

Returns
self
Raises
TypeError

If the input is not a pandas DataFrame.
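Since fit() learns nothing from the data, one way to picture it is as a step that validates the input and copies the user's dictionaries into the fitted attributes. The class below is a hypothetical stand-in written for illustration, not feature_engine's source; in particular, storing an empty dictionary when no caps are given is an assumption.

```python
import pandas as pd


class SketchCapper:
    """Hypothetical stand-in, not feature_engine's implementation."""

    def __init__(self, max_capping_dict=None, min_capping_dict=None):
        self.max_capping_dict = max_capping_dict
        self.min_capping_dict = min_capping_dict

    def fit(self, X, y=None):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("X must be a pandas DataFrame.")
        # Nothing is estimated from X: the caps come straight from the user.
        self.right_tail_caps_ = dict(self.max_capping_dict or {})
        self.left_tail_caps_ = dict(self.min_capping_dict or {})
        return self


X = pd.DataFrame({"age": [5, 30, 72]})
capper = SketchCapper(max_capping_dict={"age": 50}).fit(X)

print(capper.right_tail_caps_)
print(capper.left_tail_caps_)
```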

transform(X)

Cap the variable values, that is, censor outliers.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The data to be transformed.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

The dataframe with the capped variables.

Raises
TypeError

If the input is not a pandas DataFrame.

ValueError

If the dataframe does not have the same number of columns as the dataframe used in fit().
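The two error conditions of transform() can be illustrated with a small check. This is a sketch under assumptions: check_transform_input is a hypothetical helper, and "size" is read here as the number of columns seen in fit().

```python
import pandas as pd


def check_transform_input(X, n_features_in):
    """Hypothetical helper, not feature_engine code."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame.")
    if X.shape[1] != n_features_in:
        raise ValueError(
            f"X has {X.shape[1]} columns, but {n_features_in} were seen in fit()."
        )
    return X.copy()


train = pd.DataFrame({"age": [5, 30], "fare": [7.25, 80.0]})
n_features_in = train.shape[1]  # what a fit() step would record

check_transform_input(train, n_features_in)  # same columns: passes

try:
    check_transform_input(train[["age"]], n_features_in)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```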

Example

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.outliers import ArbitraryOutlierCapper

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# set up the capper
capper = ArbitraryOutlierCapper(max_capping_dict={'age': 50, 'fare': 200}, min_capping_dict=None)

# fit the capper
capper.fit(X_train)

# transform the data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

# the capping values stored by the transformer
capper.right_tail_caps_
{'age': 50, 'fare': 200}

# maximum values of the capped variables
train_t[['fare', 'age']].max()
fare    200.0
age      50.0
dtype: float64