DropFeatures

API Reference

class feature_engine.selection.DropFeatures(features_to_drop)

DropFeatures() drops the variable or list of variables indicated by the user from the dataframe.

When is this transformer useful?

Sometimes we create new variables by combining other variables in the dataset. For example, we obtain the variable age by subtracting date_of_birth from date_of_application. Once we have the new variable, we no longer need the date variables in the dataset, so we can add DropFeatures() to the Pipeline to remove them.

Parameters
features_to_drop: str or list

Variable(s) to be dropped from the dataframe

Attributes
n_features_in_:

The number of features in the train set used in fit.

Methods

fit:

This transformer does not learn any parameter.

transform:

Drops indicated features.

fit_transform:

Fit to data, then transform it.

fit(X, y=None)

This transformer does not learn any parameter.

Verifies that the input X is a pandas dataframe, and that the variables to drop exist in the training dataframe.
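The two checks described above can be pictured with a plain-pandas sketch. The helper check_features_to_drop is hypothetical and is not the library's source code; the exception types it raises are assumptions, and the library may raise different errors or messages.

```python
import pandas as pd


def check_features_to_drop(X, features_to_drop):
    """Hypothetical helper mirroring the checks described for fit()."""
    # Check 1: the input must be a pandas dataframe.
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame")

    # Accept either a single name or a list of names.
    if isinstance(features_to_drop, str):
        names = [features_to_drop]
    else:
        names = list(features_to_drop)

    # Check 2: every variable to drop must exist in the dataframe.
    missing = [name for name in names if name not in X.columns]
    if missing:
        raise KeyError(f"Variables not found in the dataframe: {missing}")
    return names


X = pd.DataFrame({"age": [22, 38], "fare": [7.25, 71.28]})

found = check_features_to_drop(X, "fare")  # a single name passes

try:
    check_features_to_drop(X, ["fare", "ticket"])  # 'ticket' is absent
    failed = False
except KeyError:
    failed = True
```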

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe

y: pandas Series, default = None

y is not needed for this transformer. You can pass y or None.

Returns
self

transform(X)

Returns the dataframe with the selected features, that is, without the features listed in features_to_drop.

Parameters
X: pandas dataframe of shape = [n_samples, n_features].

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.

Return type

DataFrame

Example

DropFeatures() drops the variables indicated by the user from the original dataframe. The user can pass a single variable as a string or several variables as a list.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.selection import DropFeatures

# Load the Titanic dataset from OpenML and apply light cleaning
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # '?' marks missing values in this file
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin value (missing values become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # fill missing ports of embarkation by assignment rather than in place
    data['embarked'] = data['embarked'].fillna('C')
    return data

# load data as pandas dataframe
data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

# original columns
X_train.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin',
       'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')
# set up the transformer
transformer = DropFeatures(
    features_to_drop=['sibsp', 'parch', 'ticket', 'fare', 'body', 'home.dest']
)

# fit the transformer
transformer.fit(X_train)

# transform the data
train_t = transformer.transform(X_train)

train_t.columns
Index(['pclass', 'sex', 'age', 'cabin', 'embarked', 'boat'],
      dtype='object')