DropFeatures

The DropFeatures() drops a list of variables indicated by the user from the original dataframe. The user can pass a single variable as a string or list of variables to be dropped.

DropFeatures() offers similar functionality to pandas.dataframe.drop, but the difference is that DropFeatures() can be integrated into a Scikit-learn pipeline.

When is this transformer useful?

Sometimes, we create new variables combining other variables in the dataset, for example, we obtain the variable age by subtracting date_of_application from date_of_birth. After we obtained our new variable, we do not need the date variables in the dataset any more. Thus, we can add DropFeatures() in the Pipeline to have these removed.

Example

Let’s see how to use DropFeatures() in an example with the Titanic dataset. We first load the data and separate it into train and test:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

from feature_engine.selection import DropFeatures

# Load dataset
def load_titanic():
        data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
        data = data.replace('?', np.nan)
        data['cabin'] = data['cabin'].astype(str).str[0]
        data['pclass'] = data['pclass'].astype('O')
        data['embarked'].fillna('C', inplace=True)
        return data

# load data as pandas dataframe
data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
            data.drop(['survived', 'name'], axis=1),
            data['survived'], test_size=0.3, random_state=0)

Now, we go ahead and print the dataset column names:

# original columns
X_train.columns
Index(['pclass', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin',
       'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

Now, with DropFeatures() we can very easily drop a group of variables. Below we set up the transformer to drop a list of 6 variables:

# set up the transformer
transformer = DropFeatures(
    features_to_drop=['sibsp', 'parch', 'ticket', 'fare', 'body', 'home.dest']
)

# fit the transformer
transformer.fit(X_train)

With fit() this transformer does not learn any parameter. We can go ahead and remove the variables as follows:

# transform the data
train_t = transformer.transform(X_train)

And now, if we print the variable names of the transformed dataset, we see that it has been reduced:

train_t.columns
Index(['pclass', 'sex', 'age', 'cabin', 'embarked' 'boat'],
      dtype='object')

More details

In this Kaggle kernel we feature 3 different end-to-end machine learning pipelines using DropFeatures():

All notebooks can be found in a dedicated repository.