DropFeatures

DropFeatures() drops the variables specified by the user from the original dataframe. The user can pass a single variable as a string or several variables as a list.

DropFeatures() offers similar functionality to pandas.DataFrame.drop, with the difference that DropFeatures() can be integrated into a Scikit-learn Pipeline.

When is this transformer useful?

Sometimes we create new variables by combining other variables in the dataset. For example, we obtain the variable age by subtracting date_of_birth from date_of_application. Once we have the new variable, we no longer need the date variables in the dataset. We can add DropFeatures() to the Pipeline to remove them.

Example

Let’s see how to use DropFeatures() in an example with the Titanic dataset. We first load the data and separate it into train and test:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

print(X_train.head())

Next, we print the dataset column names:

X_train.columns
Index(['pclass', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare',
       'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

With DropFeatures() we can drop a group of variables in a single step. Below we set up the transformer to drop a list of 6 variables:

# set up the transformer
transformer = DropFeatures(
    features_to_drop=['sibsp', 'parch', 'ticket', 'fare', 'body', 'home.dest']
)

# fit the transformer
transformer.fit(X_train)

This transformer does not learn any parameters during fit(). We can now remove the variables as follows:

train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

If we now print the variable names of the transformed dataset, we see that the 6 variables have been removed:

train_t.columns
Index(['pclass', 'name', 'sex', 'age', 'cabin', 'embarked', 'boat'], dtype='object')

More details

In this Kaggle kernel we feature 3 different end-to-end machine learning pipelines using DropFeatures():

All notebooks can be found in a dedicated repository.

For more details about this and other feature selection methods check out these resources: