DropConstantFeatures#

Constant features are variables that show zero variability, or, in other words, have the same value in all rows. A key step towards training a machine learning model is to identify and remove constant features.

Features with no or low variability rarely constitute useful predictors. Hence, removing them right at the beginning of the data science project is a good way of simplifying your dataset and subsequent data preprocessing pipelines.

Filter methods are selection algorithms that select or remove features based solely on their characteristics. In this light, removing constant features could be considered part of the filter group of selection algorithms.

In Python, we can find constant features using pandas' std() or unique() methods, and then remove them with drop().
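
For example, a minimal sketch of this manual approach, using a toy dataframe made up for illustration, could look like this:

import pandas as pd

# toy dataframe with one constant column
df = pd.DataFrame({
    "feat_1": [1, 2, 3, 4],
    "feat_2": [0, 0, 0, 0],  # constant feature
    "feat_3": ["a", "b", "a", "c"],
})

# constant features show a single unique value
constant_cols = [col for col in df.columns if len(df[col].unique()) == 1]

# drop the constant features from the dataframe
df = df.drop(columns=constant_cols)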

With Scikit-learn, we can find and remove constant variables with VarianceThreshold to quickly reduce the number of features. VarianceThreshold is part of sklearn.feature_selection’s API.

VarianceThreshold, however, only works with numerical variables. Hence, we could only evaluate categorical variables after encoding them, which requires an additional data preprocessing step just to remove constant variables.
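
As a small sketch of this limitation (the dataframe below is made up for illustration), VarianceThreshold removes zero-variance numerical columns, but categorical columns would need to be encoded before they could be evaluated:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.DataFrame({
    "num_1": [1.0, 2.0, 3.0],
    "num_2": [5.0, 5.0, 5.0],  # constant numerical feature
})

# with the default threshold of 0, zero-variance columns are removed
selector = VarianceThreshold()
df_reduced = selector.fit_transform(df)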

Feature-engine introduces DropConstantFeatures() to find and remove constant and quasi-constant features from a dataframe. DropConstantFeatures() works with numerical, categorical, or datetime variables. It is therefore more versatile than Scikit-learn’s transformer because it allows us to drop constant and quasi-constant variables without the need for prior data transformations.

By default, DropConstantFeatures() drops constant variables. We also have the option to drop quasi-constant features, which are those that show mostly constant values and some other values in a very small percentage of rows.

Because DropConstantFeatures() works with numerical and categorical variables alike, it offers a straightforward way of reducing the feature subset.

Be mindful, though, that depending on the context, quasi-constant variables could be useful.

Example#

Let’s see how to use DropConstantFeatures() by using the Titanic dataset. This dataset does not contain constant or quasi-constant variables, so for the sake of the demonstration, we will consider quasi-constant those features that show the same value in more than 70% of the rows.

We first load the data and separate it into a training set and a test set:

from sklearn.model_selection import train_test_split
from feature_engine.datasets import load_titanic
from feature_engine.selection import DropConstantFeatures

X, y = load_titanic(
    return_X_y_frame=True,
    handle_missing=True,
)


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0,
)

Now, we set up DropConstantFeatures() to remove features that show the same value in more than 70% of the observations. We do this through the parameter tol. The default value for this parameter is 1, in which case the transformer removes only constant features, that is, features that show the same value in all observations.

# set up the transformer
transformer = DropConstantFeatures(tol=0.7)

With fit() the transformer finds the variables to drop:

# fit the transformer
transformer.fit(X_train)

The variables to drop are stored in the attribute features_to_drop_:

transformer.features_to_drop_
['parch', 'cabin', 'embarked', 'body']

We can check that the variables parch and embarked show the same value in more than 70% of the observations as follows:

X_train['embarked'].value_counts(normalize = True)
S          0.711790
C          0.195415
Q          0.090611
Missing    0.002183
Name: embarked, dtype: float64

Based on the previous results, 71% of the passengers embarked in S.

Let’s now evaluate parch:

X_train['parch'].value_counts(normalize = True)
0    0.771834
1    0.125546
2    0.086245
3    0.005459
4    0.004367
5    0.003275
6    0.002183
9    0.001092
Name: parch, dtype: float64

Based on the previous results, 77% of the passengers travelled with 0 parents or children. Because of this, both features were deemed quasi-constant and will be removed in the next step.

We can also inspect quasi-constant variables visually, for example with a bar plot of the value frequencies:

import matplotlib.pyplot as plt

X_train["embarked"].value_counts(normalize=True).plot.bar()
plt.show()

After executing the previous code, we observe the following plot, with more than 70% of passengers embarking in S:

[Bar plot of embarked value frequencies: quasiconstant.png]

With transform(), we drop the quasi-constant variables from the dataset:

train_t = transformer.transform(X_train)
test_t = transformer.transform(X_test)

print(train_t.head())

We see the resulting dataframe below:

      pclass                               name     sex        age  sibsp  \
501        2  Mellinger, Miss. Madeleine Violet  female  13.000000      0
588        2                  Wells, Miss. Joan  female   4.000000      1
402        2     Duran y More, Miss. Florentina  female  30.000000      1
1193       3                 Scanlan, Mr. James    male  29.881135      0
686        3       Bradley, Miss. Bridget Delia  female  22.000000      0

             ticket     fare     boat  \
501          250644  19.5000       14
588           29103  23.0000       14
402   SC/PARIS 2148  13.8583       12
1193          36209   7.7250  Missing
686          334914   7.7250       13

                                              home.dest
501                            England / Bennington, VT
588                                Cornwall / Akron, OH
402                     Barcelona, Spain / Havana, Cuba
1193                                            Missing
686   Kingwilliamstown, Co Cork, Ireland Glens Falls...

Like sklearn, Feature-engine transformers have the fit_transform() method, which allows us to find and remove constant or quasi-constant variables in a single line of code.
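
As a minimal sketch, combining both steps with the transformer used in this example:

transformer = DropConstantFeatures(tol=0.7)
train_t = transformer.fit_transform(X_train)
test_t = transformer.transform(X_test)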

Also like sklearn, DropConstantFeatures() has the get_support() method, which returns a boolean vector: True for the features that will be retained and False for those that will be dropped.

transformer.get_support()
[True, True, True, True, True, False, True, True, False, False,
 True, False, True]
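
Because this boolean vector is aligned with the columns of the training set, we can use it, for example, to list the names of the variables that will be retained (a small sketch based on the X_train dataframe from the previous steps):

# names of the variables that will be retained
retained_features = X_train.columns[transformer.get_support()]
print(list(retained_features))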

This and other feature selection methods may not necessarily avoid overfitting, but they contribute to simplifying our machine learning pipelines and creating more interpretable machine learning models.

More details#

In this Kaggle kernel, we use DropConstantFeatures() together with other feature selection algorithms and then train a logistic regression estimator:

For more details about this and other feature selection methods, check out these resources: