DropConstantFeatures

API Reference

class feature_engine.selection.DropConstantFeatures(variables=None, tol=1, missing_values='raise')

Drop constant and quasi-constant variables from a dataframe. Constant variables show the same value across all the observations in the dataset. Quasi-constant variables show the same value in almost all the observations in the dataset.

By default, DropConstantFeatures() drops only constant variables. This transformer works with both numerical and categorical variables. The user can indicate a list of variables to examine. Alternatively, the transformer will evaluate all the variables in the dataset.

The transformer will first identify and store the constant and quasi-constant variables. Next, the transformer will drop these variables from a dataframe.
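The two steps above can be sketched in plain pandas for the default case of strictly constant columns. This is only an illustration of the idea, not the library's implementation; the toy dataframe and the names `constant_cols` and `df_reduced` are made up for the example.

```python
import pandas as pd

# Toy data: "country" and "flag" never change, "age" does.
df = pd.DataFrame({
    "age": [23, 35, 41, 58],
    "country": ["UK", "UK", "UK", "UK"],
    "flag": [1, 1, 1, 1],
})

# Step 1: identify and store the columns that hold a single unique value.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

# Step 2: drop the stored columns from the dataframe.
df_reduced = df.drop(columns=constant_cols)

print(constant_cols)             # ['country', 'flag']
print(list(df_reduced.columns))  # ['age']
```

Storing the column names in the first step is what allows the same columns to be removed later from another dataframe, such as a test set.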

Parameters
variables: list, default=None

The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.

tol: float or int, default=1

Threshold to detect constant or quasi-constant features. Variables showing the same value in a proportion of observations greater than tol will be considered constant or quasi-constant and dropped. If tol=1, the transformer removes only constant variables. Otherwise, it removes quasi-constant variables as well.

missing_values: str, default='raise'

Whether missing values should raise an error, be ignored, or be treated as an additional value of the variable when determining whether the feature is constant or quasi-constant. Takes values 'raise', 'ignore', 'include'.
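The interaction between tol and missing_values can be approximated with a short pandas sketch. It does not call the transformer and may differ from the library's internals: the helper names `predominant_share` and `quasi_constant_columns` are hypothetical, the strict `>` comparison follows the wording of the tol description above, and dividing by the total number of rows (missing ones included) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def predominant_share(series, missing_values="ignore"):
    """Share of rows taken by the most frequent value of `series`."""
    if missing_values == "raise" and series.isna().any():
        raise ValueError("The variable contains missing values.")
    # 'include' counts NaN as a value of its own; otherwise NaN is left out of the counts.
    counts = series.value_counts(dropna=(missing_values != "include"))
    # Assumption: the denominator is the total number of rows.
    return counts.max() / len(series)


def quasi_constant_columns(df, tol, missing_values="ignore"):
    """Columns whose most frequent value covers more than `tol` of the rows."""
    return [
        col for col in df.columns
        if predominant_share(df[col], missing_values) > tol
    ]


df = pd.DataFrame({
    "mostly_zero": [0, 0, 0, 0, 0, 0, 0, 0, 1, 2],   # 0 in 80% of rows
    "mostly_missing": [np.nan] * 8 + ["a", "b"],      # NaN in 80% of rows
    "varied": list(range(10)),                        # every value is different
})

print(quasi_constant_columns(df, tol=0.7, missing_values="ignore"))
# ['mostly_zero']
print(quasi_constant_columns(df, tol=0.7, missing_values="include"))
# ['mostly_zero', 'mostly_missing']
```

In this sketch, a column that is mostly empty is only flagged when missing values are included as a value, which is the kind of difference the missing_values parameter is meant to control.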

Attributes

features_to_drop_:

List with constant and quasi-constant features.

variables_:

The variables to consider for the feature selection.

n_features_in_:

The number of features in the train set used in fit.

See also

sklearn.feature_selection.VarianceThreshold

Notes

This transformer is similar in concept to the VarianceThreshold from Scikit-learn, but it evaluates the number of unique values instead of the variance.
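A small pandas-only comparison may help show why the two criteria can disagree. It calls neither transformer, and the toy dataframe is invented for illustration. Variance is only defined for numeric columns and depends on the scale of the values, whereas counting unique values works for any dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Rome", "Rome", "Rome", "Rome"],   # constant, categorical
    "ratio": [0.001, 0.002, 0.001, 0.002],      # varies, but on a tiny scale
    "zeros": [0, 0, 0, 0],                      # constant, numeric
})

# Variance can only be computed for the numeric columns.
variances = df.var(numeric_only=True)

# The number of unique values is available for every column.
unique_counts = df.nunique()

print(variances)
print(unique_counts)
```

Here "ratio" has a variance close to zero although it takes two distinct values, while "city" has no variance at all because it is not numeric, yet it is clearly constant by its unique-value count.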

Methods

fit:

Find constant and quasi-constant features.

transform:

Remove constant and quasi-constant features.

fit_transform:

Fit to the data. Then transform it.

fit(X, y=None)

Find constant and quasi-constant features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

y: None

y is not needed for this transformer. You can pass y or None.

Returns
self

transform(X)

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.

rtype

DataFrame

Example

DropConstantFeatures() drops constant and quasi-constant variables from a dataframe. By default, it drops only constant variables. It works with both numerical and categorical variables. In the example below, tol is set to 0.7, so variables in which more than 70% of the observations share the same value are dropped.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.selection import DropConstantFeatures

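# Load the Titanic dataset and apply light preprocessing
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # '?' marks missing values in this file
    data = data.replace('?', np.nan)
    # Keep only the first character of the cabin value
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # Assign the result instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data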

# load data as pandas dataframe
data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0,
)

# set up the transformer
transformer = DropConstantFeatures(tol=0.7, missing_values='ignore')

# fit the transformer
transformer.fit(X_train)

# transform the data
train_t = transformer.transform(X_train)

# the features identified during fit are stored in features_to_drop_
transformer.features_to_drop_
['parch', 'cabin', 'embarked']

We see in the following code snippets that for the variables parch and embarked, more than 70% of the observations displayed the same value:

X_train['embarked'].value_counts() / len(X_train)
S    0.711790
C    0.197598
Q    0.090611
Name: embarked, dtype: float64

71% of the passengers embarked in S.

X_train['parch'].value_counts() / len(X_train)
0    0.771834
1    0.125546
2    0.086245
3    0.005459
4    0.004367
5    0.003275
6    0.002183
9    0.001092
Name: parch, dtype: float64

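77% of the passengers travelled with no parents or children. Because the most frequent value in each of these features covers more than 70% of the observations, the features were deemed quasi-constant and removed.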