SelectByTargetMeanPerformance

API Reference

class feature_engine.selection.SelectByTargetMeanPerformance(variables=None, scoring='roc_auc_score', threshold=0.5, bins=5, strategy='equal_width', cv=3, random_state=None)[source]

SelectByTargetMeanPerformance() uses the mean value of the target per category, or per interval if the variable is numerical, as a proxy for the target. With this proxy and the real target, the selector determines a performance metric for each feature, and then selects features based on that metric.

SelectByTargetMeanPerformance() works with numerical and categorical variables.

The transformer works as follows:

  1. Separates the dataset into train and test sets.

Then, for each categorical variable:

  2. Determines the mean target value per category using the train set (equivalent of target mean encoding).

  3. Replaces the categories in the test set by the target mean values.

  4. Using the encoded variables and the real target, calculates the roc-auc or r2.

  5. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.

And for each numerical variable:

  2. Discretizes the variable into intervals of equal width or equal frequency (using the discretizers of Feature-engine).

  3. Determines the mean value of the target per interval using the train set.

  4. Replaces the intervals in the test set by the target mean values.

  5. Using the encoded variable and the real target, calculates the roc-auc or r2.

  6. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.
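The categorical branch of these steps can be sketched by hand with pandas and scikit-learn. This is a minimal illustration of the target-mean proxy, not the transformer's actual code; the `colour` feature and the toy data are made up for the example:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary target driven mostly by one categorical feature
# ("colour" is a hypothetical example column).
rng = np.random.RandomState(0)
colour = rng.choice(["red", "green", "blue"], size=200)
noise = rng.rand(200) < 0.1  # flip 10% of the labels
y = pd.Series(((colour == "red") ^ noise).astype(int))
X = pd.DataFrame({"colour": colour})

# Step 1: separate train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Step 2: mean target per category, learned on the train set only.
category_means = y_train.groupby(X_train["colour"]).mean()

# Step 3: encode the test set categories with those means.
encoded = X_test["colour"].map(category_means)

# Step 4: score the encoded feature against the real target.
score = roc_auc_score(y_test, encoded)
print(score)
```

A feature that carries signal about the target produces a score well above 0.5, and would survive a threshold like the default of 0.5.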

Parameters
variables: list, default=None

The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.

scoring: string, default='roc_auc_score'

The performance metric used to select features. The current implementation supports 'roc_auc_score' and 'r2_score'.

threshold: float, default=0.5

The performance threshold above which a feature will be selected.

bins: int, default = 5

If the dataset contains numerical variables, the number of bins into which the values will be sorted.

strategy: str, default='equal_width'

Whether to create the bins for discretization of numerical variables of equal width or equal frequency.

cv: int, default=3

Desired number of cross-validation folds to be used to fit the estimator.

random_state: int, default=None

The random state setting in the train_test_split method.
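The `strategy` and `bins` parameters control how numerical variables are discretized before the target mean is computed per interval. The two strategies correspond roughly to pandas' `cut` and `qcut`; this is an illustrative sketch with made-up skewed data, not Feature-engine's internal code:

```python
import numpy as np
import pandas as pd

# Skewed toy data: 97 exponential draws plus a few large outliers.
rng = np.random.RandomState(0)
values = pd.Series(np.concatenate([rng.exponential(size=97), [8.0, 9.0, 10.0]]))

# equal_width: each bin spans the same range of values,
# so skewed data piles up in the lowest bins.
equal_width = pd.cut(values, bins=5)

# equal_frequency: each bin holds (roughly) the same number of observations.
equal_freq = pd.qcut(values, q=5)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With skewed variables, equal-frequency bins usually give the per-interval target means more support, since no interval ends up nearly empty.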

Attributes

features_to_drop_:

List with the features to remove from the dataset.

feature_performance_:

Dictionary with the performance proxy per feature.

variables_:

The variables to consider for the feature selection.

n_features_in_:

The number of features in the train set used in fit.

Methods

fit:

Find the important features.

transform:

Reduce X to the selected features.

fit_transform:

Fit to data, then transform it.

fit(X, y)[source]

Find the important features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

y: array-like of shape (n_samples)

Target variable. Required to train the estimator.

Returns
self
transform(X)[source]

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features].

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.


Example

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from feature_engine.selection import SelectByTargetMeanPerformance

# load data
data = pd.read_csv('../titanic.csv')

# extract cabin letter
data['cabin'] = data['cabin'].str[0]

# replace infrequent cabins by N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

# cap maximum values
data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])
data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])

# cast variables as object to treat as categorical
data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)


# feature engine automates the selection for both categorical and numerical
# variables
sel = SelectByTargetMeanPerformance(
    variables=None,
    scoring="roc_auc_score",
    threshold=0.6,
    bins=3,
    strategy="equal_frequency",
    cv=2,  # cross-validation folds
    random_state=1,  # seed for reproducibility
)

# find important features
sel.fit(X_train, y_train)

sel.variables_categorical_
['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']
sel.variables_numerical_
['age', 'fare']
sel.feature_performance_
{'pclass': 0.6802934787230475,
 'sex': 0.7491365252482871,
 'age': 0.5345141148737766,
 'sibsp': 0.5720480307315783,
 'parch': 0.5243557188989476,
 'fare': 0.6600883312700917,
 'cabin': 0.6379782658154696,
 'embarked': 0.5672382248783936}
sel.features_to_drop_
['age', 'sibsp', 'parch', 'embarked']
# remove features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape
((914, 4), (392, 4))