SelectByTargetMeanPerformance

API Reference

class feature_engine.selection.SelectByTargetMeanPerformance(variables=None, scoring='roc_auc_score', threshold=0.5, bins=5, strategy='equal_width', cv=3, random_state=None)[source]

SelectByTargetMeanPerformance() selects features by using the mean value of the target per category (or per interval, if the variable is numerical) as a proxy prediction of the target, and measuring the performance of that prediction.

Works with both numerical and categorical variables.

The transformer works as follows:

  1. Separates the input dataset into train and test sets.

Then, for each categorical variable:

2. Determines the mean value of the target for each category of the variable using the train set (equivalent to target mean encoding).

3. Replaces the categories in the test set with the target mean values learned from the train set.

4. Calculates the roc-auc or r2 using the encoded variable.

5. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.

And for each numerical variable:

2. Discretises the variable into intervals of equal width or equal frequency (using the discretisers of Feature-engine).

3. Determines the mean value of the target for each interval of the variable using the train set (equivalent to target mean encoding).

4. Replaces the intervals in the test set with the target mean values learned from the train set.

5. Calculates the roc-auc or r2 using the encoded variable.

6. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.
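The per-variable procedure above can be sketched with pandas and scikit-learn. This is a minimal illustration on hypothetical toy data, not the transformer's internal code:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical toy data: one categorical feature and a binary target,
# already separated into train and test sets (step 1)
train = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "C", "C"],
    "survived": [1, 1, 0, 1, 0, 0],
})
test = pd.DataFrame({
    "cabin": ["A", "B", "C", "A"],
    "survived": [1, 0, 0, 1],
})

# Steps 2-3: mean target per category, learned on the train set
means = train.groupby("cabin")["survived"].mean()  # A: 1.0, B: 0.5, C: 0.0

# Step 4: replace the categories in the test set with the learned means
encoded = test["cabin"].map(means)

# Step 5: the roc-auc of the encoded variable is the feature's proxy performance
score = roc_auc_score(test["survived"], encoded)
print(score)  # 1.0: the category means rank the test targets perfectly
```

A feature whose category means separate the target well gets a high score and is kept; one whose means carry no signal scores near 0.5 (roc-auc) or 0 (r2) and is dropped.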

Parameters
variables: list, default=None

The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.

scoring: string, default='roc_auc_score'

The metric used to evaluate the performance of the encoded variable. The current implementation supports 'roc_auc_score' and 'r2_score'.

threshold: float, default=0.5

The performance threshold above which a feature will be selected.

bins: int, default=5

If the dataset contains numerical variables, the number of bins into which the values will be sorted.

strategy: str, default='equal_width'

Whether to create bins of equal width or equal frequency for the discretisation of numerical variables.

cv: int, default=3

Desired number of cross-validation folds to be used to fit the estimator.

random_state: int, default=None

The random state setting in the train_test_split method.
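The two strategy options behave like pandas cut and qcut (a sketch on made-up data; internally the transformer uses Feature-engine's discretisers):

```python
import pandas as pd

# Hypothetical skewed numerical variable
fare = pd.Series([1, 2, 3, 4, 10, 20, 100, 200])

# strategy="equal_width": each bin spans an equal range of the variable,
# so skewed data piles up in the lower bins
equal_width = pd.cut(fare, bins=4)
print(equal_width.value_counts(sort=False).tolist())  # [6, 1, 0, 1]

# strategy="equal_frequency": bin edges are quantiles, so each bin
# holds (roughly) the same number of observations
equal_freq = pd.qcut(fare, q=4)
print(equal_freq.value_counts(sort=False).tolist())  # [2, 2, 2, 2]
```

With skewed variables, equal-width bins can leave some intervals almost empty, making the per-bin target mean noisy; equal-frequency bins avoid this at the cost of uneven interval widths.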

Attributes

features_to_drop_:

List with the features to remove from the dataset.

feature_performance_:

Dictionary with the performance proxy per feature.

Methods

fit:

Find the important features.

transform:

Reduce X to the selected features.

fit_transform:

Fit to data, then transform it.

fit(X, y)[source]

Find the important features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe

y: array-like of shape (n_samples)

Target variable. Required to train the estimator.

Returns
self
transform(X)[source]

Return dataframe with selected features.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input dataframe.

Returns
X_transformed: pandas dataframe of shape = [n_samples, n_selected_features]

Pandas dataframe with the selected features.


Example

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from feature_engine.selection import SelectByTargetMeanPerformance

# load data
data = pd.read_csv('../titanic.csv')

# extract cabin letter
data['cabin'] = data['cabin'].str[0]

# replace infrequent cabins by N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])

# cap maximum values
data['parch'] = np.where(data['parch'] > 3, 3, data['parch'])
data['sibsp'] = np.where(data['sibsp'] > 3, 3, data['sibsp'])

# cast variables as object to treat as categorical
data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')

# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)


# Feature-engine automates the selection for both categorical and numerical
# variables
sel = SelectByTargetMeanPerformance(
    variables=None,
    scoring="roc_auc_score",
    threshold=0.6,
    bins=3,
    strategy="equal_frequency",
    cv=2,  # cross-validation folds
    random_state=1,  # seed for reproducibility
)

# find important features
sel.fit(X_train, y_train)

sel.variables_categorical_
['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']
sel.variables_numerical_
['age', 'fare']
sel.feature_performance_
{'pclass': 0.6802934787230475,
 'sex': 0.7491365252482871,
 'age': 0.5345141148737766,
 'sibsp': 0.5720480307315783,
 'parch': 0.5243557188989476,
 'fare': 0.6600883312700917,
 'cabin': 0.6379782658154696,
 'embarked': 0.5672382248783936}
sel.features_to_drop_
['age', 'sibsp', 'parch', 'embarked']
# remove features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)

X_train.shape, X_test.shape
((914, 4), (392, 4))