SelectByTargetMeanPerformance
API Reference
class feature_engine.selection.SelectByTargetMeanPerformance(variables=None, scoring='roc_auc_score', threshold=0.5, bins=5, strategy='equal_width', cv=3, random_state=None)
SelectByTargetMeanPerformance() selects features by using the mean value of the target per category, or per bin if the variable is numerical, as a proxy for the target estimate, and measuring the performance of that proxy.
Works with both numerical and categorical variables.
The transformer works as follows:
1. Separates the training set into a train and a test set.
Then, for each categorical variable:
2. Determines the mean value of the target for each category of the variable using the train set (the equivalent of target mean encoding).
3. Replaces the categories in the test set with the target mean values learned from the train set.
4. Calculates the roc-auc or r2 of the encoded variable against the target.
5. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.
And for each numerical variable:
2. Discretizes the variable into intervals of equal width or equal frequency (using the discretizers of Feature-engine).
3. Determines the mean value of the target for each interval of the variable using the train set (the equivalent of target mean encoding).
4. Replaces the intervals in the test set with the target mean values learned from the train set.
5. Calculates the roc-auc or r2 of the encoded variable against the target.
6. Selects the features whose roc-auc or r2 is bigger than the indicated threshold.
A minimal sketch of this procedure for a single categorical variable is shown below.
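To make the procedure concrete, here is a minimal sketch of the performance proxy for one categorical variable, using a made-up toy dataset. The data and variable names are purely illustrative; the transformer automates all of these steps internally.

import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# toy data: one categorical variable and a binary target (illustrative only)
X = pd.DataFrame({"cabin": list("ABCABCABCABC")})
y = pd.Series([1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# mean target per category, learned on the train set
# (the equivalent of target mean encoding)
mean_target = y_train.groupby(X_train["cabin"]).mean()

# encode the test set with the train set means; categories not seen in the
# train set fall back to the global train set mean
encoded = X_test["cabin"].map(mean_target).fillna(y_train.mean())

# the roc-auc of the encoded variable is the feature's performance proxy
print(roc_auc_score(y_test, encoded))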
Parameters
variables: list, default=None
The list of variables to evaluate. If None, the transformer will evaluate all variables in the dataset.
scoring: str, default='roc_auc_score'
The metric used to perform the feature selection. The current implementation supports 'roc_auc_score' and 'r2_score'.
threshold: float, default=0.5
The performance threshold above which a feature will be selected.
bins: int, default=5
If the dataset contains numerical variables, the number of bins into which the values will be sorted.
strategy: str, default='equal_width'
Whether the bins used to discretize the numerical variables should be of equal width or equal frequency.
cv: int, default=3
Desired number of cross-validation folds used to fit the estimator.
random_state: int, default=None
The random state setting in the train_test_split method.
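The bins and strategy parameters control how numerical variables are discretized before the target mean is computed per interval. The transformer relies on Feature-engine's discretizers internally; the snippet below only illustrates the difference between the two strategies with plain pandas (pd.cut for equal width, pd.qcut for equal frequency) on made-up data.

import numpy as np
import pandas as pd

# an illustrative, right-skewed numerical variable
fare = pd.Series(np.random.default_rng(0).gamma(2.0, 30.0, size=200))

# strategy="equal_width": 5 intervals spanning value ranges of the same size
equal_width = pd.cut(fare, bins=5)

# strategy="equal_frequency": 5 intervals holding roughly the same number
# of observations
equal_frequency = pd.qcut(fare, q=5)

print(equal_width.value_counts().sort_index())
print(equal_frequency.value_counts().sort_index())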
Attributes
features_to_drop_:
List with the features to remove from the dataset.
feature_performance_:
Dictionary with the performance proxy per feature.
Methods
fit:
Find the important features.
transform:
Reduce X to the selected features.
fit_transform:
Fit to data, then transform it.
Example
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from feature_engine.selection import SelectByTargetMeanPerformance
# load data
data = pd.read_csv('../titanic.csv')
# extract cabin letter
data['cabin'] = data['cabin'].str[0]
# replace infrequent cabins by N
data['cabin'] = np.where(data['cabin'].isin(['T', 'G']), 'N', data['cabin'])
# cap maximum values
data['parch'] = np.where(data['parch']>3,3,data['parch'])
data['sibsp'] = np.where(data['sibsp']>3,3,data['sibsp'])
# cast variables as object to treat as categorical
data[['pclass','sibsp','parch']] = data[['pclass','sibsp','parch']].astype('O')
# separate train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived'], axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)
# Feature-engine automates the selection for both categorical and numerical
# variables
sel = SelectByTargetMeanPerformance(
    variables=None,
    scoring="roc_auc_score",
    threshold=0.6,
    bins=3,
    strategy="equal_frequency",
    cv=2,  # cross-validation folds
    random_state=1,  # seed for reproducibility
)
# find important features
sel.fit(X_train, y_train)
sel.variables_categorical_
['pclass', 'sex', 'sibsp', 'parch', 'cabin', 'embarked']
sel.variables_numerical_
['age', 'fare']
sel.feature_performance_
{'pclass': 0.6802934787230475,
'sex': 0.7491365252482871,
'age': 0.5345141148737766,
'sibsp': 0.5720480307315783,
'parch': 0.5243557188989476,
'fare': 0.6600883312700917,
'cabin': 0.6379782658154696,
'embarked': 0.5672382248783936}
sel.features_to_drop_
['age', 'sibsp', 'parch', 'embarked']
# remove features
X_train = sel.transform(X_train)
X_test = sel.transform(X_test)
X_train.shape, X_test.shape
((914, 4), (392, 4))
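As a follow-up, the reduced datasets can feed any downstream estimator. The pipeline below is only a sketch, not part of the transformer's API: it assumes the selected columns shown above ('pclass', 'sex', 'fare' and 'cabin'), imputes missing values, and one-hot encodes the categorical columns before fitting a logistic regression.

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

categorical = ['pclass', 'sex', 'cabin']  # selected categorical features
numerical = ['fare']  # selected numerical feature

model = make_pipeline(
    make_column_transformer(
        (make_pipeline(SimpleImputer(strategy='most_frequent'),
                       OneHotEncoder(handle_unknown='ignore')), categorical),
        (SimpleImputer(strategy='median'), numerical),
    ),
    LogisticRegression(max_iter=1000),
)

model.fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))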