DecisionTreeEncoder

API Reference

class feature_engine.encoding.DecisionTreeEncoder(encoding_method='arbitrary', cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None, variables=None, ignore_format=False)

The DecisionTreeEncoder() encodes categorical variables with predictions of a decision tree.

The encoder first fits a decision tree using a single feature and the target (fit), and then replaces the values of the original feature with the predictions of the tree (transform). The transformer trains one decision tree for each feature to encode.

The motivation is to try and create monotonic relationships between the categorical variables and the target.

Under the hood, the categorical variable is first encoded into integers with the OrdinalEncoder(). The integers can be assigned arbitrarily to the categories or following the mean value of the target in each category. A decision tree is then fitted on the resulting numerical variable to predict the target variable. Finally, the original categorical variable values are replaced by the predictions of the decision tree.
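The three steps above can be sketched with pandas and scikit-learn directly. This is a minimal illustration on a made-up toy dataset, not the library's actual implementation: the names df, ordinal_map and tree are hypothetical, the integers are ordered by the target mean, and the tree depth is fixed rather than tuned by grid search.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data, made up for illustration only
df = pd.DataFrame({
    "colour": ["red", "red", "blue", "blue", "green", "green"],
    "target": [1.0, 1.0, 0.0, 0.0, 1.0, 0.0],
})

# Step 1: ordinal encoding, with integers ordered by the target mean per category
means = df.groupby("colour")["target"].mean().sort_values()
ordinal_map = {category: i for i, category in enumerate(means.index)}
x_ordinal = df["colour"].map(ordinal_map).to_frame()

# Step 2: fit a decision tree on the single numerical feature
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(x_ordinal, df["target"])

# Step 3: replace the categories with the predictions of the tree
df["colour_encoded"] = tree.predict(x_ordinal)
print(df)
```

Because the integers follow the target mean, the encoded values in this toy case increase with the ordinal code, which is the kind of monotonic relationship the encoder aims for.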

By default, the DecisionTreeEncoder() encodes only categorical variables (type 'object' or 'categorical'). You can pass a list of variables to encode; otherwise the encoder will find and encode all categorical variables. With ignore_format=True, numerical variables can be encoded as well. In that case, you can either pass the list of variables to encode, or the transformer will automatically select all variables.

Parameters
encoding_method: str, default='arbitrary'

The categorical encoding method that will be used to encode the original categories to numerical values.

'ordered': the categories are numbered in ascending order according to the target mean value per category.

'arbitrary': the categories are numbered arbitrarily.

cv: int, default=3

Desired number of cross-validation folds used to fit the decision tree.

scoring: str, default='neg_mean_squared_error'

Metric used to optimise the performance of the decision tree. It comes from sklearn.metrics. See the scikit-learn model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

param_grid: dictionary, default=None

The dictionary of parameters and candidate values over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn's DecisionTreeRegressor() or DecisionTreeClassifier().

If None, then param_grid = {'max_depth': [1, 2, 3, 4]}.
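As a rough illustration of what a grid search over param_grid involves, the sketch below tunes max_depth with scikit-learn's GridSearchCV, using the default values listed above (cv=3, scoring='neg_mean_squared_error'). The synthetic data and the variable names are assumptions made for this example, and the way the encoder wires the search internally may differ.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic single feature: ordinal codes 0..4, with a target that grows with the code
x = rng.integers(0, 5, size=200).reshape(-1, 1)
y = x.ravel() + rng.normal(scale=0.1, size=200)

# Same grid as the documented default when param_grid is None
param_grid = {"max_depth": [1, 2, 3, 4]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(x, y)

# Depth selected by 3-fold cross-validation
print(search.best_params_)
```

Any other parameter accepted by the tree, for example min_samples_leaf, could be added to the dictionary in the same way.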

regression: boolean, default=True

Indicates whether the encoder should train a regression or a classification decision tree.

random_state: int, default=None

The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

variables: list, default=None

The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

ignore_format: bool, default=False

Whether the format in which the categorical variables are cast should be ignored. If False, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
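The effect of ignore_format on automatic variable selection can be pictured with plain pandas. This is only a sketch of the selection rule described above, using a made-up dataframe and hypothetical variable names; it does not call the encoder and is not its actual selection code.

```python
import pandas as pd

# Made-up dataframe with one column of each kind
df = pd.DataFrame({
    "city": pd.Series(["London", "Paris", "London"], dtype="object"),
    "size": pd.Categorical(["S", "M", "S"]),
    "floor": [1, 2, 3],
})

# ignore_format=False: only object and categorical columns are candidates
categorical_vars = list(df.select_dtypes(include=["object", "category"]).columns)

# ignore_format=True: every column is a candidate, numerical ones included
all_vars = list(df.columns)

print(categorical_vars)
print(all_vars)
```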

Attributes

encoder_:

sklearn Pipeline containing the ordinal encoder and the decision tree.

variables_:

The group of variables that will be transformed.

n_features_in_:

The number of features in the train set used in fit.

See also

sklearn.tree.DecisionTreeRegressor
sklearn.tree.DecisionTreeClassifier
feature_engine.discretisation.DecisionTreeDiscretiser
feature_engine.encoding.RareLabelEncoder
feature_engine.encoding.OrdinalEncoder

Notes

The authors originally designed this method to work with numerical variables. Numerical variables can be replaced by the predictions of a decision tree using the DecisionTreeDiscretiser().

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
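The following pandas sketch shows why unseen categories end up as NaN and how grouping infrequent categories avoids it. The mapping values, the 0.3 frequency threshold and the "Rare" label are assumptions chosen for illustration; this is not the RareLabelEncoder() implementation.

```python
import pandas as pd

train = pd.Series(["A", "A", "B", "B", "C"], dtype="object")
test = pd.Series(["A", "B", "D"], dtype="object")  # "D" never appears in train

# Hypothetical learned mapping: category -> encoded value
mapping = {"A": 0.2, "B": 0.7, "C": 0.5}

# Categories without an entry in the mapping become NaN
encoded = test.map(mapping)
print(encoded.isna().sum())

# Group categories below the frequency threshold under a single label
counts = train.value_counts(normalize=True)
frequent = counts[counts >= 0.3].index
train_grouped = train.where(train.isin(frequent), "Rare")
test_grouped = test.where(test.isin(frequent), "Rare")
print(test_grouped.tolist())
```

After grouping, the unseen category "D" falls under the same label as the infrequent training category "C", so every test label has been seen during training.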

References

1

Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Methods

fit:

Fit a decision tree per variable.

transform:

Replace categorical variable by the predictions of the decision tree.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)

Fit a decision tree per variable.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the categorical variables.

y: pandas series

The target variable. Required to train the decision tree and for ordered ordinal encoding.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame.

  • If the user enters non-categorical variables (unless ignore_format is True)

ValueError
  • If there are no categorical variables in the df or the df is empty

  • If the variable(s) contain null values

inverse_transform(X)

inverse_transform is not implemented for this transformer.

transform(X)

Replace categorical variable by the predictions of the decision tree.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input samples.

Returns
X: pandas dataframe of shape = [n_samples, n_features]

Dataframe with variables encoded with decision tree predictions.

Return type

DataFrame

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values

  • If dataframe is not of same size as that used in fit()

Warning

If NaN values were introduced after encoding.

Example

The DecisionTreeEncoder() replaces categories in the variable with the predictions of a decision tree. The transformer first encodes categorical variables into numerical variables using ordinal encoding. You have the option to have the integers assigned to the categories as they appear in the variable, or ordered by the mean value of the target per category. The transformer then fits a decision tree on this numerical variable to predict the target variable. Finally, the original categorical variable is replaced by the predictions of the decision tree.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import DecisionTreeEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # Missing values are marked with '?' in the raw file
    data = data.replace('?', np.nan)
    # Keep only the first letter of the cabin (missing cabins become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    # Cast pclass as object so that it is treated as categorical
    data['pclass'] = data['pclass'].astype('O')
    # Assign the result back instead of calling fillna in place on a column
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)

X_train[['cabin', 'pclass', 'embarked']].head(10)
      cabin pclass embarked
501      n      2        S
588      n      2        S
402      n      2        C
1193     n      3        Q
686      n      3        Q
971      n      3        Q
117      E      1        C
540      n      2        S
294      C      1        C
261      E      1        S
# set up the encoder
encoder = DecisionTreeEncoder(variables=['cabin', 'pclass', 'embarked'], random_state=0)

# fit the encoder
encoder.fit(X_train, y_train)

# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)

train_t[['cabin', 'pclass', 'embarked']].head(10)
         cabin    pclass  embarked
501   0.304843  0.307580  0.338957
588   0.304843  0.307580  0.338957
402   0.304843  0.307580  0.558011
1193  0.304843  0.307580  0.373494
686   0.304843  0.307580  0.373494
971   0.304843  0.307580  0.373494
117   0.649533  0.617391  0.558011
540   0.304843  0.307580  0.338957
294   0.649533  0.617391  0.558011
261   0.649533  0.617391  0.338957