DecisionTreeDiscretiser¶
API Reference¶
- class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)[source]¶
The DecisionTreeDiscretiser() replaces continuous numerical variables by discrete, finite, values estimated by a decision tree.
The method is inspired by the following article by the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select all numerical variables.
The DecisionTreeDiscretiser() first trains a decision tree for each variable.
The DecisionTreeDiscretiser() then transforms the variables, that is, it replaces each variable's values with the predictions of its trained decision tree.
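The idea behind the transformer can be sketched with scikit-learn directly: for each variable, tune a shallow decision tree on that variable alone and replace the variable with the tree's predictions, which take one value per leaf. This is a minimal illustration with synthetic data; the column names are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = pd.DataFrame({"var_a": rng.uniform(0, 100, 500),
                  "var_b": rng.uniform(0, 10, 500)})
y = 3 * X["var_a"] + rng.normal(0, 5, 500)

X_t = X.copy()
for var in ["var_a", "var_b"]:
    # one tree per variable, depth tuned by cross-validation
    tree = GridSearchCV(DecisionTreeRegressor(random_state=0),
                        param_grid={"max_depth": [1, 2, 3, 4]},
                        cv=3, scoring="neg_mean_squared_error")
    tree.fit(X[[var]], y)
    # the tree's predictions are finite: one distinct value per leaf
    X_t[var] = tree.predict(X[[var]])

print(X_t["var_a"].nunique())  # a tree of depth 4 has at most 16 leaves
```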
- Parameters
- variables: list, default=None
The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.
- cv: int, default=3
Desired number of cross-validation folds used to fit the decision tree.
- scoring: str, default=’neg_mean_squared_error’
Metric to optimise when fitting the tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
- param_grid: dictionary, default=None
The grid of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().
If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}
- regression: boolean, default=True
Indicates whether the discretiser should train a regression or a classification decision tree.
- random_state: int, default=None
The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
Attributes
binner_dict_:
Dictionary containing the fitted tree per variable.
scores_dict_:
Dictionary with the score of the best decision tree, over the train set.
variables_:
The variables to discretise.
n_features_in_:
The number of features in the train set used in fit.
See also
sklearn.tree.DecisionTreeClassifier
sklearn.tree.DecisionTreeRegressor
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
Methods
fit:
Fit a decision tree per variable.
transform:
Replace continuous values by the predictions of the decision tree.
fit_transform:
Fit to the data, then transform it.
- fit(X, y)[source]¶
Fit the decision trees. One tree per variable to be transformed.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training dataset. Can be the entire dataframe, not just the variables to be transformed.
- y: pandas series.
Target variable. Required to train the decision tree.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any of the user provided variables are not numerical
- ValueError
If there are no numerical variables in the df or the df is empty
If the variable(s) contain null values
- transform(X)[source]¶
Replaces the original variable values with the predictions of the tree. The tree output is finite, that is, discrete.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The input samples.
- Returns
- X_transformed: pandas dataframe of shape = [n_samples, n_features]
The dataframe with transformed variables.
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the variable(s) contain null values
If the dataframe is not of the same size as the one used in fit()
Example¶
In the original article, each feature of the challenge dataset was recoded by training a decision tree of limited depth (2, 3 or 4) using that feature alone, and letting the tree predict the target. The probabilistic predictions of this decision tree were used as an additional feature, which was then linearly (or at least monotonically) correlated with the target.
According to the authors, the addition of these new features had a significant impact on the performance of linear models.
In the following example, we recode 2 numerical variables using decision trees.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import DecisionTreeDiscretiser
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
data.drop(['Id', 'SalePrice'], axis=1),
data['SalePrice'], test_size=0.3, random_state=0)
# set up the discretisation transformer
disc = DecisionTreeDiscretiser(cv=3,
scoring='neg_mean_squared_error',
variables=['LotArea', 'GrLivArea'],
regression=True)
# fit the transformer
disc.fit(X_train, y_train)
# transform the data
train_t = disc.transform(X_train)
test_t = disc.transform(X_test)
disc.binner_dict_
{'LotArea': GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
iid='warn', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='neg_mean_squared_error', verbose=0),
'GrLivArea': GridSearchCV(cv=3, error_score='raise-deprecating',
estimator=DecisionTreeRegressor(criterion='mse', max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False, random_state=None,
splitter='best'),
iid='warn', n_jobs=None, param_grid={'max_depth': [1, 2, 3, 4]},
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring='neg_mean_squared_error', verbose=0)}
# with tree discretisation, each bin does not necessarily contain
# the same number of observations.
train_t.groupby('GrLivArea')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')