DecisionTreeDiscretiser

class feature_engine.discretisation.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)

The DecisionTreeDiscretiser() replaces numerical variables with discrete (i.e., finite) variables whose values are the predictions of a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf

The DecisionTreeDiscretiser() trains one decision tree per variable. It then replaces the values of each variable with the predictions of the corresponding tree.

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables to transform can be indicated. Alternatively, the discretiser will automatically select all numerical variables.
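The per-variable procedure described above can be sketched with Scikit-learn alone. This is a minimal illustration of the idea, not feature_engine's actual implementation; the toy data, the column name x, and the grid over max_depth are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Toy data (assumption): one numerical variable and a noisy step-shaped target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"x": rng.integers(1, 100, size=200)})
y = pd.Series(np.where(X["x"] > 50, 10.0, 0.0) + rng.normal(size=200))

# Grid-search a shallow tree on the single variable "x".
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X[["x"]], y)

# Each leaf of the tree yields one prediction, so the transformed
# variable takes at most as many distinct values as the tree has leaves.
x_discrete = search.predict(X[["x"]])
n_bins = len(np.unique(x_discrete))
```

With max_depth capped at 4, the tree has at most 16 leaves, so x_discrete holds at most 16 distinct values regardless of how many distinct values x had.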

More details in the User Guide.

Parameters
variables: list, default=None

The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

cv: int, cross-validation generator or an iterable, default=3

Determines the cross-validation splitting strategy. Possible inputs for cv are an int (the number of folds), a cross-validation generator, or an iterable.

For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls. For more details check Scikit-learn’s cross_validate’s documentation.

scoring: str, default='neg_mean_squared_error'

Desired metric to optimise the performance of the tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

param_grid: dictionary, default=None

The hyperparameters for the decision tree to test with a grid search. The param_grid can contain any of the permitted hyperparameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). If None, then param_grid will optimise the ‘max_depth’ over [1, 2, 3, 4].

regression: boolean, default=True

Indicates whether the discretiser should train a regression or a classification decision tree.

random_state: int, default=None

The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

Attributes
binner_dict_:

Dictionary containing the fitted tree per variable.

scores_dict_:

Dictionary with the score of the best decision tree per variable.

variables_:

The group of variables that will be transformed.

feature_names_in_:

List with the names of features seen during fit.

n_features_in_:

The number of features in the train set used in fit.

References

[1] Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from feature_engine.discretisation import DecisionTreeDiscretiser
>>> np.random.seed(42)
>>> X = pd.DataFrame(dict(x=np.random.randint(1, 100, 100)))
>>> y_reg = pd.Series(np.random.randn(100))
>>> dtd = DecisionTreeDiscretiser(random_state=42)
>>> dtd.fit(X, y_reg)
>>> dtd.transform(X)["x"].value_counts()
-0.090091    90
0.479454    10
Name: x, dtype: int64

You can also apply it to classification problems by setting regression=False and adjusting the scoring metric.

>>> y_clf = pd.Series(np.random.randint(0, 2, 100))
>>> dtd = DecisionTreeDiscretiser(regression=False, scoring="f1", random_state=42)
>>> dtd.fit(X, y_clf)
>>> dtd.transform(X)["x"].value_counts()
0.480769    52
0.687500    48
Name: x, dtype: int64

Methods

fit:

Fit a decision tree per variable.

fit_transform:

Fit to data, then transform it.

get_feature_names_out:

Get output feature names for transformation.

get_params:

Get parameters for this estimator.

set_params:

Set the parameters of this estimator.

transform:

Replace continuous variable values by the predictions of the decision tree.

fit(X, y)

Fit one decision tree per variable to discretise, using cross-validation and a grid search over the hyperparameters.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training dataset. Can be the entire dataframe, not just the variables to be transformed.

y: pandas series.

Target variable. Required to train the decision tree.

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
X: array-like of shape (n_samples, n_features)

Input samples.

y: array-like of shape (n_samples,) or (n_samples, n_outputs), default=None

Target values (None for unsupervised transformations).

**fit_params: dict

Additional fit parameters.

Returns
X_new: ndarray of shape (n_samples, n_features_new)

Transformed array.

get_feature_names_out(input_features=None)

Get output feature names for transformation. In other words, returns the variable names of the transformed dataframe.

Parameters
input_features: array or list, default=None

This parameter exists only for compatibility with the Scikit-learn pipeline.

  • If None, then feature_names_in_ is used as feature names in.

  • If an array or list, then input_features must match feature_names_in_.

Returns
feature_names_out: list

Transformed feature names.

rtype

List[Union[str, int]]

get_metadata_routing()

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns
routing: MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: dict

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: estimator instance

Estimator instance.

transform(X)

Replaces the original variable values with the predictions of the tree. Because a tree has a finite number of leaves, its predictions take a finite number of values, i.e., they are discrete.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The input samples.

Returns
X_new: pandas dataframe of shape = [n_samples, n_features]

The dataframe with transformed variables.

rtype

DataFrame