UserInputDiscretiser

The UserInputDiscretiser() sorts the variable values into contiguous intervals whose limits are arbitrarily defined by the user.

When setting up the discretiser, the user must provide a dictionary that maps each variable name to its list of interval limits.

The UserInputDiscretiser() works only with numerical variables. The discretiser will check that the variables entered by the user are present in the train set and cast as numerical.

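import numpy as np
import pandas as pd
# Note: load_boston was removed in scikit-learn 1.2; this example needs an earlier version.
from sklearn.datasets import load_boston
from feature_engine.discretisers import UserInputDiscretiser

# Load the Boston house prices data into a dataframe
boston_dataset = load_boston()
data = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)

# Interval limits for LSTAT; np.inf leaves the last interval open-ended
user_dict = {'LSTAT': [0, 10, 20, 30, np.inf]}

# Set up the discretiser to return the intervals as integers
transformer = UserInputDiscretiser(
    binning_dict=user_dict, return_object=False, return_boundaries=False)
X = transformer.fit_transform(data)

# Inspect the discretised variable
X['LSTAT'].head()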
0    0
1    0
2    0
3    0
4    0
Name: LSTAT, dtype: int64
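
To make the integer codes above concrete, the following is a minimal standard-library sketch of the interval lookup. It is not the library's implementation. It assumes right-closed intervals, that is (0, 10], (10, 20] and so on, which is an assumption about how the limits are interpreted. The helper `assign_bin` and the sample values are hypothetical and introduced only for illustration.

```python
from bisect import bisect_left
import math

# Same limits as in user_dict['LSTAT'] in the example above.
limits = [0, 10, 20, 30, math.inf]


def assign_bin(value, limits):
    """Return i such that limits[i] < value <= limits[i + 1].

    Hypothetical helper: returns None when the value lies outside all intervals.
    """
    idx = bisect_left(limits, value) - 1
    if idx < 0 or idx >= len(limits) - 1:
        return None
    return idx


sample = [4.98, 9.14, 10.0, 17.5, 34.0]  # illustrative values only
bins = [assign_bin(v, limits) for v in sample]
print(bins)
```

Under this assumption a value equal to a limit, such as 10.0, belongs to the interval that ends at that limit, and a value at or below the first limit is not assigned to any interval.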

API Reference

class feature_engine.discretisers.UserInputDiscretiser(binning_dict, return_object=False, return_boundaries=False)

The UserInputDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are arbitrarily defined by the user.

The user needs to enter a dictionary with variable names as keys, and a list of the limits of the intervals as values. For example {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.

The UserInputDiscretiser() works only with numerical variables. The discretiser will check if the dictionary entered by the user contains variables present in the training set, and if these variables are cast as numerical, before doing any transformation.
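
The kind of check described above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper `check_binning_dict`, the exception types and the toy dataframe are hypothetical, and only pandas' `is_numeric_dtype` is used to decide what counts as numerical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_binning_dict(X, binning_dict):
    """Hypothetical helper: raise if a key is missing from X or is not numerical."""
    missing = [var for var in binning_dict if var not in X.columns]
    if missing:
        raise KeyError(f"Variables not found in the dataframe: {missing}")
    non_numeric = [var for var in binning_dict if not is_numeric_dtype(X[var])]
    if non_numeric:
        raise TypeError(f"Variables are not numerical: {non_numeric}")


# Toy data: var1 is numerical, var2 is not.
df = pd.DataFrame({"var1": [1.0, 50.0, 500.0], "var2": ["a", "b", "c"]})

# Passes silently because var1 exists and is numerical.
check_binning_dict(df, {"var1": [0, 10, 100, 1000]})
```

Running the check before any transformation means a misspelt or non-numerical variable is reported up front rather than part-way through the binning.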

The discretiser then transforms the variables, that is, it sorts their values into the intervals.

Parameters
  • binning_dict (dict) – The dictionary with the variable : interval limits pairs, provided by the user. A valid dictionary looks like this: {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.

  • return_object (bool, default=False) – Whether the numbers in the discrete variable should be returned as numeric or as object. The user decides based on whether they would like to continue engineering the variable as if it were numerical or categorical.

  • return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
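
The three output styles that these parameters describe can be approximated with plain pandas. The sketch below uses `pd.cut` as a stand-in; whether the discretiser produces exactly these values and dtypes is an assumption, and the series and variable names are illustrative only.

```python
import pandas as pd

values = pd.Series([5, 15, 500], name="var1")  # illustrative data
limits = [0, 10, 100, 1000]  # limits for 'var1' from the example dictionary

# Integer interval codes, analogous to return_boundaries=False.
codes = pd.cut(values, bins=limits, labels=False)

# The same codes cast to object, analogous to return_object=True.
codes_as_object = codes.astype("O")

# Interval boundaries instead of integers, analogous to return_boundaries=True.
boundaries = pd.cut(values, bins=limits)

print(codes.tolist())
print(boundaries.astype(str).tolist())
```

Casting the codes to object is useful when downstream steps, such as categorical encoders, select variables by dtype.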

fit(X, y=None)

Checks that the user-entered variables are in the train set and cast as numerical.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be transformed.

  • y (None) – y is not needed in this discretiser. You can pass y or None.

binner_dict_

The dictionary containing the {variable: interval limits} pairs used to sort the values into discrete intervals.

Type

dictionary

transform(X)

Sorts the variable values into the intervals.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Returns

X_transformed – The transformed data with the discrete variables.

Return type

pandas dataframe of shape = [n_samples, n_features]