MathematicalCombination

API Reference

class feature_engine.creation.MathematicalCombination(variables_to_combine, math_operations=None, new_variables_names=None, missing_values='raise')[source]

MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, or takes the average, maximum, minimum or standard deviation of a group of variables, and stores the result in new variables.

For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:

transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)

The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.

Note that if some of the variables to combine have missing data and missing_values = 'ignore', the missing values are skipped in the computation. For example, if variables A, B and C have values 10, 20 and NA, and we perform the sum, the result will be A + B = 30.
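The A, B, C case can be reproduced with plain pandas, whose row-wise reductions skip missing values by default. This is a minimal sketch of the arithmetic only. It does not call the transformer, and it assumes that the transformer's 'ignore' mode behaves like pandas' default skipna=True.

```python
import numpy as np
import pandas as pd

# One row where C is missing, as in the A, B, C example.
df = pd.DataFrame({"A": [10.0], "B": [20.0], "C": [np.nan]})

# pandas row-wise reductions skip NaN by default (skipna=True).
row_sum = df[["A", "B", "C"]].sum(axis=1)
row_mean = df[["A", "B", "C"]].mean(axis=1)

print(row_sum.iloc[0])   # 30.0
print(row_mean.iloc[0])  # 15.0
```

Note that in pandas the mean is taken over the two available values only, so it is 15 rather than 10.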

Parameters
variables_to_combine: list

The list of numerical variables to be combined.

math_operations: list, default=None

The list of basic math operations to be used to create the new features.

If None, all of ['sum', 'prod', 'mean', 'std', 'max', 'min'] will be performed over the variables_to_combine. Alternatively, you can enter the list of operations to carry out.

Each operation should be a string and must be one of the elements in ['sum', 'prod', 'mean', 'std', 'max', 'min'].

Each operation will result in a new variable that will be added to the transformed dataset.
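As a rough illustration of what each operation computes, the sketch below applies the six pandas reductions of the same names row-wise to a small made-up dataframe. It uses pandas directly rather than the transformer. The column names a, b, c and the _abc suffix are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical data: two rows, three numerical variables.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [2.0, 4.0], "c": [3.0, 6.0]})
variables = ["a", "b", "c"]
operations = ["sum", "prod", "mean", "std", "max", "min"]

result = df.copy()
for op in operations:
    # Call the pandas reduction of the same name across columns (axis=1).
    result[f"{op}_abc"] = getattr(df[variables], op)(axis=1)

print(result.filter(like="_abc"))
```

Each operation adds exactly one column, so the result has the three original columns plus six new ones. In pandas, std is the sample standard deviation (ddof=1).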

new_variables_names: list, default=None

Names of the newly created variables. You can enter a list of names for the newly created features (recommended). You must enter one name for each mathematical operation indicated in the math_operations parameter. That is, if you want to perform the mean and sum of features, you should enter 2 new variable names. If you perform only the mean of features, enter 1 variable name. Alternatively, if you choose to perform all mathematical operations, enter 6 new variable names.

The order of the names should match the order in which the mathematical operations are listed in the transformer. That is, if you set math_operations = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to their product.

If new_variables_names = None, the transformer will automatically assign a name to each newly created feature, starting with the name of the mathematical operation, followed by the variables combined separated by -.
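The pairing of names with operations can be sketched in plain Python. The helper build_combination_dict below is hypothetical and is not part of feature_engine. The exact default name format, in particular the parentheses around the joined variables, is an assumption based on the description above.

```python
def build_combination_dict(variables, operations, new_names=None):
    """Hypothetical helper: map each new variable name to its operation."""
    if new_names is not None:
        if len(new_names) != len(operations):
            raise ValueError("Enter one name per mathematical operation.")
        # Names are paired with operations by position.
        return dict(zip(new_names, operations))
    # Assumed default format: operation name, then the variables joined by "-".
    return {f"{op}({'-'.join(variables)})": op for op in operations}


named = build_combination_dict(["x1", "x2"], ["mean", "prod"], ["x_mean", "x_prod"])
default = build_combination_dict(["x1", "x2"], ["sum"])

print(named)    # {'x_mean': 'mean', 'x_prod': 'prod'}
print(default)  # {'sum(x1-x2)': 'sum'}
```

Because the pairing is positional, swapping the order of the names without swapping the operations would attach the wrong label to each new feature.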

missing_values: string, default='raise'

Indicates whether missing values should be ignored or should raise an error. If 'raise', the transformer will return an error if the datasets to fit or transform contain missing values. If 'ignore', missing data will be ignored when performing the calculations.
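The two options can be pictured with a small pandas check. The function check_missing below is a hypothetical stand-in written for this illustration, not the library's own code, and its error messages are invented.

```python
import numpy as np
import pandas as pd


def check_missing(X, variables, missing_values="raise"):
    """Hypothetical helper mirroring the documented 'raise' / 'ignore' options."""
    if missing_values not in ("raise", "ignore"):
        raise ValueError("missing_values takes only values 'raise' or 'ignore'")
    if missing_values == "raise" and X[variables].isnull().any().any():
        raise ValueError("Some of the variables to combine contain missing values.")


df = pd.DataFrame({"A": [10.0, 1.0], "B": [20.0, np.nan]})

# With 'ignore', the missing value in B is tolerated.
check_missing(df, ["A", "B"], missing_values="ignore")

# With 'raise', the same data triggers a ValueError.
try:
    check_missing(df, ["A", "B"], missing_values="raise")
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```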

Attributes

combination_dict_:

Dictionary mapping each new variable name to the mathematical operation that creates it.

math_operations_:

List with the mathematical operations to be applied to the variables_to_combine.

n_features_in_:

The number of features in the train set used in fit.

Notes

Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:

  • Sum debt across financial products, i.e., credit cards, to obtain the total debt.

  • Take the average payments to various financial products per month.

  • Find the minimum payment made in any one month.

In insurance, we can sum the damage to various parts of a car to obtain the total damage.

Methods

fit:

This transformer does not learn parameters.

transform:

Combine the variables with the mathematical operations.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)[source]

This transformer does not learn parameters.

Performs dataframe checks and creates the dictionary that pairs each operation with its new feature name.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The training input samples. Can be the entire dataframe, not just the variables to transform.

y: pandas Series, or np.array. Defaults to None.

It is not needed in this transformer. You can pass y or None.

Returns
self
Raises
TypeError
  • If the input is not a Pandas DataFrame

  • If any user provided variables in variables_to_combine are not numerical

ValueError

If the variable(s) contain null values when missing_values = 'raise'

transform(X)[source]

Combine the variables with the mathematical operations.

Parameters
X: pandas dataframe of shape = [n_samples, n_features]

The data to transform.

Returns
X: Pandas dataframe, shape = [n_samples, n_features + n_operations]

The dataframe with the original variables plus the new variables.

Raises
TypeError

If the input is not a Pandas DataFrame

ValueError
  • If the variable(s) contain null values when missing_values = 'raise'

  • If the dataframe is not of the same size as that used in fit()
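The size check can be pictured as a comparison against the number of columns recorded during fit. The sketch below assumes that "same size" refers to the number of columns, which is consistent with the n_features_in_ attribute. The function check_n_features is hypothetical.

```python
import pandas as pd


def check_n_features(X, n_features_in):
    """Hypothetical check: same number of columns as seen in fit()."""
    if X.shape[1] != n_features_in:
        raise ValueError(
            f"X has {X.shape[1]} columns, but fit() saw {n_features_in}."
        )


train = pd.DataFrame({"a": [1], "b": [2]})
n_features_in_ = train.shape[1]  # recorded at fit time

# Same number of columns: passes silently.
check_n_features(train, n_features_in_)

# One column dropped: the check raises a ValueError.
try:
    check_n_features(train.drop(columns="b"), n_features_in_)
    size_error = False
except ValueError:
    size_error = True

print(size_error)  # True
```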

Example

In this example, we sum 2 variables from the house prices dataset.

import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.creation import MathematicalCombination

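# Load the data and replace missing values with 0, so that the
# default missing_values='raise' does not trigger an error.
data = pd.read_csv('houseprice.csv').fillna(0)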

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

math_combinator = MathematicalCombination(
    variables_to_combine=['LotFrontage', 'LotArea'],
    math_operations=['sum'],
    new_variables_names=['LotTotal']
)

math_combinator.fit(X_train, y_train)
X_train_ = math_combinator.transform(X_train)

print(math_combinator.combination_dict_)

{'LotTotal': 'sum'}

print(X_train_.loc[:, ['LotFrontage', 'LotArea', 'LotTotal']].head())

      LotFrontage  LotArea  LotTotal
64            0.0     9375    9375.0
682           0.0     2887    2887.0
960          50.0     7207    7257.0
1384         60.0     9060    9120.0
1100         60.0     8400    8460.0
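As a quick sanity check, the five printed rows can be re-entered by hand and the sum recomputed with pandas alone. This sketch does not need houseprice.csv or the transformer. It only confirms that LotTotal equals LotFrontage + LotArea for the rows shown.

```python
import pandas as pd

# The five rows printed above, re-entered by hand.
rows = pd.DataFrame(
    {
        "LotFrontage": [0.0, 0.0, 50.0, 60.0, 60.0],
        "LotArea": [9375, 2887, 7207, 9060, 8400],
        "LotTotal": [9375.0, 2887.0, 7257.0, 9120.0, 8460.0],
    },
    index=[64, 682, 960, 1384, 1100],
)

# Recompute the row-wise sum and compare it with the LotTotal column.
recomputed = rows[["LotFrontage", "LotArea"]].sum(axis=1)
print((recomputed == rows["LotTotal"]).all())  # True
```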