MathematicalCombination() applies basic mathematical operations to multiple
features, returning one or more additional features as a result. That is, it computes
the sum, product, mean, maximum, minimum or standard deviation of a group of
variables and stores each result in a new variable.
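For example, if we have the variables:
- number_payments_first_quarter
- number_payments_second_quarter
- number_payments_third_quarter
- number_payments_fourth_quarter
we can use MathematicalCombination() to calculate the total number of payments
and the mean number of payments as follows: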
    transformer = MathematicalCombination(
        variables_to_combine=[
            'number_payments_first_quarter',
            'number_payments_second_quarter',
            'number_payments_third_quarter',
            'number_payments_fourth_quarter'
        ],
        math_operations=[
            'sum',
            'mean'
        ],
        new_variables_names=[
            'total_number_payments',
            'mean_number_payments'
        ]
    )

    Xt = transformer.fit_transform(X)
The transformed dataset, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
The variable total_number_payments is obtained by adding up the features listed in
variables_to_combine, whereas the variable mean_number_payments is
the mean of those 4 features.
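To make the arithmetic concrete, the sketch below reproduces the two combinations with plain pandas on a small made-up dataset. It is not the transformer's implementation: the data values are hypothetical, and only the column names are taken from the example above.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
X = pd.DataFrame({
    "number_payments_first_quarter": [1, 0, 3],
    "number_payments_second_quarter": [2, 1, 3],
    "number_payments_third_quarter": [0, 1, 2],
    "number_payments_fourth_quarter": [1, 2, 4],
})

variables_to_combine = list(X.columns)

# Row-wise sum and mean over the four quarterly columns,
# appended to a copy so the original variables are kept.
Xt = X.copy()
Xt["total_number_payments"] = X[variables_to_combine].sum(axis=1)
Xt["mean_number_payments"] = X[variables_to_combine].mean(axis=1)

print(Xt[["total_number_payments", "mean_number_payments"]])
```

Each new column is computed row by row (axis=1), so every observation gets its own total and mean.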
Below we show another example using the House Prices Dataset. In this example, we sum 2 variables, 'LotFrontage' and 'LotArea', to obtain 'LotTotal'.
    import pandas as pd
    from sklearn.model_selection import train_test_split

    from feature_engine.creation import MathematicalCombination

    # Load the data and replace missing values with 0
    data = pd.read_csv('houseprice.csv').fillna(0)

    # Separate the data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1),
        data['SalePrice'],
        test_size=0.3,
        random_state=0
    )

    # Set up the transformer to sum the two variables
    math_combinator = MathematicalCombination(
        variables_to_combine=['LotFrontage', 'LotArea'],
        math_operations=['sum'],
        new_variables_names=['LotTotal']
    )

    math_combinator.fit(X_train, y_train)
    X_train_ = math_combinator.transform(X_train)
In the attribute
combination_dict_ the transformer stores the variable name and the
operation used to obtain that variable. This way, we can easily identify which variable
is the result of which transformation.
We can see that the transformed dataset contains the additional variable:
    print(X_train_.loc[:, ['LotFrontage', 'LotArea', 'LotTotal']].head())

          LotFrontage  LotArea  LotTotal
    64            0.0     9375    9375.0
    682           0.0     2887    2887.0
    960          50.0     7207    7257.0
    1384         60.0     9060    9120.0
    1100         60.0     8400    8460.0
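As a quick sanity check, the five rows printed above can be rebuilt by hand and summed with plain pandas. The sketch also shows one plausible shape for the bookkeeping held in combination_dict_: a dictionary from new variable name to operation. That shape is an assumption based on the description above, not something read from the transformer itself.

```python
import pandas as pd

# The five rows from the printed output, keyed by their original index.
rows = pd.DataFrame(
    {
        "LotFrontage": [0.0, 0.0, 50.0, 60.0, 60.0],
        "LotArea": [9375, 2887, 7207, 9060, 8400],
    },
    index=[64, 682, 960, 1384, 1100],
)

# Row-wise sum of the two variables, as in the example.
rows["LotTotal"] = rows[["LotFrontage", "LotArea"]].sum(axis=1)

# Assumed layout of the mapping: new variable name -> operation applied.
combination_dict = dict(zip(["LotTotal"], ["sum"]))

print(rows)
print(combination_dict)
```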
Even though the transformer can combine variables automatically, it was originally
designed to combine variables based on domain knowledge. In this case, we normally
want to give meaningful names to the new variables. We can do so through the parameter
new_variables_names, which takes a list of strings with the new variable names.
Entering names for the newly created features is recommended. You must enter one name
for each mathematical operation indicated in the math_operations parameter. That is,
if you want to compute the mean and the sum of the features, you should enter 2 new
variable names. If you compute only the mean of the features, enter 1 variable name.
Alternatively, if you choose to perform all the mathematical operations, enter 6 new
variable names.
The order of the new variable names should match the order in which the mathematical operations are passed to the transformer. That is, if you set math_operations = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
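The two naming rules (one name per operation, paired by position) can be sketched with a small pandas helper. The function combine and the names a, b, ab_mean and ab_prod are hypothetical and exist only for this illustration; this is not how the transformer is implemented.

```python
import pandas as pd


def combine(df, variables, math_operations, new_variables_names):
    """Hypothetical helper: add one new column per operation, in order."""
    if len(new_variables_names) != len(math_operations):
        raise ValueError("Provide exactly one new variable name per operation.")
    out = df.copy()
    # zip pairs the i-th name with the i-th operation.
    for name, operation in zip(new_variables_names, math_operations):
        out[name] = df[variables].agg(operation, axis=1)
    return out


df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})

# 'ab_mean' receives the mean, 'ab_prod' receives the product.
out = combine(df, ["a", "b"], ["mean", "prod"], ["ab_mean", "ab_prod"])
print(out)
```

Swapping the two names without swapping the operations would silently label the product as the mean, which is why the order matters.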