MathematicalCombination
API Reference
- class feature_engine.creation.MathematicalCombination(variables_to_combine, math_operations=None, new_variables_names=None, missing_values='raise')
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard deviation of a group of variables, and returns the result into new variables.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:
transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
Attention: if some of the variables to combine have missing data and missing_values = 'ignore', the missing values will be ignored in the computation. To be clear, if variables A, B and C have values 10, 20 and NA, and we perform the sum, the result will be A + B = 30.
- Parameters
- variables_to_combine: list
The list of numerical variables to be combined.
- math_operations: list, default=None
The list of basic math operations to be used to create the new features.
If None, all of ['sum', 'prod', 'mean', 'std', 'max', 'min'] will be performed over the variables_to_combine. Alternatively, you can enter the list of operations to carry out. Each operation should be a string and must be one of the elements in ['sum', 'prod', 'mean', 'std', 'max', 'min'].
Each operation will result in a new variable that will be added to the transformed dataset.
- new_variables_names: list, default=None
Names of the newly created variables. You can enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you choose to perform all mathematical transformations, enter 6 new variable names.
The order of the names should match the order in which the mathematical operations are listed in the transformer. That is, if you set math_operations = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
If new_variables_names = None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the variables combined separated by -.
- missing_values: string, default='raise'
Indicates whether missing values should be ignored or raise an error. If 'raise', the transformer will return an error if the datasets to fit or transform contain missing values. If 'ignore', missing data will be ignored when performing the calculations.
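The 'ignore' behaviour can be illustrated with plain pandas, whose row-wise reductions skip NaN by default. This is a minimal sketch of the semantics described above, not the transformer's internals; that the transformer relies on this pandas default is an assumption. The columns A, B and C follow the example given earlier.

```python
import numpy as np
import pandas as pd

# One row where C is missing, mirroring the A, B, C example.
df = pd.DataFrame({"A": [10.0], "B": [20.0], "C": [np.nan]})

# Row-wise reductions in pandas skip NaN by default (skipna=True),
# so the missing value in C does not contribute to the result.
row_sum = df[["A", "B", "C"]].sum(axis=1)
row_mean = df[["A", "B", "C"]].mean(axis=1)

print(row_sum.iloc[0])   # 30.0
print(row_mean.iloc[0])  # 15.0
```

Note that the mean is taken over the two observed values only, so ignoring missing data changes the denominator as well as the numerator.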
Attributes
combination_dict_:
Dictionary containing the mathematical operation to new variable name pairs.
math_operations_:
List with the mathematical operations to be applied to the variables_to_combine.
n_features_in_:
The number of features in the train set used in fit.
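As a rough illustration of what combination_dict_ holds, the sketch below pairs each new variable name with its operation by position. The helper build_combination_dict is hypothetical, and the default-name format operation(var1-var2) is an assumption based on the description of new_variables_names; the library's exact format may differ.

```python
def build_combination_dict(variables_to_combine, math_operations,
                           new_variables_names=None):
    """Hypothetical helper: map each new variable name to its operation."""
    if new_variables_names is None:
        # Assumed default: operation name, then the variables joined by "-".
        joined = "-".join(variables_to_combine)
        new_variables_names = [f"{op}({joined})" for op in math_operations]
    if len(new_variables_names) != len(math_operations):
        raise ValueError("Provide one new variable name per math operation.")
    # Names are matched to operations by position.
    return dict(zip(new_variables_names, math_operations))


named = build_combination_dict(["LotFrontage", "LotArea"], ["sum"], ["LotTotal"])
default = build_combination_dict(["LotFrontage", "LotArea"], ["mean", "prod"])
print(named)    # {'LotTotal': 'sum'}
print(default)
```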
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Sum debt across financial products, i.e., credit cards, to obtain the total debt.
Take the average payments to various financial products per month.
Find the minimum payment made in any one month.
In insurance, we can sum the damage to various parts of a car to obtain the total damage.
Methods
fit:
This transformer does not learn parameters.
transform:
Combine the variables with the mathematical operations.
fit_transform:
Fit to the data, then transform it.
- fit(X, y=None)
This transformer does not learn parameters.
Performs dataframe checks and creates a dictionary of operation to new feature name pairs.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The training input samples. Can be the entire dataframe, not just the variables to transform.
- y: pandas Series, or np.array. Defaults to None.
It is not needed in this transformer. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any user provided variables in variables_to_combine are not numerical
- ValueError
If the variable(s) contain null values when missing_values = 'raise'
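The checks listed under Raises could look roughly like the following. This is a sketch under stated assumptions: check_inputs is a hypothetical helper written for illustration, not the library's own validation code, and the error messages are invented.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_inputs(X, variables_to_combine, missing_values="raise"):
    """Hypothetical helper mirroring the checks listed under Raises."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame.")
    non_numeric = [v for v in variables_to_combine if not is_numeric_dtype(X[v])]
    if non_numeric:
        raise TypeError(f"Variables are not numerical: {non_numeric}")
    if missing_values == "raise" and X[variables_to_combine].isnull().any().any():
        raise ValueError("Variables to combine contain missing values.")
    return X


df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.0], "C": ["x", "y"]})
try:
    check_inputs(df, ["A", "B"])
except ValueError as err:
    print(err)  # Variables to combine contain missing values.
```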
- transform(X)
Combine the variables with the mathematical operations.
- Parameters
- X: pandas dataframe of shape = [n_samples, n_features]
The data to transform.
- Returns
- X: Pandas dataframe, shape = [n_samples, n_features + n_operations]
The dataframe with the original variables plus the new variables.
- rtype
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the variable(s) contain null values when missing_values = 'raise'
If the dataframe is not of the same size as that used in fit()
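To make the returned shape [n_samples, n_features + n_operations] concrete, here is a minimal pandas sketch of the combination step. The helper combine and the column names are hypothetical; whether the transformer copies X or computes the columns this way is an assumption.

```python
import pandas as pd


def combine(X, variables_to_combine, combination_dict):
    """Hypothetical helper: add one column per operation to a copy of X."""
    X = X.copy()
    for new_name, operation in combination_dict.items():
        # Each allowed operation name (sum, prod, mean, std, max, min)
        # is also the name of a row-wise DataFrame reduction.
        X[new_name] = getattr(X[variables_to_combine], operation)(axis=1)
    return X


df = pd.DataFrame({"q1": [1, 2], "q2": [3, 4], "other": ["a", "b"]})
Xt = combine(df, ["q1", "q2"], {"total": "sum", "largest": "max"})

print(df.shape)  # (2, 3)
print(Xt.shape)  # (2, 5): 3 original features + 2 operations
```

The original columns, including those not listed in variables_to_combine, are carried through unchanged.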
Example
In this example, we sum 2 variables from the house prices dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import MathematicalCombination
data = pd.read_csv('houseprice.csv').fillna(0)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)
math_combinator = MathematicalCombination(
    variables_to_combine=['LotFrontage', 'LotArea'],
    math_operations=['sum'],
    new_variables_names=['LotTotal']
)
math_combinator.fit(X_train, y_train)
X_train_ = math_combinator.transform(X_train)
print(math_combinator.combination_dict_)
{'LotTotal': 'sum'}
print(X_train_.loc[:,['LotFrontage', 'LotArea', 'LotTotal']].head())
      LotFrontage  LotArea  LotTotal
64            0.0     9375    9375.0
682           0.0     2887    2887.0
960          50.0     7207    7257.0
1384         60.0     9060    9120.0
1100         60.0     8400    8460.0