MathematicalCombination
API Reference
class feature_engine.creation.MathematicalCombination(variables_to_combine, math_operations=None, new_variables_names=None)
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, or takes the average, maximum, minimum or standard deviation of a group of variables and returns the results as new variables.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:
from feature_engine.creation import MathematicalCombination

transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    # one name per operation, in the same order as math_operations
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
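To check what these two new features should contain, the same row-wise sum and mean can be reproduced with plain pandas. This is a sketch on made-up data, not the transformer's implementation; only the column names come from the example above.

```python
import pandas as pd

# Made-up data for illustration; column names follow the example above.
X = pd.DataFrame({
    "number_payments_first_quarter": [1, 0, 3],
    "number_payments_second_quarter": [2, 1, 3],
    "number_payments_third_quarter": [0, 1, 3],
    "number_payments_fourth_quarter": [1, 2, 3],
})
quarters = list(X.columns)

# Row-wise sum and mean across the four quarterly columns.
Xt = X.copy()
Xt["total_number_payments"] = X[quarters].sum(axis=1)
Xt["mean_number_payments"] = X[quarters].mean(axis=1)

print(Xt[["total_number_payments", "mean_number_payments"]])
```

The original four columns remain in Xt, and the two new columns are appended to them.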
- Parameters
- variables_to_combine : list
The list of numerical variables to be combined.
- math_operations : list, default=None
The list of basic math operations to be used to create the new features.
If None, all of ['sum', 'prod', 'mean', 'std', 'max', 'min'] will be performed over the variables_to_combine. Alternatively, the user can enter the list of operations to carry out. Each operation should be a string and must be one of the elements from the list: ['sum', 'prod', 'mean', 'std', 'max', 'min']
Each operation will result in a new variable that will be added to the transformed dataset.
- new_variables_names : list, default=None
Names of the newly created variables. The user can enter a name or a list of names for the newly created features (recommended). The user must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you choose to perform all mathematical transformations, enter 6 new variable names.
The order of the names should match the order in which the mathematical operations are initialised in the transformer. That is, if you set math_operations = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
If new_variables_names = None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the combined variables separated by -.
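The rules for math_operations and new_variables_names can be summarised in a small pure-Python sketch. The helper build_combination_dict is hypothetical and is not part of feature_engine. The exception types and the exact default name format are assumptions, since the text above only says that default names start with the operation and contain the variables joined by -.

```python
ALLOWED_OPERATIONS = ["sum", "prod", "mean", "std", "max", "min"]


def build_combination_dict(variables_to_combine, math_operations=None,
                           new_variables_names=None):
    """Hypothetical helper: map each new variable name to its operation."""
    # None means "apply every allowed operation".
    if math_operations is None:
        operations = list(ALLOWED_OPERATIONS)
    else:
        operations = list(math_operations)

    for op in operations:
        if op not in ALLOWED_OPERATIONS:
            # The exception type is an assumption made for this sketch.
            raise ValueError(f"Unsupported operation: {op!r}")

    if new_variables_names is None:
        # Assumed default format: operation name, then variables joined by "-".
        names = [f"{op}({'-'.join(variables_to_combine)})" for op in operations]
    else:
        names = list(new_variables_names)
        if len(names) != len(operations):
            raise ValueError("Enter one new variable name per operation.")

    # Names are matched to operations by position.
    return dict(zip(names, operations))


print(build_combination_dict(["a", "b"], ["mean", "prod"], ["ab_mean", "ab_prod"]))
```

Because names are matched to operations by position, reordering math_operations without reordering the names silently relabels the new features.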
Attributes
combination_dict_ :
Dictionary mapping each new variable name to the mathematical operation that creates it.
math_operations_ :
List with the mathematical operations to be applied to the variables_to_combine.
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Sum debt across financial products, e.g., credit cards, to obtain the total debt.
Take the average payments to various financial products per month.
Find the minimum payment made in any one month.
In insurance, we can sum the damage to various parts of a car to obtain the total damage.
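The financial examples above correspond to row-wise aggregations. The following pandas sketch illustrates them on made-up data. All column names and values are hypothetical, and the code shows the equivalent computation rather than the transformer itself.

```python
import pandas as pd

# Hypothetical debt per financial product, one row per customer.
debt = pd.DataFrame({
    "credit_card_debt": [500.0, 0.0],
    "loan_debt": [1500.0, 200.0],
})
# Total debt across products.
total_debt = debt.sum(axis=1)

# Hypothetical payments per month, one row per customer.
payments = pd.DataFrame({
    "payment_jan": [100.0, 50.0],
    "payment_feb": [200.0, 50.0],
    "payment_mar": [150.0, 20.0],
})
# Average monthly payment and smallest payment in any one month.
mean_payment = payments.mean(axis=1)
min_payment = payments.min(axis=1)

print(total_debt.tolist(), mean_payment.tolist(), min_payment.tolist())
```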
Methods
fit:
This transformer does not learn parameters.
transform:
Combine the variables with the mathematical operations.
fit_transform:
Fit to the data, then transform it.
fit(X, y=None)
This transformer does not learn parameters.
Performs dataframe checks and creates the dictionary of operation to new feature name pairs.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The training input samples. Can be the entire dataframe, not just the variables to transform.
- y : pandas Series, or np.array. Defaults to None.
It is not needed in this transformer. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame
If any user provided variables in variables_to_combine are not numerical
- ValueError
If the variable(s) contain null values
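The checks listed under Raises can be pictured with the following sketch. The function check_input is a hypothetical stand-in written for illustration; it is not the library's internal code, and the error messages are invented.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def check_input(X, variables_to_combine):
    """Hypothetical stand-in for the checks described for fit()."""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("X must be a pandas DataFrame.")

    # Every variable to combine must have a numeric dtype.
    non_numeric = [v for v in variables_to_combine if not is_numeric_dtype(X[v])]
    if non_numeric:
        raise TypeError(f"Variables are not numerical: {non_numeric}")

    # Null values in the variables to combine are rejected.
    if X[variables_to_combine].isnull().any().any():
        raise ValueError("Variables to combine contain null values.")
    return X


X = pd.DataFrame({"a": [1.0, None], "b": [3.0, 4.0]})
try:
    check_input(X, ["a", "b"])
except ValueError as err:
    print(err)
```

Imputing or dropping missing values before fitting, as the example at the end of this page does with fillna(0), avoids the ValueError.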
transform(X)
Combine the variables with the mathematical operations.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The data to transform.
- Returns
- X : pandas dataframe, shape = [n_samples, n_features + n_operations]
The dataframe with the original variables plus the new variables.
- rtype
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame
- ValueError
If the variable(s) contain null values
If the dataframe is not of the same size as that used in fit()
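The returned shape, [n_samples, n_features + n_operations], follows from appending one column per operation. A minimal pandas sketch of that behaviour is shown here. The function combine and the column names are hypothetical, and the code only approximates what transform() is described as doing.

```python
import pandas as pd


def combine(X, variables_to_combine, combination_dict):
    """Hypothetical sketch: append one column per name -> operation pair."""
    Xt = X.copy()  # leave the caller's dataframe untouched
    for name, operation in combination_dict.items():
        # e.g. operation == "sum" calls DataFrame.sum(axis=1), a row-wise sum
        Xt[name] = getattr(X[variables_to_combine], operation)(axis=1)
    return Xt


X = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 5.0], "c": [0.0, 1.0]})
Xt = combine(X, ["a", "b"], {"ab_sum": "sum", "ab_max": "max"})

print(Xt.shape)  # 3 original columns + 2 operations -> (2, 5)
```

Column c is not combined but is still carried through to the output, which matches the description that the original variables are returned alongside the new ones.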
Example
In this example, we sum 2 variables from the house prices dataset.
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.creation import MathematicalCombination

# Load the data and replace missing values with 0
data = pd.read_csv('houseprice.csv').fillna(0)

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# Set up the transformer to sum the two variables
math_combinator = MathematicalCombination(
    variables_to_combine=['LotFrontage', 'LotArea'],
    math_operations=['sum'],
    new_variables_names=['LotTotal']
)

math_combinator.fit(X_train, y_train)
X_train_ = math_combinator.transform(X_train)

print(math_combinator.combination_dict_)

{'LotTotal': 'sum'}

print(X_train_.loc[:, ['LotFrontage', 'LotArea', 'LotTotal']].head())

      LotFrontage  LotArea  LotTotal
64            0.0     9375    9375.0
682           0.0     2887    2887.0
960          50.0     7207    7257.0
1384         60.0     9060    9120.0
1100         60.0     8400    8460.0