CombineWithReferenceFeature
DEPRECATED: CombineWithReferenceFeature() is deprecated in version 1.3 and will be removed in version 1.4. Use RelativeFeatures() instead.
CombineWithReferenceFeature() combines a group of variables with a group of
reference variables using basic mathematical operations (subtraction, division,
addition and multiplication). It returns one or more additional features in the
dataframe as a result of these operations.
In other words, CombineWithReferenceFeature() adds, multiplies, subtracts or
divides a group of features (indicated in variables_to_combine) with or by a group of
reference variables (indicated in reference_variables), and returns the
result as new variables in the dataframe.
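As a rough illustration of the arithmetic described above, the following sketch reproduces the four operations with plain pandas on a made-up dataframe. The column names, the values and the naming scheme for the new columns are all hypothetical, and this is not the transformer's actual implementation.

```python
import pandas as pd

# Hypothetical toy data: two variables to combine and one reference variable.
X = pd.DataFrame({
    "var_a": [10.0, 20.0, 30.0],
    "var_b": [1.0, 2.0, 3.0],
    "ref": [2.0, 4.0, 5.0],
})

variables_to_combine = ["var_a", "var_b"]
reference = "ref"

# pandas methods that apply each operation; axis=0 aligns the reference
# column with the rows of the variables to combine.
operations = {
    "sub": pd.DataFrame.sub,
    "div": pd.DataFrame.div,
    "add": pd.DataFrame.add,
    "mul": pd.DataFrame.mul,
}

Xt = X.copy()
for name, op in operations.items():
    result = op(X[variables_to_combine], X[reference], axis=0)
    # Naming scheme chosen for this sketch only (an assumption).
    result.columns = [f"{var}_{name}_{reference}" for var in variables_to_combine]
    Xt = pd.concat([Xt, result], axis=1)

print(Xt.columns.tolist())
```

With 4 operations, 2 variables and 1 reference variable, the sketch appends 8 new columns to the 3 original ones.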
For example, if we have the variables:
number_payments_first_quarter,
number_payments_second_quarter,
number_payments_third_quarter,
number_payments_fourth_quarter, and
total_payments,
we can use CombineWithReferenceFeature() to determine the percentage of
payments per quarter as follows:
from feature_engine.creation import CombineWithReferenceFeature

# X is assumed to be a pandas dataframe containing the variables listed above.
transformer = CombineWithReferenceFeature(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter',
    ],
    reference_variables=['total_payments'],
    operations=['div'],
    # one name per new feature, in the order of variables_to_combine
    new_variables_names=[
        'perc_payments_first_quarter',
        'perc_payments_second_quarter',
        'perc_payments_third_quarter',
        'perc_payments_fourth_quarter',
    ]
)
Xt = transformer.fit_transform(X)
The preceding code block returns a new dataframe, Xt, with 4 new variables, those
indicated in new_variables_names, each calculated by dividing one of
the variables in variables_to_combine by total_payments.
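To make the result concrete, here is a minimal pandas-only sketch of the same division on invented payment counts. The data are hypothetical, and the code only mirrors the arithmetic; it does not call the transformer.

```python
import pandas as pd

quarters = [
    "number_payments_first_quarter",
    "number_payments_second_quarter",
    "number_payments_third_quarter",
    "number_payments_fourth_quarter",
]
new_names = [
    "perc_payments_first_quarter",
    "perc_payments_second_quarter",
    "perc_payments_third_quarter",
    "perc_payments_fourth_quarter",
]

# Hypothetical payment counts for three customers.
X = pd.DataFrame(
    [[1, 1, 1, 1], [2, 2, 2, 2], [0, 4, 2, 2]],
    columns=quarters,
)
X["total_payments"] = X[quarters].sum(axis=1)

# Divide every quarterly column by total_payments, row by row.
perc = X[quarters].div(X["total_payments"], axis=0)
perc.columns = new_names

# Append the new columns to the original ones.
Xt = pd.concat([X, perc], axis=1)
print(Xt[new_names])
```

Note that the division yields fractions between 0 and 1; multiply by 100 if true percentages are needed.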
Below we show another example using the House Prices Dataset (more details about the
dataset here). In this example, we subtract LotFrontage from LotArea.
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.creation import CombineWithReferenceFeature

# load the data and replace missing values with 0
data = pd.read_csv('houseprice.csv').fillna(0)

# separate the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0
)

# set up the transformer: LotPartial = LotArea - LotFrontage
combinator = CombineWithReferenceFeature(
    variables_to_combine=['LotArea'],
    reference_variables=['LotFrontage'],
    operations=['sub'],
    new_variables_names=['LotPartial']
)

combinator.fit(X_train, y_train)
X_train = combinator.transform(X_train)
We can see the newly created variable in the following code block:
print(X_train[["LotPartial","LotFrontage","LotArea"]].head())
      LotPartial  LotFrontage  LotArea
64        9375.0          0.0     9375
682       2887.0          0.0     2887
960       7157.0         50.0     7207
1384      9000.0         60.0     9060
1100      8340.0         60.0     8400
new_variables_names
Even though the transformer can combine variables automatically, it was originally
designed to combine variables based on domain knowledge. In this case, we normally want to
give meaningful names to the new variables. We can do so through the parameter
new_variables_names.
new_variables_names takes a list of strings with the new variable names. In this
parameter, you need to enter as many names as there are new features created by the
transformer. The number of new features is the number of operations, times the number
of reference variables, times the number of variables to combine.
Thus, if you want to perform 2 operations, sub and div, combining 4 variables with 2 reference variables, you should enter 2 × 4 × 2 = 16 new variable names.
The order of the names must match the order in which the transformer performs the operations. The transformer first carries out 'sub', then 'div', then 'add' and finally 'mul'.
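The counting rule and the operation order can be sketched in plain Python. The helper expected_name_slots below is hypothetical (it is not part of the library), and the order of the slots within a single operation, references first and then variables, is an assumption of this sketch that should be checked against the transformer's output.

```python
from itertools import product

# Operation order stated in the text above.
OPERATION_ORDER = ["sub", "div", "add", "mul"]


def expected_name_slots(operations, variables_to_combine, reference_variables):
    """Hypothetical helper: one (operation, reference, variable) slot per new feature."""
    # Reorder the requested operations into the documented execution order.
    ordered_ops = [op for op in OPERATION_ORDER if op in operations]
    # Assumption: within each operation, references vary slowest and variables fastest.
    return list(product(ordered_ops, reference_variables, variables_to_combine))


slots = expected_name_slots(
    operations=["div", "sub"],  # 'sub' still comes first in the result
    variables_to_combine=["v1", "v2", "v3", "v4"],
    reference_variables=["r1", "r2"],
)

print(len(slots))  # 2 operations x 2 references x 4 variables = 16
for position, slot in enumerate(slots):
    print(position, slot)
```

The length of the returned list is the number of names that new_variables_names must contain, and all 'sub' slots precede all 'div' slots regardless of the order in which the operations were passed.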
More details
You can find creative ways to use CombineWithReferenceFeature() in the
following Jupyter notebooks and Kaggle kernels.