MathFeatures

MathFeatures() applies basic functions to groups of features, returning one or more additional variables as a result. It uses pandas.agg() to create the features, so in essence, you can pass any function that is accepted by this method. One exception is that MathFeatures() does not accept dictionaries for the parameter func.

The functions can be passed as strings, as NumPy functions (e.g., np.mean), or as any function that you create, as long as it returns a scalar from a vector.

For supported aggregation functions, see pandas documentation.
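
As a minimal sketch of what "returns a scalar from a vector" means, the hypothetical function value_range below receives the values of one row and returns a single number. It is applied here with plain pandas rather than through MathFeatures(), on the assumption that the transformer hands each row to the function in a comparable way; the toy data and the function name are illustrative only.

```python
import pandas as pd

# Toy data; the column names are illustrative only.
df = pd.DataFrame({"Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6]})


def value_range(row):
    """Return one number (max minus min) for a vector of values."""
    return row.max() - row.min()


# axis=1 passes each row to the function as a pandas Series,
# so the result holds one scalar per row.
ranges = df[["Age", "Marks"]].agg(value_range, axis=1)
print(ranges)
```

A function that returned a vector instead of a single value would not fit this pattern, because each row must collapse to one number.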

As an example, if we have the variables:

  • number_payments_first_quarter

  • number_payments_second_quarter

  • number_payments_third_quarter

  • number_payments_fourth_quarter

we can use MathFeatures() to calculate the total number of payments and mean number of payments as follows:

transformer = MathFeatures(
    variables=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    func=['sum','mean'],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)

The transformed dataset, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.

The variable total_number_payments is obtained by adding up the features indicated in variables, whereas the variable mean_number_payments is the mean of those 4 features.
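
To make the arithmetic concrete, here is a small sketch that computes the same two features directly with pandas on made-up payment counts. The data are hypothetical, and the code illustrates the row-wise sum and mean described above rather than the transformer's internals.

```python
import pandas as pd

# Hypothetical payment counts for three customers.
X = pd.DataFrame({
    "number_payments_first_quarter": [1, 0, 4],
    "number_payments_second_quarter": [2, 0, 4],
    "number_payments_third_quarter": [3, 2, 4],
    "number_payments_fourth_quarter": [2, 2, 0],
})

quarters = list(X.columns)

# Row-wise aggregation: one value per customer for each function.
Xt = X.copy()
Xt["total_number_payments"] = X[quarters].sum(axis=1)
Xt["mean_number_payments"] = X[quarters].mean(axis=1)

print(Xt[["total_number_payments", "mean_number_payments"]])
```

For the first customer, the total is 1 + 2 + 3 + 2 = 8 and the mean is 8 / 4 = 2.0.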

Examples

Let’s look at how we can use MathFeatures() in more detail. Let’s first create a toy dataset:

import numpy as np
import pandas as pd
from feature_engine.creation import MathFeatures

df = pd.DataFrame.from_dict(
    {
        "Name": ["tom", "nick", "krish", "jack"],
        "City": ["London", "Manchester", "Liverpool", "Bristol"],
        "Age": [20, 21, 19, 18],
        "Marks": [0.9, 0.8, 0.7, 0.6],
        "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
    })

print(df)

The dataset looks like this:

    Name        City  Age  Marks                 dob
0    tom      London   20    0.9 2020-02-24 00:00:00
1   nick  Manchester   21    0.8 2020-02-24 00:01:00
2  krish   Liverpool   19    0.7 2020-02-24 00:02:00
3   jack     Bristol   18    0.6 2020-02-24 00:03:00

We can now apply several functions over the numerical variables Age and Marks using strings to indicate the functions:

transformer = MathFeatures(
    variables=["Age", "Marks"],
    func = ["sum", "prod", "min", "max", "std"],
)

df_t = transformer.fit_transform(df)

print(df_t)

And we obtain the following dataset, where the new variables are named after the function used to obtain them, plus the group of variables that were used in the computation:

    Name        City  Age  Marks                 dob  sum_Age_Marks  \
0    tom      London   20    0.9 2020-02-24 00:00:00           20.9
1   nick  Manchester   21    0.8 2020-02-24 00:01:00           21.8
2  krish   Liverpool   19    0.7 2020-02-24 00:02:00           19.7
3   jack     Bristol   18    0.6 2020-02-24 00:03:00           18.6

   prod_Age_Marks  min_Age_Marks  max_Age_Marks  std_Age_Marks
0            18.0            0.9           20.0      13.505740
1            16.8            0.8           21.0      14.283557
2            13.3            0.7           19.0      12.940054
3            10.8            0.6           18.0      12.303658
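
The default names in the output above follow a visible pattern: the function name, then the variable names, joined by underscores. The sketch below rebuilds those names in plain Python. It reproduces the pattern seen in this example and is an assumption about the naming rule, not the library's own code.

```python
variables = ["Age", "Marks"]
func = ["sum", "prod", "min", "max", "std"]

# Pattern seen in the output above: <function>_<var1>_<var2>...
default_names = [f"{f}_{'_'.join(variables)}" for f in func]

print(default_names)
```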

For more flexibility, we can pass existing functions to the func argument as follows:

transformer = MathFeatures(
    variables=["Age", "Marks"],
    func = [np.sum, np.prod, np.min, np.max, np.std],
)

df_t = transformer.fit_transform(df)

print(df_t)

And we obtain the following dataframe:

    Name        City  Age  Marks                 dob  sum_Age_Marks  \
0    tom      London   20    0.9 2020-02-24 00:00:00           20.9
1   nick  Manchester   21    0.8 2020-02-24 00:01:00           21.8
2  krish   Liverpool   19    0.7 2020-02-24 00:02:00           19.7
3   jack     Bristol   18    0.6 2020-02-24 00:03:00           18.6

   prod_Age_Marks  amin_Age_Marks  amax_Age_Marks  std_Age_Marks
0            18.0             0.9            20.0      13.505740
1            16.8             0.8            21.0      14.283557
2            13.3             0.7            19.0      12.940054
3            10.8             0.6            18.0      12.303658

We can set the parameter drop_original to True to drop the original variables (here, Age and Marks) after performing the calculations.
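
To illustrate the effect of drop_original=True without relying on the transformer itself, the sketch below reproduces it in plain pandas: the new feature is computed first, then the input columns are removed. This mirrors the behaviour described above under that assumption and is not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["tom", "nick", "krish", "jack"],
    "Age": [20, 21, 19, 18],
    "Marks": [0.9, 0.8, 0.7, 0.6],
})
variables = ["Age", "Marks"]

df_t = df.copy()
df_t["sum_Age_Marks"] = df[variables].sum(axis=1)  # compute the feature first
df_t = df_t.drop(columns=variables)                # then drop the inputs

print(df_t.columns.tolist())
```

Only the variables used in the calculation are removed; columns such as Name are left untouched.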

We can obtain the names of all the features in the transformed data as follows:

transformer.get_feature_names_out(input_features=None)

This returns the names of all the variables in the transformed data:

['Name',
 'City',
 'Age',
 'Marks',
 'dob',
 'sum_Age_Marks',
 'prod_Age_Marks',
 'amin_Age_Marks',
 'amax_Age_Marks',
 'std_Age_Marks']

Or, we can obtain the names of the new variables only:

transformer.get_feature_names_out(input_features=True)

This returns the names of the new features:

['sum_Age_Marks',
 'prod_Age_Marks',
 'amin_Age_Marks',
 'amax_Age_Marks',
 'std_Age_Marks']

New variables names

Even though the transformer can combine variables automatically, it is intended for combining variables based on domain knowledge. In this case, we normally want to give meaningful names to the new variables. We can do so through the parameter new_variables_names.

new_variables_names takes a list of strings with the names of the newly created features. You must enter one name for each function indicated in the func parameter. That is, if you want to compute the mean and the sum of the features, you should enter 2 new variable names. If you compute only the mean, enter 1 variable name.

The order of the names should match the order of the functions in func. That is, if you set func = ['mean', 'prod'], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
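
The positional pairing between functions and names can be sketched with plain pandas. The names mean_vars and prod_vars are hypothetical, and the length check is an assumption about the kind of validation such a pairing needs; it is not taken from the library.

```python
import pandas as pd

df = pd.DataFrame({"Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6]})

variables = ["Age", "Marks"]
func = ["mean", "prod"]
new_variables_names = ["mean_vars", "prod_vars"]  # hypothetical names

# One name per function, otherwise the pairing is ambiguous.
if len(func) != len(new_variables_names):
    raise ValueError("Provide exactly one new name per function.")

# Row-wise aggregation returns one column per function, in the order of func.
result = df[variables].agg(func, axis=1)

# Names are assigned by position: first name to 'mean', second to 'prod'.
result.columns = new_variables_names

print(result)
```

Swapping the two names without swapping the functions would label the product as a mean, which is why the order matters.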

Let’s look at an example. In the following code snippet, we compute the sum, minimum, and maximum of 2 variables, which results in 3 new features. We add the names for the new features in a list:

transformer = MathFeatures(
    variables=["Age", "Marks"],
    func = ["sum", "min", "max"],
    new_variables_names = ["sum_vars", "min_vars", "max_vars"]
)

df_t = transformer.fit_transform(df)

print(df_t)

The resulting dataframe contains the new features under the variable names that we provided:

    Name        City  Age  Marks                 dob  sum_vars  min_vars  \
0    tom      London   20    0.9 2020-02-24 00:00:00      20.9       0.9
1   nick  Manchester   21    0.8 2020-02-24 00:01:00      21.8       0.8
2  krish   Liverpool   19    0.7 2020-02-24 00:02:00      19.7       0.7
3   jack     Bristol   18    0.6 2020-02-24 00:03:00      18.6       0.6

   max_vars
0      20.0
1      21.0
2      19.0
3      18.0

More details

You can find creative ways to use MathFeatures() in the following Jupyter notebooks.

All notebooks can be found in a dedicated repository.