.. _math_features:

.. currentmodule:: feature_engine.creation

MathFeatures
============

:class:`MathFeatures()` applies basic functions to groups of features, returning one or
more additional variables as a result. It uses `pandas.DataFrame.agg()` to create the
features, so in essence, you can pass any function that is accepted by this method. One
exception is that :class:`MathFeatures()` does not accept dictionaries for the parameter
`func`.

The functions can be passed as strings, as NumPy functions, e.g., `np.mean`, or as any
function that you create, as long as it returns a scalar from a vector.

For supported aggregation functions, see the pandas documentation.

As an example, if we have the variables:

- **number_payments_first_quarter**
- **number_payments_second_quarter**
- **number_payments_third_quarter**
- **number_payments_fourth_quarter**

we can use :class:`MathFeatures()` to calculate the total number of payments
and mean number of payments as follows:

.. code-block:: python

    transformer = MathFeatures(
        variables=[
            'number_payments_first_quarter',
            'number_payments_second_quarter',
            'number_payments_third_quarter',
            'number_payments_fourth_quarter'
        ],
        func=['sum', 'mean'],
        new_variables_names=[
            'total_number_payments',
            'mean_number_payments'
        ]
    )

    Xt = transformer.fit_transform(X)

The transformed dataset, Xt, will contain the additional features
**total_number_payments** and **mean_number_payments**, plus the original set of
variables.

The variable **total_number_payments** is obtained by adding up the features
indicated in `variables`, whereas the variable **mean_number_payments** is
the mean of those 4 features.

Examples
--------

Let's look at how to use :class:`MathFeatures()` in more detail. Let's first
create a toy dataset:

.. code:: python

    import numpy as np
    import pandas as pd
    from feature_engine.creation import MathFeatures

    df = pd.DataFrame.from_dict(
        {
            "Name": ["tom", "nick", "krish", "jack"],
            "City": ["London", "Manchester", "Liverpool", "Bristol"],
            "Age": [20, 21, 19, 18],
            "Marks": [0.9, 0.8, 0.7, 0.6],
            # one timestamp per minute ("min" replaces the deprecated alias "T")
            "dob": pd.date_range("2020-02-24", periods=4, freq="min"),
        })

    print(df)

The dataset looks like this:

.. code:: python

        Name        City  Age  Marks                 dob
    0    tom      London   20    0.9 2020-02-24 00:00:00
    1   nick  Manchester   21    0.8 2020-02-24 00:01:00
    2  krish   Liverpool   19    0.7 2020-02-24 00:02:00
    3   jack     Bristol   18    0.6 2020-02-24 00:03:00

We can now apply several functions over the numerical variables Age and Marks using
strings to indicate the functions:

.. code:: python

    transformer = MathFeatures(
        variables=["Age", "Marks"],
        func=["sum", "prod", "min", "max", "std"],
    )

    df_t = transformer.fit_transform(df)

    print(df_t)

And we obtain the following dataset, where the new variables are named after the function
used to obtain them, plus the group of variables that were used in the computation:

.. code:: python

        Name        City  Age  Marks                 dob  sum_Age_Marks  \
    0    tom      London   20    0.9 2020-02-24 00:00:00           20.9
    1   nick  Manchester   21    0.8 2020-02-24 00:01:00           21.8
    2  krish   Liverpool   19    0.7 2020-02-24 00:02:00           19.7
    3   jack     Bristol   18    0.6 2020-02-24 00:03:00           18.6

       prod_Age_Marks  min_Age_Marks  max_Age_Marks  std_Age_Marks
    0            18.0            0.9           20.0      13.505740
    1            16.8            0.8           21.0      14.283557
    2            13.3            0.7           19.0      12.940054
    3            10.8            0.6           18.0      12.303658

For more flexibility, we can pass existing functions to the `func` argument as follows:

.. code:: python

    transformer = MathFeatures(
        variables=["Age", "Marks"],
        func=[np.sum, np.prod, np.min, np.max, np.std],
    )

    df_t = transformer.fit_transform(df)

    print(df_t)

And we obtain the following dataframe:

.. code:: python

        Name        City  Age  Marks                 dob  sum_Age_Marks  \
    0    tom      London   20    0.9 2020-02-24 00:00:00           20.9
    1   nick  Manchester   21    0.8 2020-02-24 00:01:00           21.8
    2  krish   Liverpool   19    0.7 2020-02-24 00:02:00           19.7
    3   jack     Bristol   18    0.6 2020-02-24 00:03:00           18.6

       prod_Age_Marks  amin_Age_Marks  amax_Age_Marks  std_Age_Marks
    0            18.0             0.9            20.0      13.505740
    1            16.8             0.8            21.0      14.283557
    2            13.3             0.7            19.0      12.940054
    3            10.8             0.6            18.0      12.303658

We have the option to set the parameter `drop_original` to True to drop the original
variables after performing the calculations.

We can obtain the names of all the features in the transformed data as follows:

.. code:: python

    transformer.get_feature_names_out(input_features=None)

Which will return the names of all the variables in the transformed data:

.. code:: python

    ['Name',
     'City',
     'Age',
     'Marks',
     'dob',
     'sum_Age_Marks',
     'prod_Age_Marks',
     'amin_Age_Marks',
     'amax_Age_Marks',
     'std_Age_Marks']

New variables names
^^^^^^^^^^^^^^^^^^^

Even though the transformer can combine variables automatically, it is intended for
combining variables based on domain knowledge. In this case, we normally want to
give meaningful names to the new variables. We can do so through the parameter
`new_variables_names`.

`new_variables_names` takes a list of strings with the names for the newly created
features. You must enter one name for each function indicated in the `func` parameter.
That is, if you want to compute the mean and the sum of the features, you should enter
2 new variable names. If you compute only the mean of the features, enter 1 variable name.

The order of the names should match the order of the functions in `func`.
That is, if you set `func = ['mean', 'prod']`, the first new variable name will be
assigned to the mean of the variables and the second variable name to the product of the
variables.

Let's look at an example.
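
As a rough sketch of this pairing rule in plain pandas (an illustration only, not the
transformer's actual code; the names `mean_vars` and `prod_vars` are hypothetical),
each name is matched to the function in the same position:

```python
import pandas as pd

# Toy data with the two numerical variables used on this page.
df = pd.DataFrame({"Age": [20, 21, 19, 18], "Marks": [0.9, 0.8, 0.7, 0.6]})

variables = ["Age", "Marks"]
func = ["mean", "prod"]
new_variables_names = ["mean_vars", "prod_vars"]  # hypothetical names

# One name per function is required for the positional pairing to make sense.
if len(new_variables_names) != len(func):
    raise ValueError("Provide exactly one new variable name per function.")

# Row-wise aggregation: one output column per function, labelled by function name.
result = df[variables].agg(func, axis=1)

# Pair each function with the name in the same position and rename accordingly.
result = result.rename(columns=dict(zip(func, new_variables_names)))

df_t = pd.concat([df, result], axis=1)
print(df_t)
```

Here `mean_vars` holds the row-wise mean and `prod_vars` the row-wise product, because
the names follow the order of `func`.
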
In the following code snippet, we use :class:`MathFeatures()` to add up 2 variables and
find their minimum and maximum values, which results in 3 new features. We pass the names
for the new features in a list:

.. code:: python

    transformer = MathFeatures(
        variables=["Age", "Marks"],
        func=["sum", "min", "max"],
        new_variables_names=["sum_vars", "min_vars", "max_vars"]
    )

    df_t = transformer.fit_transform(df)

    print(df_t)

The resulting dataframe contains the new features under the variable names that we
provided:

.. code:: python

        Name        City  Age  Marks                 dob  sum_vars  min_vars  \
    0    tom      London   20    0.9 2020-02-24 00:00:00      20.9       0.9
    1   nick  Manchester   21    0.8 2020-02-24 00:01:00      21.8       0.8
    2  krish   Liverpool   19    0.7 2020-02-24 00:02:00      19.7       0.7
    3   jack     Bristol   18    0.6 2020-02-24 00:03:00      18.6       0.6

       max_vars
    0      20.0
    1      21.0
    2      19.0
    3      18.0

Additional resources
--------------------

For more details about this and other feature engineering methods check out
these resources:

.. figure::  ../../images/feml.png
   :width: 300
   :figclass: align-center
   :align: left
   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning

   Feature Engineering for Machine Learning

|
|
|
|
|
|
|
|
|
|

Or read our book:

.. figure::  ../../images/cookbook.png
   :width: 200
   :figclass: align-center
   :align: left
   :target: https://packt.link/0ewSo

   Python Feature Engineering Cookbook

|
|
|
|
|
|
|
|
|
|
|
|
|

Both our book and course are suitable for beginners and more advanced data scientists
alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.