Quick Start

If you’re new to Feature-engine, this guide will get you started. Feature-engine transformers have the methods fit() and transform(): fit() learns parameters from the data, and transform() then modifies the data. They work just like any Scikit-learn transformer.
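
For instance, here is a minimal sketch of that pattern, using the MeanMedianImputer that appears later in this guide (the toy DataFrame is made up for illustration):

import pandas as pd
import feature_engine.missing_data_imputers as mdi

# toy data with one missing value (illustrative only)
df = pd.DataFrame({'age': [20, 30, None, 40]})

# fit() learns the median of 'age'; transform() replaces the NA with it
imputer = mdi.MeanMedianImputer(imputation_method='median', variables=['age'])
imputer.fit(df)
df_t = imputer.transform(df)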

Installation

Feature-engine is a Python 3 package and works well with Python 3.6 or later; earlier versions have not been tested. The simplest way to install Feature-engine is from PyPI with pip, Python’s preferred package installer.

$ pip install feature-engine

Note that Feature-engine is an active project and routinely publishes new releases. In order to upgrade Feature-engine to the latest version, use pip as follows.

$ pip install -U feature-engine

You can also use the -U flag to update Scikit-learn, pandas, NumPy, or any other libraries that work well with Feature-engine to their latest versions.
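
For example, to update pandas and Scikit-learn in one go:

$ pip install -U pandas scikit-learn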

Once installed, you should be able to import Feature-engine without an error, both in Python and inside of Jupyter notebooks.
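
For example (assuming your installed release exposes the __version__ attribute; very old versions may not):

import feature_engine
print(feature_engine.__version__)  # prints the installed version, if available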

Example Use

This is an example of how to use Feature-engine’s transformers to perform missing data imputation.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

import feature_engine.missing_data_imputers as mdi

# Load dataset
data = pd.read_csv('houseprice.csv')

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)

# set up the imputer
median_imputer = mdi.MeanMedianImputer(imputation_method='median',
                                       variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
median_imputer.fit(X_train)

# transform the data
train_t = median_imputer.transform(X_train)
test_t = median_imputer.transform(X_test)

# plot the distribution of LotFrontage before and after imputation
fig = plt.figure()
ax = fig.add_subplot(111)
X_train['LotFrontage'].plot(kind='kde', ax=ax)
train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
lines, labels = ax.get_legend_handles_labels()
ax.legend(lines, labels, loc='best')
[Figure: kernel density estimate of LotFrontage before imputation and after median imputation (red).]
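
As an optional check (not part of the original example), you can confirm that no missing values remain in the imputed variables:

# NA counts should drop to 0 after the transformation
print(X_train[['LotFrontage', 'MasVnrArea']].isnull().sum())
print(train_t[['LotFrontage', 'MasVnrArea']].isnull().sum())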

More examples can be found in the documentation for each transformer and in a dedicated section in the repository with Jupyter notebooks.

Feature-engine with Scikit-learn’s pipeline

Feature-engine’s transformers can be assembled within a Scikit-learn pipeline. This way, we can store our entire feature engineering pipeline in one object and save it as one pickle (.pkl) file. Here is an example of how to do it:

    from math import sqrt
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline as pipe
    from sklearn.preprocessing import MinMaxScaler

    from feature_engine import categorical_encoders as ce
    from feature_engine import discretisers as dsc
    from feature_engine import missing_data_imputers as mdi

    # load dataset
    data = pd.read_csv('houseprice.csv')

    # drop some variables
    data.drop(labels=['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'Id'], axis=1, inplace=True)

    # make a list of categorical variables
    categorical = [var for var in data.columns if data[var].dtype == 'O']

    # make a list of numerical variables
    numerical = [var for var in data.columns if data[var].dtype != 'O']

    # make a list of discrete variables (numerical, but with fewer than 20 distinct values)
    discrete = [var for var in numerical if len(data[var].unique()) < 20]

    # categorical encoders work only with object type variables
    # to treat numerical variables as categorical, we need to re-cast them
    data[discrete] = data[discrete].astype('O')

    # continuous variables
    numerical = [
        var for var in numerical
        if var not in discrete and var not in ['Id', 'SalePrice']
    ]

    # separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(data.drop(labels=['SalePrice'], axis=1),
                                                        data.SalePrice,
                                                        test_size=0.1,
                                                        random_state=0)

    # set up the pipeline
    price_pipe = pipe([
        # add a binary variable to indicate missing information for LotFrontage
        ('continuous_var_imputer', mdi.AddMissingIndicator(variables=['LotFrontage'])),

        # replace NA with the median in the 2 numerical variables below
        ('continuous_var_median_imputer', mdi.MeanMedianImputer(
            imputation_method='median', variables=['LotFrontage', 'MasVnrArea'])),

        # replace NA with the label "Missing" in categorical variables
        ('categorical_imputer', mdi.CategoricalVariableImputer(variables=categorical)),

        # discretise continuous variables using trees
        ('numerical_tree_discretiser', dsc.DecisionTreeDiscretiser(
            cv=3, scoring='neg_mean_squared_error', variables=numerical, regression=True)),

        # group rare labels in categorical and discrete variables
        ('rare_label_encoder', ce.RareLabelCategoricalEncoder(
            tol=0.03, n_categories=1, variables=categorical + discrete)),

        # encode categorical and discrete variables using the target mean
        ('categorical_encoder', ce.MeanCategoricalEncoder(variables=categorical + discrete)),

        # scale features
        ('scaler', MinMaxScaler()),

        # Lasso
        ('lasso', Lasso(random_state=2909, alpha=0.005)),
    ])

    # train the feature engineering transformers and the Lasso on the log-transformed target
    price_pipe.fit(X_train, np.log(y_train))

    # predict
    pred_train = price_pipe.predict(X_train)
    pred_test = price_pipe.predict(X_test)

    # evaluate: the model was trained on log(price), so exponentiate the predictions before scoring
    print('Lasso Linear Model train mse: {}'.format(mean_squared_error(y_train, np.exp(pred_train))))
    print('Lasso Linear Model train rmse: {}'.format(sqrt(mean_squared_error(y_train, np.exp(pred_train)))))
    print()
    print('Lasso Linear Model test mse: {}'.format(mean_squared_error(y_test, np.exp(pred_test))))
    print('Lasso Linear Model test rmse: {}'.format(sqrt(mean_squared_error(y_test, np.exp(pred_test)))))
Lasso Linear Model train mse: 949189263.8948538
Lasso Linear Model train rmse: 30808.9153313591

Lasso Linear Model test mse: 1344649485.0641894
Lasso Linear Model test rmse: 36669.46256852136

# plot predictions vs. the true price
plt.scatter(y_test, np.exp(pred_test))
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
plt.show()
[Figure: scatter plot of the true house prices vs. the predicted prices on the test set.]
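
Because the whole feature engineering chain and the Lasso live in one Pipeline object, the fitted pipeline can be saved as a single file, as mentioned above. A minimal sketch using joblib (the file name price_pipe.pkl is arbitrary):

import joblib

# persist the fitted pipeline to one file
joblib.dump(price_pipe, 'price_pipe.pkl')

# later: reload it and predict on new data
price_pipe = joblib.load('price_pipe.pkl')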

More examples can be found in the documentation for each transformer and in a dedicated section in the repository with Jupyter notebooks.

Dataset attribution

The user guide and examples included in Feature-engine’s documentation are based on these 3 datasets:

Titanic dataset

We use the version of the dataset available in OpenML, which can be downloaded from here.

Ames House Prices dataset

We use the data set created by Professor Dean De Cock:

Dean De Cock (2011). Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project. Journal of Statistics Education, Vol. 19, No. 3.

The examples are based on a copy of the dataset available on Kaggle.

The original data and documentation can be found here.

Credit Approval dataset

We use the Credit Approval dataset from the UCI Machine Learning Repository:

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.

To download the dataset visit this website and click on “crx.data” to download the data set.

To prepare the data for the examples:

import pandas as pd
import numpy as np

# load data
data = pd.read_csv('crx.data', header=None)

# create variable names according to UCI Machine Learning information
varnames = ['A' + str(s) for s in range(1, 17)]
data.columns = varnames

# replace ? by np.nan
data = data.replace('?', np.nan)

# re-cast some variables to the correct types
data['A2'] = data['A2'].astype('float')
data['A14'] = data['A14'].astype('float')

# encode target to binary
data['A16'] = data['A16'].map({'+':1, '-':0})

# save the data
data.to_csv('creditApprovalUCI.csv', index=False)
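
As an optional sanity check (not part of the original recipe), reload the file and confirm the re-cast types and the binary target:

# reload the prepared data and inspect it
data = pd.read_csv('creditApprovalUCI.csv')
print(data[['A2', 'A14']].dtypes)   # both should now be float64
print(data['A16'].unique())         # the target should contain only 1 and 0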