make_pipeline#

make_pipeline is a shorthand for Pipeline. While Pipeline requires a list of tuples pairing each step's name with a transformer or estimator, make_pipeline takes the transformers and estimators directly, as a sequence, and assigns the step names automatically.
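For illustration, here is a minimal sketch contrasting the two APIs (assuming Pipeline is imported from feature_engine.pipeline, as described in its documentation):

from feature_engine.imputation import DropMissingData
from feature_engine.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import Lasso

# With Pipeline, we name every step explicitly:
pipe = Pipeline(steps=[
    ("dropmissingdata", DropMissingData()),
    ("lasso", Lasso()),
])

# With make_pipeline, the names are derived automatically
# from the class names, in lowercase:
pipe = make_pipeline(DropMissingData(), Lasso())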

Setting up a Pipeline with make_pipeline#

In the following example, we set up a Pipeline that drops missing data, then replaces categories with ordinal numbers, and finally fits a Lasso regression model.

import numpy as np
import pandas as pd
from feature_engine.imputation import DropMissingData
from feature_engine.encoding import OrdinalEncoder
from feature_engine.pipeline import make_pipeline

from sklearn.linear_model import Lasso

X = pd.DataFrame(
    dict(
        x1=[2, 1, 1, 0, np.nan],
        x2=["a", np.nan, "b", np.nan, "a"],
    )
)
y = pd.Series([1, 2, 3, 4, 5])

pipe = make_pipeline(
    DropMissingData(),
    OrdinalEncoder(encoding_method="arbitrary"),
    Lasso(random_state=10),
)
# fit the pipeline, then predict
pipe.fit(X, y)
preds_pipe = pipe.predict(X)
preds_pipe

In the output we see the predictions made by the pipeline. Note that only 2 predictions are returned: DropMissingData removed the 3 rows that contained missing values, so the model predicts only for the remaining rows:

array([2., 2.])

The names of the pipeline steps were assigned automatically:

print(pipe)
Pipeline(steps=[('dropmissingdata', DropMissingData()),
                ('ordinalencoder', OrdinalEncoder(encoding_method='arbitrary')),
                ('lasso', Lasso(random_state=10))])
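
Because the step names are just the lowercased class names, we can use them to retrieve individual steps. A quick sketch, assuming Feature-engine's pipeline exposes the same named_steps interface as scikit-learn's pipeline:

# retrieve a fitted step by its auto-generated name
encoder = pipe.named_steps["ordinalencoder"]

# the category-to-number mappings learned during fit
print(encoder.encoder_dict_)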

The pipeline returned by make_pipeline has exactly the same characteristics as Pipeline. Hence, for additional guidelines, check out the Pipeline documentation.

Forecasting#

Let’s set up another pipeline to do direct forecasting:

import pandas as pd

from sklearn.linear_model import Lasso
from sklearn.multioutput import MultiOutputRegressor

from feature_engine.timeseries.forecasting import (
    LagFeatures,
    WindowFeatures,
)
from feature_engine.pipeline import make_pipeline

We’ll use the Australia electricity demand dataset described here:

Godahewa, Rakshitha, Bergmeir, Christoph, Webb, Geoff, Hyndman, Rob, & Montero-Manso, Pablo. (2021). Australian Electricity Demand Dataset (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4659727

url = "https://raw.githubusercontent.com/tidyverts/tsibbledata/master/data-raw/vic_elec/VIC2015/demand.csv"
df = pd.read_csv(url)

df.drop(columns=["Industrial"], inplace=True)

# Convert the integer Date to an actual date with datetime type
df["date"] = df["Date"].apply(
    lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="days")
)

# Create a timestamp from the integer Period representing 30 minute intervals
df["date_time"] = df["date"] + \
    pd.to_timedelta((df["Period"] - 1) * 30, unit="m")

df.dropna(inplace=True)

# Select and rename the relevant columns
df = df[["date_time", "OperationalLessIndustrial"]]

df.columns = ["date_time", "demand"]

# Resample to hourly
df = (
    df.set_index("date_time")
    .resample("h")
    .agg({"demand": "sum"})
)

print(df.head())

Here, we see the first rows of data:

                          demand
date_time
2002-01-01 00:00:00  6919.366092
2002-01-01 01:00:00  7165.974188
2002-01-01 02:00:00  6406.542994
2002-01-01 03:00:00  5815.537828
2002-01-01 04:00:00  5497.732922

We’ll predict the next 3 hours of energy demand using direct forecasting, that is, we'll train one model output per step in the forecast horizon. Let’s create the target variable, with one column per hour ahead:

horizon = 3
y = pd.DataFrame(index=df.index)
for h in range(horizon):
    y[f"h_{h}"] = df.shift(periods=-h, freq="h")
y.dropna(inplace=True)
df = df.loc[y.index]
print(y.head())

This is our target variable:

                             h_0          h_1          h_2
date_time
2002-01-01 00:00:00  6919.366092  7165.974188  6406.542994
2002-01-01 01:00:00  7165.974188  6406.542994  5815.537828
2002-01-01 02:00:00  6406.542994  5815.537828  5497.732922
2002-01-01 03:00:00  5815.537828  5497.732922  5385.851060
2002-01-01 04:00:00  5497.732922  5385.851060  5574.731890
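
As a quick sanity check (an optional verification, not part of the original recipe), each target column should match the demand shifted forward by the corresponding number of hours:

# h_1 at time t equals the observed demand at t + 1 hour
assert y.loc["2002-01-01 00:00:00", "h_1"] == df.loc["2002-01-01 01:00:00", "demand"]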

Next, we split the data into a training set and a test set. Note that the test set starts 6 hours before the end of the training set: the first test predictions need those preceding observations to compute the lag and window features:

end_train = '2014-12-31 23:59:59'
X_train = df.loc[:end_train]
y_train = y.loc[:end_train]

begin_test = '2014-12-31 17:59:59'
X_test  = df.loc[begin_test:]
y_test = y.loc[begin_test:]
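
To confirm the date ranges and the overlap, we can run a quick, optional check:

print(f"Train: {X_train.index.min()} to {X_train.index.max()}")
print(f"Test:  {X_test.index.min()} to {X_test.index.max()}")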

Next, we set up LagFeatures and WindowFeatures to create features from lags and windows:

lagf = LagFeatures(
    variables=["demand"],
    periods=[1, 3, 6],
    missing_values="ignore",
    drop_na=True,
)


winf = WindowFeatures(
    variables=["demand"],
    window=["3h"],
    freq="1h",
    functions=["mean"],
    missing_values="ignore",
    drop_original=True,
    drop_na=True,
)
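
To see the features that these transformers create, we can apply them outside the pipeline first (a sketch for exploration; the generated column names follow Feature-engine's conventions, for example demand_lag_1 for the lags and demand_window_3h_mean for the rolling mean, though the exact names may vary by version):

# Preview the engineered features: 1, 3 and 6 hour lags of demand,
# plus a 3-hour rolling mean shifted forward by 1 hour
X_preview = winf.fit_transform(lagf.fit_transform(X_train))
print(X_preview.head())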

We wrap the Lasso regression within MultiOutputRegressor, which fits one independent estimator per target column, in our case, one Lasso per step in the forecast horizon:

lasso = MultiOutputRegressor(Lasso(random_state=0, max_iter=10))

Now, we assemble the pipeline with make_pipeline:

pipe = make_pipeline(lagf, winf, lasso)

print(pipe)

The step names were assigned automatically:

Pipeline(steps=[('lagfeatures',
                 LagFeatures(drop_na=True, missing_values='ignore',
                             periods=[1, 3, 6], variables=['demand'])),
                ('windowfeatures',
                 WindowFeatures(drop_na=True, drop_original=True, freq='1h',
                                functions=['mean'], missing_values='ignore',
                                variables=['demand'], window=['3h'])),
                ('multioutputregressor',
                 MultiOutputRegressor(estimator=Lasso(max_iter=10,
                                                      random_state=0)))])

Let’s fit the Pipeline:

pipe.fit(X_train, y_train)
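
After fitting, we can confirm that one Lasso was trained per forecasting step. estimators_ is scikit-learn's standard attribute for MultiOutputRegressor; accessing the step through named_steps assumes Feature-engine's pipeline mirrors scikit-learn's interface:

# one fitted Lasso per hour in the forecast horizon
fitted_lassos = pipe.named_steps["multioutputregressor"].estimators_
print(len(fitted_lassos))  # 3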

Now, we can make forecasts for the test set:

forecast = pipe.predict(X_test)

forecasts = pd.DataFrame(
    forecast,
    columns=[f"step_{i+1}" for i in range(horizon)],
)

print(forecasts.head())

We see the predicted energy demand for the next 3 hours, made at each hourly timestamp in the test set:

        step_1       step_2       step_3
0  8031.043352  8262.804811  8484.551733
1  7017.158081  7160.568853  7496.282999
2  6587.938171  6806.903940  7212.741943
3  6503.807479  6789.946587  7195.796841
4  6646.981390  6970.501840  7308.359237

To learn more about direct forecasting and how to create features, check out our courses and book:

- Feature Engineering for Time Series Forecasting
- Forecasting with Machine Learning
- Feature Engineering for Machine Learning
- Python Feature Engineering Cookbook
Our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them, you support Sole, the main developer of Feature-engine.